monitord ... know how happy your systemd is! 😊




Requirements

What does monitord monitor?

monitord collects systemd health metrics via D-Bus (and optionally Varlink) and outputs them as JSON. It provides visibility into:

Run Modes

We offer the following run modes:

We're open to more formats / run methods ... Open an issue to discuss. Feasibility mostly depends on the dependencies involved.

monitord is a config-driven binary. We plan to keep CLI arguments to a minimum.

INFO level logging is enabled to stderr by default. Use -l LEVEL to increase or decrease logging.

Quick Start

  1. Install monitord:

     ```bash
     cargo install monitord
     ```

  2. Create a minimal config at `/etc/monitord.conf`:

     ```ini
     [monitord]
     output_format = json-pretty

     [units]
     enabled = true

     [pid1]
     enabled = true
     ```

  3. Run it:

     ```bash
     monitord
     ```

This will collect unit counts and PID 1 stats, then print JSON to stdout and exit. Enable additional collectors in the config as needed (see Configuration below).

Install

Pre-built binaries

Download pre-built binaries from GitHub Releases:

# Example: download and install the latest release (x86_64)
curl -L -o /usr/local/bin/monitord \
  https://github.com/cooperlees/monitord/releases/latest/download/monitord-linux-amd64
chmod +x /usr/local/bin/monitord

From crates.io

Install via cargo or use as a dependency in your Cargo.toml.
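For library use, a minimal sketch of a Cargo.toml dependency entry (the version shown is illustrative — check crates.io for the latest release):

```toml
[dependencies]
# Illustrative version — pin to the release you actually want
monitord = "0.11"
```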

crl-linux:monitord cooper$ monitord --help
monitord: Know how happy your systemd is! 😊

Usage: monitord [OPTIONS]

Options:
  -c, --config <CONFIG>
          Location of your monitord config

          [default: /etc/monitord.conf]

  -l, --log-level <LOG_LEVEL>
          Adjust the console log-level

          [default: Info]
          [possible values: error, warn, info, debug, trace]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Config

monitord lets you choose which components are monitored. To enable or disable them, set the following in your monitord.conf. This file is INI format to match systemd unit files.

# Pure ini - no yes/no for bools

[monitord]
# Set a custom dbus address to connect to
# OPTIONAL: If not set, we default to the Unix socket below
dbus_address = unix:path=/run/dbus/system_bus_socket
# Timeout in seconds for dbus connection/collections
# OPTIONAL: default is 30 seconds
dbus_timeout = 30
# Run as a daemon or 1 time
daemon = false
# Time to refresh systemd stats in seconds
# Daemon mode only
daemon_stats_refresh_secs = 60
# Prefix json-flat keys with this value
# A '.' is automatically appended (so don't include one here)
key_prefix = monitord
# cron/systemd timer output format
# Supported: json, json-flat, json-pretty
output_format = json

# Grab as many stats as we can from the running dbus daemon
# via the DBus GetStats call
# Best tested with the dbus-broker daemon
[dbus]
# Summary counters - both dbus-broker + dbus-daemon
enabled = false
# dbus.user.* metrics: user stats as reported by dbus-broker
user_stats = false
# dbus.peer.* metrics: peer stats as reported by dbus-broker
peer_stats = false
# dbus.cgroup.* stats is an aggregation of peer_stats by cgroup
# by dbus-broker
cgroup_stats = false

# Grab networkd stats from files + networkctl
[networkd]
enabled = true
link_state_dir = /run/systemd/netif/links

# Enable grabbing PID 1 stats via procfs
[pid1]
enabled = true

# Services to grab extra stats for
# .service is important as that's what DBus returns from `list_units`
[services]
foo.service

[timers]
enabled = true

[timers.allowlist]
foo.timer

[timers.blocklist]
bar.timer

# Grab unit status counts via dbus
[units]
enabled = true
state_stats = true

# Filter which services you want to collect state stats for
# If both lists are configured, the blocklist is preferred
# If neither exists, all units' states will generate counters
[units.state_stats.allowlist]
foo.service

[units.state_stats.blocklist]
bar.service

# machines config
[machines]
enabled = true

# Same rules apply as state_stats lists above
[machines.allowlist]
foo

[machines.blocklist]
bar

# Boot blame metrics - shows the N slowest units at boot
# Similar to `systemd-analyze blame`
# Disabled by default
[boot]
enabled = false
# Cache boot blame stats in /run/monitord/<boot_id>.boot_blame.bin
# Enabled by default; set false to force recalculation every run
cache_enabled = true
# Number of slowest units to report
num_slowest_units = 5

# Optional: only include specific units in boot blame (if empty, all units are checked)
# Same rules apply as state_stats lists above
[boot.allowlist]
# slow-startup.service

# Optional: exclude specific units from boot blame
[boot.blocklist]
# noisy-but-expected.service

# Unit verification using systemd-analyze verify
# Disabled by default as it can be slow on large systems
[verify]
enabled = false

# Optional: only verify specific units (if empty, all units are checked)
[verify.allowlist]
# example.service
# example.timer

# Optional: skip verification for specific units
[verify.blocklist]
# noisy.service
# broken.timer

When using the provided monitord.service, systemd creates /run/monitord via RuntimeDirectory=monitord and assigns ownership to the configured service User/Group. If you run monitord another way, ensure /run/monitord exists and is writable by the monitord process user so boot cache files can be created.

Machines support

From version >= 0.11, monitord supports obtaining the same set of keys from systemd 'machines' (i.e. `machinectl list`).

The keys use the same format as the json-flat output below, but are prefixed with the machine keyword and the machine name. For example:

# $KEY_PREFIX.machine.$MACHINE_NAME
{
  ...
  "monitord.machine.foo.pid1.fd_count": 69,
  ...
}

Output Formats

json

Standard serde_json non-pretty JSON, all on one line. This is the most compact format.

json-flat

Moves all key/value pairs to the top level, using `.` notation to join components and sub-values. The output is semi-pretty and custom to monitord, and fully unit tested.

stat_collection_run_time_ms is emitted in milliseconds (with _ms suffix) to follow Prometheus metric naming conventions for duration units, which keeps unit semantics clear and consistent when these keys are transformed into Prometheus metric names.

{
  "boot.blame.dnf5-automatic.service": 204.159,
  "boot.blame.cpe_chef.service": 103.05,
  "boot.blame.sys-module-fuse.device": 16.21,
  "boot.blame.dev-ttyS0.device": 15.809,
  "boot.blame.systemd-networkd-wait-online.service": 1.674,
  "collection_timings.list_units_ms": 5.26,
  "collection_timings.per_unit_loop_ms": 42.99,
  "collection_timings.service_dbus_fetches": 0,
  "collection_timings.state_dbus_fetches": 0,
  "collection_timings.timer_dbus_fetches": 24,
  "collector_timings.boot_blame.elapsed_ms": 53.36,
  "collector_timings.boot_blame.start_offset_ms": 0.08,
  "collector_timings.boot_blame.success": 1,
  "collector_timings.units.elapsed_ms": 53.24,
  "collector_timings.units.start_offset_ms": 0.06,
  "collector_timings.units.success": 1,
  "dbus.active_connections": 10,
  "dbus.bus_names": 16,
  "dbus.incomplete_connections": 0,
  "dbus.match_rules": 26,
  "dbus.peak_bus_names": 33,
  "dbus.peak_bus_names_per_connection": 2,
  "dbus.peak_match_rules": 33,
  "dbus.peak_match_rules_per_connection": 13,
  "dbus.cgroup.system.slice-systemd-logind.service.activation_request_bytes": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.activation_request_fds": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.incoming_bytes": 16,
  "dbus.cgroup.system.slice-systemd-logind.service.incoming_fds": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.match_bytes": 6942,
  "dbus.cgroup.system.slice-systemd-logind.service.matches": 5,
  "dbus.cgroup.system.slice-systemd-logind.service.name_objects": 1,
  "dbus.cgroup.system.slice-systemd-logind.service.outgoing_bytes": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.outgoing_fds": 0,
  "dbus.cgroup.system.slice-systemd-logind.service.reply_objects": 0,
  "dbus.peer.org.freedesktop.systemd1.activation_request_bytes": 0,
  "dbus.peer.org.freedesktop.systemd1.activation_request_fds": 0,
  "dbus.peer.org.freedesktop.systemd1.incoming_bytes": 16,
  "dbus.peer.org.freedesktop.systemd1.incoming_fds": 0,
  "dbus.peer.org.freedesktop.systemd1.match_bytes": 46533,
  "dbus.peer.org.freedesktop.systemd1.matches": 33,
  "dbus.peer.org.freedesktop.systemd1.name_objects": 1,
  "dbus.peer.org.freedesktop.systemd1.outgoing_bytes": 0,
  "dbus.peer.org.freedesktop.systemd1.outgoing_fds": 0,
  "dbus.peer.org.freedesktop.systemd1.reply_objects": 0,
  "dbus.user.cooper.bytes": 919236,
  "dbus.user.cooper.fds": 78,
  "dbus.user.cooper.matches": 510,
  "dbus.user.cooper.objects": 80,
  "networkd.eno4.address_state": 3,
  "networkd.eno4.admin_state": 4,
  "networkd.eno4.carrier_state": 5,
  "networkd.eno4.ipv4_address_state": 3,
  "networkd.eno4.ipv6_address_state": 2,
  "networkd.eno4.oper_state": 9,
  "networkd.eno4.required_for_online": 1,
  "networkd.managed_interfaces": 2,
  "networkd.wg0.address_state": 3,
  "networkd.wg0.admin_state": 4,
  "networkd.wg0.carrier_state": 5,
  "networkd.wg0.ipv4_address_state": 3,
  "networkd.wg0.ipv6_address_state": 3,
  "networkd.wg0.oper_state": 9,
  "networkd.wg0.required_for_online": 1,
  "pid1.cpu_time_kernel": 48,
  "pid1.cpu_user_kernel": 41,
  "pid1.fd_count": 245,
  "pid1.memory_usage_bytes": 19165184,
  "pid1.tasks": 1,
  "services.chronyd.service.active_enter_timestamp": 1683556542382710,
  "services.chronyd.service.active_exit_timestamp": 0,
  "services.chronyd.service.cpuusage_nsec": 328951000,
  "services.chronyd.service.inactive_exit_timestamp": 1683556541360626,
  "services.chronyd.service.ioread_bytes": 18446744073709551615,
  "services.chronyd.service.ioread_operations": 18446744073709551615,
  "services.chronyd.service.memory_available": 18446744073709551615,
  "services.chronyd.service.memory_current": 5214208,
  "services.chronyd.service.nrestarts": 0,
  "services.chronyd.service.restart_usec": 100000,
  "services.chronyd.service.state_change_timestamp": 1683556542382710,
  "services.chronyd.service.status_errno": 0,
  "services.chronyd.service.tasks_current": 1,
  "services.chronyd.service.timeout_clean_usec": 18446744073709551615,
  "services.chronyd.service.watchdog_usec": 0,
  "stat_collection_run_time_ms": 87.4013,
  "system-state": 3,
  "timers.fstrim.timer.accuracy_usec": 3600000000,
  "timers.fstrim.timer.fixed_random_delay": 0,
  "timers.fstrim.timer.last_trigger_usec": 1743397269608978,
  "timers.fstrim.timer.last_trigger_usec_monotonic": 0,
  "timers.fstrim.timer.next_elapse_usec_monotonic": 0,
  "timers.fstrim.timer.next_elapse_usec_realtime": 1744007133996149,
  "timers.fstrim.timer.persistent": 1,
  "timers.fstrim.timer.randomized_delay_usec": 6000000000,
  "timers.fstrim.timer.remain_after_elapse": 1,
  "timers.fstrim.timer.service_unit_last_state_change_usec": 1743517244700135,
  "timers.fstrim.timer.service_unit_last_state_change_usec_monotonic": 639312703,
  "unit_states.chronyd.service.active_state": 1,
  "unit_states.chronyd.service.loaded_state": 1,
  "unit_states.chronyd.service.unhealthy": 0,
  "units.activating_units": 0,
  "units.active_units": 403,
  "units.automount_units": 1,
  "units.device_units": 150,
  "units.failed_units": 0,
  "units.inactive_units": 159,
  "units.jobs_queued": 0,
  "units.loaded_units": 497,
  "units.masked_units": 25,
  "units.mount_units": 52,
  "units.not_found_units": 38,
  "units.path_units": 4,
  "units.scope_units": 17,
  "units.service_units": 199,
  "units.slice_units": 7,
  "units.socket_units": 28,
  "units.target_units": 54,
  "units.timer_units": 20,
  "units.total_units": 562,
  "verify.failing.device": 43,
  "verify.failing.mount": 15,
  "verify.failing.service": 31,
  "verify.failing.slice": 1,
  "verify.failing.total": 97,
  "version": "255.7-1.fc40"
}
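The key flattening that json-flat performs can be sketched as follows (an illustrative Python sketch; monitord's real implementation is in Rust):

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dotted top-level keys."""
    flat = {}
    for key, value in obj.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat

nested = {"pid1": {"fd_count": 245, "tasks": 1}, "units": {"active_units": 403}}
print(json.dumps(flatten(nested), sort_keys=True))
# {"pid1.fd_count": 245, "pid1.tasks": 1, "units.active_units": 403}
```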

json-pretty

Standard serde_json pretty-printed representation of each component's structs.

Per-collector timing metrics

monitord records the wall time each collector future spends inside a single stat_collector cycle and exposes the result on MonitordStats::collector_timings, plus an inner phase breakdown for the units collector (SystemdUnitStats::collection_timings).

| Field | Meaning |
| --- | --- |
| `collector_timings.<name>.start_offset_ms` | ms from the top of the cycle until the spawned future was first polled. Should be sub-ms when collectors are running in parallel; a non-trivial value means the spawn loop or runtime is delaying the first poll. |
| `collector_timings.<name>.elapsed_ms` | ms from first poll to completion for that collector. |
| `collector_timings.<name>.success` | 1 if the collector returned `Ok`, 0 otherwise. |
| `collection_timings.list_units_ms` | ms for the systemd `ListUnits` D-Bus call (one batched call). |
| `collection_timings.per_unit_loop_ms` | ms spent walking each listed unit, including any per-unit D-Bus calls (timer/state/service). |
| `collection_timings.timer_dbus_fetches` | Count of timer D-Bus property fetches this run. |
| `collection_timings.state_dbus_fetches` | Count of unit-state D-Bus fetches (only when `state_stats_time_in_state` is enabled). |
| `collection_timings.service_dbus_fetches` | Count of per-service D-Bus property fetches. |

Comparing sum(collector_timings.*.elapsed_ms) against stat_collection_run_time_ms gives an effective parallelism ratio (sum / wall ≈ N means N-way parallelism, ≈ 1 means effectively serial).
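For example, using the sample json-flat output above (numbers are illustrative):

```python
# Per-collector elapsed times from the json-flat sample above (ms)
elapsed = {"boot_blame": 53.36, "units": 53.24}
wall_ms = 87.4013  # stat_collection_run_time_ms

# sum / wall ≈ N means N-way parallelism; ≈ 1 means effectively serial
ratio = sum(elapsed.values()) / wall_ms
print(f"effective parallelism: {ratio:.2f}x")  # ~1.22x: partially parallel
```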

The per-collector lines are also emitted to logs at debug! level. The end-of-cycle "stat collection run took {}ms" summary stays at info!.

Varlink-vs-D-Bus parity

collection_timings is populated identically by the D-Bus path (units::parse_unit_state) and the varlink path (varlink_units::parse_metrics). In the varlink case, list_units_ms is the bulk varlink List call on io.systemd.Manager and per_unit_loop_ms is the local parse loop; the *_dbus_fetches counters stay at zero, which is itself a useful signal that the varlink path is not paying per-unit D-Bus cost. This makes varlink.enabled = true vs false directly comparable on the same host.

Convention for new collectors moved to varlink: when porting a collector from D-Bus to varlink, add the equivalent inner timings so the two implementations remain comparable. The minimum is wall time of the bulk fetch (analogous to list_units_ms) and the local parse loop (analogous to per_unit_loop_ms), recorded onto a struct nested inside the collector's public stats type. Single-shot varlink calls (e.g. networkd Describe) do not need an inner split — the outer collector_timings.<name>.elapsed_ms already covers them.

Metric Value Reference

Many metrics are serialized as integers. Here are the enum mappings:

system-state

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | initializing |
| 2 | starting |
| 3 | running |
| 4 | degraded |
| 5 | maintenance |
| 6 | stopping |
| 7 | offline |

active_state (unit_states.*.active_state)

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | active |
| 2 | reloading |
| 3 | inactive |
| 4 | failed |
| 5 | activating |
| 6 | deactivating |

loaded_state (unit_states.*.loaded_state)

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | loaded |
| 2 | error |
| 3 | masked |
| 4 | not-found |

networkd address_state / ipv4_address_state / ipv6_address_state

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | off |
| 2 | degraded |
| 3 | routable |

networkd admin_state

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | pending |
| 2 | failed |
| 3 | configuring |
| 4 | configured |
| 5 | unmanaged |
| 6 | linger |

networkd carrier_state

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | off |
| 2 | no-carrier |
| 3 | dormant |
| 4 | degraded-carrier |
| 5 | carrier |
| 6 | enslaved |

networkd oper_state

| Value | State |
| --- | --- |
| 0 | unknown |
| 1 | missing |
| 2 | off |
| 3 | no-carrier |
| 4 | dormant |
| 5 | degraded-carrier |
| 6 | carrier |
| 7 | degraded |
| 8 | enslaved |
| 9 | routable |
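Consumers can map these integers back to names. A minimal sketch for two of the tables above (the dict names are ours, not part of monitord):

```python
# Mappings copied from the tables above
SYSTEM_STATE = {0: "unknown", 1: "initializing", 2: "starting", 3: "running",
                4: "degraded", 5: "maintenance", 6: "stopping", 7: "offline"}
ACTIVE_STATE = {0: "unknown", 1: "active", 2: "reloading", 3: "inactive",
                4: "failed", 5: "activating", 6: "deactivating"}

# Decode values from a json-flat sample
metrics = {"system-state": 3, "unit_states.chronyd.service.active_state": 1}
print(SYSTEM_STATE[metrics["system-state"]])  # running
print(ACTIVE_STATE[metrics["unit_states.chronyd.service.active_state"]])  # active
```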

dbus stats

You're going to need to be root, or to grant permission, to pull dbus stats. For dbus-broker, here is an example config allowing a user monitord to query GetStats:

[cooper@l33t ~]# cat /etc/dbus-1/system.d/allow_monitord_stats.conf
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE busconfig PUBLIC
 "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN"
 "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <policy user="monitord">
    <allow send_destination="org.freedesktop.DBus"
           send_interface="org.freedesktop.DBus.Debug.Stats"
           send_member="GetStats"
           send_path="/org/freedesktop/DBus"
           send_type="method_call"/>
  </policy>
</busconfig>

Development

To do test runs (requires systemd and systemd-networkd installed)

Ensure the following pass before submitting a PR (CI checks):

Releasing a new version

  1. Increment the version in Cargo.toml
  2. Run ./build_docs.sh to regenerate docs
  3. Commit with message: Move to version X.Y.Z for release + update docs
  4. If you have commit bit, push directly to main. Otherwise, push a branch and open a PR.
  5. Cut a GitHub release: gh release create X.Y.Z --title "X.Y.Z" --generate-notes

Generate codegen APIs

Then add the following crate attributes to tell clippy to go away:

#![allow(warnings)]
#![allow(clippy)]

Non Linux development

Sometimes I develop from my macOS laptop, so I've documented how I build a Fedora Rawhide container and mount the local repo at /repo inside it to run and test monitord.

You can then log into the container to build, run tests, and run the binary against a real systemd.

Troubleshooting

"Connection refused" or D-Bus connection errors

Ensure the system D-Bus daemon is running and the socket exists at /run/dbus/system_bus_socket. If using a custom address, set dbus_address in [monitord] config. Increase dbus_timeout if running on slow systems.

Empty or missing networkd metrics

systemd-networkd must be installed and running (systemctl start systemd-networkd). If networkd is not in use on your system, disable the collector with enabled = false in [networkd].

Permission denied for D-Bus stats

The [dbus] collector requires permission to call org.freedesktop.DBus.Debug.Stats.GetStats. Either run monitord as root or add a D-Bus policy file — see the dbus stats section.

PID 1 stats unavailable

PID 1 stats require Linux with procfs mounted at /proc. This collector is compiled out on non-Linux targets. If /proc is not available (some container runtimes), disable with enabled = false in [pid1].

Collector errors don't crash monitord

When an individual collector fails (e.g., networkd not running, D-Bus timeout), monitord logs a warning and continues with the remaining collectors. Check stderr output or increase the log level (-l debug) to see which collectors had issues.

Large u64 values (18446744073709551615) in output

These represent u64::MAX and mean "not available" or "not tracked" for that metric. This is how systemd reports fields that are unsupported or not configured for the unit (e.g., memory_available when MemoryMax= is not set).
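When consuming json-flat output, you may want to drop these sentinel values before exporting metrics. A minimal sketch (helper name is ours):

```python
U64_MAX = 2**64 - 1  # 18446744073709551615: systemd's "not available" sentinel

def drop_unavailable(flat_metrics):
    """Remove metrics whose value is u64::MAX ("not tracked")."""
    return {k: v for k, v in flat_metrics.items() if v != U64_MAX}

sample = {
    "services.chronyd.service.memory_current": 5214208,
    "services.chronyd.service.memory_available": 18446744073709551615,
}
print(drop_unavailable(sample))
# {'services.chronyd.service.memory_current': 5214208}
```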

Library API

monitord can be used as a Rust library. See the full API documentation at monitord.xyz.

DBus

All of monitord's D-Bus communication is done via the async (tokio-based) zbus crate.

systemd D-Bus APIs are used in the following modules:

Some of these modules can be disabled via configuration, so monitord may not make all of these D-Bus calls on every run.

Varlink

monitord supports collecting unit statistics via systemd's Varlink metrics API, available in systemd v260+. When enabled, monitord connects to the io.systemd.Metrics interface at /run/systemd/report/io.systemd.Manager to collect unit counts, active/load states, and restart counts.

Enabling Varlink

Set enabled = true in the [varlink] section of monitord.conf:

[varlink]
enabled = true

When varlink is enabled, monitord will attempt to collect stats via the varlink APIs first, automatically falling back to D-Bus or file-based collection when a varlink socket is unavailable (e.g., older systemd versions).

Metrics collected via Varlink

Units (io.systemd.Metrics — systemd v260+):

- Unit counts by type (service, mount, socket, target, device, automount, timer, path, slice, scope)
- Unit counts by state (active, failed, inactive)
- Per-unit active state and load state (with allowlist/blocklist filtering)
- Per-unit health status (computed from active + load state)
- Per-service restart counts (nrestarts)
- Falls back to D-Bus collection if the socket is unavailable

Networkd interfaces (io.systemd.Network.Describe — systemd v257+):

- Per-interface operational, carrier, admin, and address states
- Falls back to parsing /run/systemd/netif/links state files if the socket is unavailable

Containers

For systemd-nspawn containers, monitord connects to the container's varlink socket via /proc/<leader_pid>/root/run/systemd/report/io.systemd.Manager, similar to how D-Bus uses the container-scoped bus socket. Networkd stats use /proc/<leader_pid>/root/run/systemd/netif/io.systemd.Network, with the same file-based fallback.
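The container socket path construction can be sketched as follows (paths taken from the text above; the helper name is ours):

```python
def container_varlink_path(leader_pid, socket="io.systemd.Manager"):
    """Build the varlink socket path inside an nspawn container's namespace,
    reached through the container leader's /proc root."""
    return f"/proc/{leader_pid}/root/run/systemd/report/{socket}"

print(container_varlink_path(1234))
# /proc/1234/root/run/systemd/report/io.systemd.Manager
```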

varlink 101

Varlink might one day replace our D-Bus usage. Here are some notes on working with systemd varlink, as there isn't much documentation outside the man pages.

Checking interfaces

Here is an example with networkd's interfaces:

varlinkctl info unix:/run/systemd/netif/io.systemd.Network
varlinkctl introspect unix:/run/systemd/netif/io.systemd.Network io.systemd.Network

cooper@au:~$ varlinkctl call unix:/run/systemd/netif/io.systemd.Network io.systemd.Network.GetStates '{}' -j | jq
{
  "AddressState": "routable",
  "IPv4AddressState": "routable",
  "IPv6AddressState": "routable",
  "CarrierState": "carrier",
  "OnlineState": "online",
  "OperationalState": "routable"
}
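Under the hood, varlink is just NUL-terminated JSON messages over a unix socket, so a call like the one above can be framed without any varlink library. A minimal sketch of the encoding (not monitord's code):

```python
import json

def encode_varlink_call(method, parameters=None):
    """Encode a varlink method call: one JSON object terminated by a NUL byte."""
    msg = {"method": method, "parameters": parameters or {}}
    return json.dumps(msg).encode() + b"\0"

wire = encode_varlink_call("io.systemd.Network.GetStates")
print(wire)
```

Writing these bytes to `unix:/run/systemd/netif/io.systemd.Network` and reading back until a NUL byte yields the JSON reply shown above.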