check-journald

Icinga/Nagios check script for systemd journal entries

check_journald.sh

License: MIT
Built by NEMESTER

Nagios / Icinga plugin to count error-level log entries in systemd journald over a configurable time window.


Description

check_journald.sh queries journalctl for log entries at or above a configurable severity level within a configurable lookback window and alerts based on count thresholds.

It evaluates:

  • Error count — number of log lines at or above the configured priority level
  • Unit scope — optionally restricted to one or more specific systemd units
  • Exclusion filter — optional regex to suppress known-noisy patterns
  • Context lines — recent matching log lines are included in the plugin output for fast triage
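The exclusion and counting steps listed above can be sketched on plain text input. This is a minimal illustration under stated assumptions, not the script's actual code: the function names filter_entries and summarize are hypothetical, and the sample lines stand in for journalctl output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from check_journald.sh itself.

# Drop lines matching an extended regex; pass everything through if the regex is empty.
filter_entries() {
    local exclude="$1"
    if [ -n "$exclude" ]; then
        grep -Ev -- "$exclude" || true   # grep exits 1 when no line survives; not an error here
    else
        cat
    fi
}

# Print the number of entries on the first line, then the last N entries.
summarize() {
    local show_lines="$1" entries count
    entries="$(cat)"
    if [ -z "$entries" ]; then
        count=0
    else
        count="$(printf '%s\n' "$entries" | wc -l | tr -d ' ')"
    fi
    echo "$count"
    if [ "$count" -gt 0 ] && [ "$show_lines" -gt 0 ]; then
        printf '%s\n' "$entries" | tail -n "$show_lines"
    fi
}

# Sample input standing in for journalctl output.
sample='sshd[2341]: error: Could not load host key
postfix[9812]: error: Connection reset by peer
nginx[441]: [error] connect() failed'

printf '%s\n' "$sample" | filter_entries 'Connection reset by peer|Timeout' | summarize 1
```

With the sample input, the postfix line is excluded, so the sketch reports a count of 2 followed by the most recent remaining line.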

Requirements

  • Linux system with systemd / journald
  • journalctl available in PATH
  • The monitoring user must have read access to the journal:
    usermod -aG systemd-journal nagios
  • bash, grep
  • Nagios / Icinga (or compatible monitoring system)

Installation

  1. Copy the script to your plugin directory:

    /usr/lib/nagios/plugins/check_journald.sh
  2. Set executable permissions:

    chmod 755 /usr/lib/nagios/plugins/check_journald.sh
  3. Grant journal read access to the monitoring user:

    usermod -aG systemd-journal nagios

Usage

check_journald.sh [--since DURATION] [--unit UNIT] [--priority LEVEL]
                  [--warn COUNT] [--crit COUNT] [--show-lines N]
                  [--exclude REGEX]

Parameters

Parameter     Description
--since       Journald lookback window (default: 1h). Accepts a relative time span understood by journalctl, e.g. 30m, 2h, 1d
--unit        Filter to a specific systemd unit. Can be repeated for multiple units
--priority    Minimum log priority to count (default: err). All entries at this level and more severe are counted
--warn        Warning threshold for matching entry count (default: 1)
--crit        Critical threshold for matching entry count (default: 10)
--show-lines  Number of recent matching lines to append to output for triage (default: 3, set to 0 to disable)
--exclude     Extended regex (grep -E) to exclude matching lines from the count

Default Values

Parameter   Default
since       1h
priority    err
warn        1
crit        10
show-lines  3

Priority Levels

Journald uses syslog-compatible priority levels. The --priority flag sets the minimum severity to count — all entries at that level and above (more severe) are included.

Level  Name     Typical use
0      emerg    System is unusable
1      alert    Immediate action required
2      crit     Critical condition
3      err      Error condition (default)
4      warning  Warning condition
5      notice   Normal but significant
6      info     Informational
7      debug    Debug messages
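Because lower numbers mean higher severity, "at that level and above" translates to a numeric less-than-or-equal comparison. The following sketch illustrates this using the level table above; priority_to_level and is_counted are hypothetical helper names, not functions from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a priority name or number to its numeric syslog level.
priority_to_level() {
    case "$1" in
        0|emerg)   echo 0 ;;
        1|alert)   echo 1 ;;
        2|crit)    echo 2 ;;
        3|err)     echo 3 ;;
        4|warning) echo 4 ;;
        5|notice)  echo 5 ;;
        6|info)    echo 6 ;;
        7|debug)   echo 7 ;;
        *)         return 1 ;;   # unknown priority
    esac
}

# An entry is counted when its level is numerically <= the configured minimum.
# Usage: is_counted ENTRY_PRIORITY CONFIGURED_PRIORITY
is_counted() {
    local entry_level min_level
    entry_level="$(priority_to_level "$1")" || return 2
    min_level="$(priority_to_level "$2")" || return 2
    [ "$entry_level" -le "$min_level" ]
}

is_counted crit err && echo "crit is counted with --priority err"
is_counted warning err || echo "warning is not counted with --priority err"
```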

Threshold Semantics

Metric           Behavior
Log entry count  higher is worse

Example with --warn 1 --crit 10:

  • 0 entries → OK
  • 1–9 entries → WARNING
  • ≥ 10 entries → CRITICAL
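The threshold rules above amount to two greater-or-equal comparisons, with the critical check taking precedence. A minimal sketch, assuming a hypothetical helper named evaluate_state that prints the state and returns the matching plugin exit code:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the state name and return the matching plugin exit code.
# Usage: evaluate_state COUNT WARN CRIT
evaluate_state() {
    local count="$1" warn="$2" crit="$3"
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL"; return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING"; return 1
    fi
    echo "OK"; return 0
}

# Walk through the boundary values from the example with --warn 1 --crit 10.
for n in 0 1 9 10; do
    state="$(evaluate_state "$n" 1 10)" || true
    echo "$n entries -> $state"
done
```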

Example Usage

# System-wide, err+ in last 1h (defaults)
check_journald.sh

# Shorter window, lower threshold
check_journald.sh --since 30m --warn 1 --crit 5

# Monitor a specific unit
check_journald.sh --unit sshd --priority err --warn 1 --crit 5

# Monitor multiple units
check_journald.sh --unit nginx.service --unit php-fpm.service --since 2h

# Catch warnings too, with tighter thresholds
check_journald.sh --priority warning --warn 10 --crit 50

# Exclude known-noisy patterns
check_journald.sh --unit postfix.service --exclude "Connection reset by peer|Timeout"

# No context lines in output (cleaner for some dashboards)
check_journald.sh --show-lines 0 --warn 5 --crit 20

Example Output

[OK]: 0 err+ log entries in last 1h [system-wide] | log_errors=0;1;10;0;
[WARNING]: 3 err+ log entries in last 1h [system-wide] | log_errors=3;1;10;0;
  2025-10-01T08:12:44+0200 myhost kernel: EXT4-fs error (device sdb1)
  2025-10-01T08:13:01+0200 myhost sshd[2341]: error: Could not load host key
  2025-10-01T08:14:22+0200 myhost postfix[9812]: error: connect to smtp.example.com
[CRITICAL]: 14 err+ log entries in last 1h [unit=nginx.service] | log_errors=14;1;10;0;
  2025-10-01T08:19:11+0200 myhost nginx[441]: [error] connect() failed (111: Connection refused)
  2025-10-01T08:19:14+0200 myhost nginx[441]: [error] connect() failed (111: Connection refused)
  2025-10-01T08:19:17+0200 myhost nginx[441]: [error] no live upstreams while connecting to upstream
[OK]: 0 err+ log entries in last 2h [units=nginx.service,php-fpm.service] (exclude: Timeout) | log_errors=0;1;10;0;
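The status lines above follow the usual plugin layout: human-readable text, a pipe, then performance data in label=value;warn;crit;min;max order. The sketch below reproduces that layout; format_status is a hypothetical helper, and its argument order is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper that reproduces the layout of the sample status lines.
# Usage: format_status STATE COUNT PRIORITY SINCE SCOPE WARN CRIT
format_status() {
    local state="$1" count="$2" priority="$3" since="$4" scope="$5" warn="$6" crit="$7"
    # Perfdata fields: value;warn;crit;min;max (max left empty, as in the samples).
    printf '[%s]: %s %s+ log entries in last %s [%s] | log_errors=%s;%s;%s;0;\n' \
        "$state" "$count" "$priority" "$since" "$scope" "$count" "$warn" "$crit"
}

format_status WARNING 3 err 1h system-wide 1 10
```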

Return Codes

Code  State
0     OK
1     WARNING
2     CRITICAL
3     UNKNOWN

NRPE Integration

Example /etc/nagios/nrpe.cfg:

# System-wide error check
command[check_journald]=/usr/lib/nagios/plugins/check_journald.sh --since 1h --warn 1 --crit 10

# Per-unit check
command[check_journald_nginx]=/usr/lib/nagios/plugins/check_journald.sh --unit nginx.service --warn 1 --crit 5

Or with dynamic arguments (requires dont_blame_nrpe=1 in nrpe.cfg):

command[check_journald]=/usr/lib/nagios/plugins/check_journald.sh $ARG1$

(Restart NRPE after changes)


Notes

  • The --since value is passed to journalctl as --since "<value> ago". Supported formats follow systemd's time span syntax, e.g. 30m, 1h, 2h, 1d
  • The monitoring user needs to be a member of the systemd-journal group to read the journal without root. Add with usermod -aG systemd-journal nagios
  • --warn 1 (default) means a single error in the window triggers a WARNING. Raise this threshold for noisy services where some errors are expected
  • --exclude is applied after journalctl filtering — it does not affect what journald returns, only what gets counted
  • The --show-lines context is appended as extra lines after the main status line, compatible with Icinga2's $output$ and $long_output$ variables
  • When multiple --unit flags are given, journalctl filters by any of the specified units (OR logic)
  • Pair the check interval with your --since window to avoid counting the same entries twice (e.g. 1h check interval with --since 1h)
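The notes on --since and repeated --unit flags can be pictured as argument construction for journalctl. This is a sketch under assumptions, not the script's actual code: build_journalctl_args is a hypothetical helper, and the choice of --no-pager is illustrative. It only prints the arguments, so it runs without a journal.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print one journalctl argument per line for the given options.
# Usage: build_journalctl_args SINCE PRIORITY [UNIT...]
build_journalctl_args() {
    local since="$1" priority="$2" unit
    shift 2
    # The lookback value becomes a relative timestamp, e.g. "2h ago".
    printf '%s\n' "--no-pager" "--since" "$since ago" "--priority" "$priority"
    # Each unit gets its own --unit flag; journalctl matches any of them.
    for unit in "$@"; do
        printf '%s\n' "--unit" "$unit"
    done
}

build_journalctl_args 2h err nginx.service php-fpm.service
```

In bash 4 or later, the printed arguments could be collected with mapfile -t args < <(build_journalctl_args 2h err nginx.service) and passed on as journalctl "${args[@]}".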

License

MIT