check_journald.sh
Nagios / Icinga plugin to count error-level log entries in systemd journald over a configurable time window.
Table of Contents
- Description
- Requirements
- Installation
- Usage
- Priority Levels
- Threshold Semantics
- Example Usage
- Example Output
- Return Codes
- NRPE Integration
- Notes
- License
Description
check_journald.sh queries journalctl for log entries at or above a configurable severity level within a configurable lookback window and alerts based on count thresholds.
It evaluates:
- Error count — number of log lines at or above the configured priority level
- Unit scope — optionally restricted to one or more specific systemd units
- Exclusion filter — optional regex to suppress known-noisy patterns
- Context lines — recent matching log lines are included in the plugin output for fast triage
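The counting, exclusion, and context steps above can be sketched roughly as follows. This is an illustration only: the function name `filter_and_count` and its argument order are hypothetical, and the script itself may be structured differently. The commented journalctl call at the end is an assumed invocation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from check_journald.sh):
# reads log lines on stdin, drops lines matching an optional extended
# regex, then prints the remaining line count followed by the last N
# lines as triage context.
filter_and_count() {
    local exclude="$1" show_lines="$2" lines
    if [ -n "$exclude" ]; then
        # grep exits 1 when every line is excluded; that is not an error here
        lines=$(grep -Ev -- "$exclude" || true)
    else
        lines=$(cat)
    fi
    if [ -z "$lines" ]; then
        echo 0
        return 0
    fi
    printf '%s\n' "$lines" | wc -l | tr -d ' '
    if [ "$show_lines" -gt 0 ]; then
        printf '%s\n' "$lines" | tail -n "$show_lines"
    fi
}

# Assumed invocation on a host with journald:
# journalctl --since "1h ago" -p err -q --no-pager -o short-iso | filter_and_count "Timeout" 3
```

The first output line is the count and any following lines are context, which keeps the count easy to separate from the context in a caller.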
Requirements
- Linux system with systemd / journald
- journalctl available in PATH
- The monitoring user must have read access to the journal:
  usermod -aG systemd-journal nagios
- bash, grep
- Nagios / Icinga (or compatible monitoring system)
Installation
- Copy the script to your plugin directory:
  /usr/lib/nagios/plugins/check_journald.sh
- Set executable permissions:
  chmod 755 /usr/lib/nagios/plugins/check_journald.sh
- Grant journal read access to the monitoring user:
  usermod -aG systemd-journal nagios
Usage
check_journald.sh [--since ] [--unit ] [--priority ]
[--warn ] [--crit ] [--show-lines ]
[--exclude ]
Parameters
| Parameter | Description |
|---|---|
| --since | Journald lookback window (default: 1h). Accepts a relative time span in journald's syntax, e.g. 30m, 2h, 1d |
| --unit | Filter to a specific systemd unit. Can be repeated for multiple units |
| --priority | Minimum log priority to count (default: err). All entries at this level and more severe are counted |
| --warn | Warning threshold for matching entry count (default: 1) |
| --crit | Critical threshold for matching entry count (default: 10) |
| --show-lines | Number of recent matching lines to append to output for triage (default: 3, set to 0 to disable) |
| --exclude | Extended regex (grep -E) to exclude matching lines from the count |
Default Values
| Parameter | Default |
|---|---|
| since | 1h |
| priority | err |
| warn | 1 |
| crit | 10 |
| show-lines | 3 |
Priority Levels
Journald uses syslog-compatible priority levels. The --priority flag sets the minimum severity to count — all entries at that level and above (more severe) are included.
| Level | Name | Typical use |
|---|---|---|
| 0 | emerg | System is unusable |
| 1 | alert | Immediate action required |
| 2 | crit | Critical condition |
| 3 | err | Error condition (default) |
| 4 | warning | Warning condition |
| 5 | notice | Normal but significant |
| 6 | info | Informational |
| 7 | debug | Debug messages |
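Because lower numbers are more severe, "at that level and above" means a numeric level less than or equal to the configured one. A small sketch of that comparison follows. The helper names `priority_to_level` and `is_counted` are hypothetical; journalctl performs this filtering itself when given a priority, so this only illustrates the rule.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a priority name (or digit) to its syslog number.
priority_to_level() {
    case "$1" in
        emerg|0)   echo 0 ;;
        alert|1)   echo 1 ;;
        crit|2)    echo 2 ;;
        err|3)     echo 3 ;;
        warning|4) echo 4 ;;
        notice|5)  echo 5 ;;
        info|6)    echo 6 ;;
        debug|7)   echo 7 ;;
        *)         return 1 ;;
    esac
}

# Succeeds when an entry at priority $1 is counted under --priority $2,
# i.e. the entry is at least as severe (numerically lower or equal).
# Returns 2 when either name is unknown.
is_counted() {
    local entry threshold
    entry=$(priority_to_level "$1") || return 2
    threshold=$(priority_to_level "$2") || return 2
    [ "$entry" -le "$threshold" ]
}
```

For example, `is_counted crit err` succeeds while `is_counted warning err` does not.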
Threshold Semantics
| Metric | Behavior |
|---|---|
| Log entry count | higher is worse |
Example with --warn 1 --crit 10:
- 0 entries → OK
- 1–9 entries → WARNING
- ≥ 10 entries → CRITICAL
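The mapping from count to state can be written as a short function. This is a sketch under the assumption that both thresholds are inclusive lower bounds, which matches the example above. The name `evaluate_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the state name and return the matching
# Nagios exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
evaluate_count() {
    local count="$1" warn="$2" crit="$3"
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}
```

The critical threshold is checked first so that a count satisfying both thresholds reports the more severe state.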
Example Usage
# System-wide, err+ in last 1h (defaults)
check_journald.sh
# Shorter window, lower threshold
check_journald.sh --since 30m --warn 1 --crit 5
# Monitor a specific unit
check_journald.sh --unit sshd --priority err --warn 1 --crit 5
# Monitor multiple units
check_journald.sh --unit nginx.service --unit php-fpm.service --since 2h
# Catch warnings too, with tighter thresholds
check_journald.sh --priority warning --warn 10 --crit 50
# Exclude known-noisy patterns
check_journald.sh --unit postfix.service --exclude "Connection reset by peer|Timeout"
# No context lines in output (cleaner for some dashboards)
check_journald.sh --show-lines 0 --warn 5 --crit 20
Example Output
[OK]: 0 err+ log entries in last 1h [system-wide] | log_errors=0;1;10;0;
[WARNING]: 3 err+ log entries in last 1h [system-wide] | log_errors=3;1;10;0;
2025-10-01T08:12:44+0200 myhost kernel: EXT4-fs error (device sdb1)
2025-10-01T08:13:01+0200 myhost sshd[2341]: error: Could not load host key
2025-10-01T08:14:22+0200 myhost postfix[9812]: error: connect to smtp.example.com
[CRITICAL]: 14 err+ log entries in last 1h [unit=nginx.service] | log_errors=14;1;10;0;
2025-10-01T08:19:11+0200 myhost nginx[441]: [error] connect() failed (111: Connection refused)
2025-10-01T08:19:14+0200 myhost nginx[441]: [error] connect() failed (111: Connection refused)
2025-10-01T08:19:17+0200 myhost nginx[441]: [error] no live upstreams while connecting to upstream
[OK]: 0 err+ log entries in last 2h [units=nginx.service,php-fpm.service] (exclude: Timeout) | log_errors=0;1;10;0;
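The first line of each example follows the usual Nagios layout of status text, a pipe, then performance data in `label=value;warn;crit;min;max` form. A sketch of how such a line might be assembled is shown here. The function `format_status` and its parameter order are hypothetical, and the optional exclude annotation is left out for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the status line with perfdata.
# Arguments: state count priority since scope warn crit
format_status() {
    local state="$1" count="$2" priority="$3" since="$4" scope="$5" warn="$6" crit="$7"
    printf '[%s]: %s %s+ log entries in last %s [%s] | log_errors=%s;%s;%s;0;\n' \
        "$state" "$count" "$priority" "$since" "$scope" "$count" "$warn" "$crit"
}

format_status WARNING 3 err 1h system-wide 1 10
```

The call at the end reproduces the WARNING status line from the examples above.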
Return Codes
| Code | State |
|---|---|
| 0 | OK |
| 1 | WARNING |
| 2 | CRITICAL |
| 3 | UNKNOWN |
NRPE Integration
Example /etc/nagios/nrpe.cfg:
# System-wide error check
command[check_journald]=/usr/lib/nagios/plugins/check_journald.sh --since 1h --warn 1 --crit 10
# Per-unit check
command[check_journald_nginx]=/usr/lib/nagios/plugins/check_journald.sh --unit nginx.service --warn 1 --crit 5
Or with dynamic arguments (requires dont_blame_nrpe=1 in nrpe.cfg):
command[check_journald]=/usr/lib/nagios/plugins/check_journald.sh $ARG1$
(Restart NRPE after changes)
Notes
- The --since value is passed to journalctl --since with " ago" appended. Supported formats follow journald's time syntax: 30m, 1h, 2h, 1d
- The monitoring user needs to be a member of the systemd-journal group to read the journal without root. Add with usermod -aG systemd-journal nagios
- --warn 1 (default) means a single error in the window triggers a WARNING. Raise this threshold for noisy services where some errors are expected
- --exclude is applied after journalctl filtering: it does not affect what journald returns, only what gets counted
- The --show-lines context is appended as extra lines after the main status line, compatible with Icinga2's $output$ and $long_output$ variables
- When multiple --unit flags are given, journalctl filters by any of the specified units (OR logic)
- Pair the check interval with your --since window to avoid counting the same entries twice (e.g. 1h check interval with --since 1h)
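The notes on --since and repeated --unit flags can be illustrated by building the journalctl argument list in a bash array. This is a sketch, not the script's actual code: `build_journal_args` and `JOURNAL_ARGS` are hypothetical names, and the extra flags (--no-pager, --quiet, --output short-iso) are assumptions chosen to match the example output format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array JOURNAL_ARGS.
# Usage: build_journal_args SINCE PRIORITY [UNIT...]
build_journal_args() {
    local since="$1" priority="$2" unit
    shift 2
    # The lookback value gets " ago" appended, as described in the notes.
    JOURNAL_ARGS=(--since "$since ago" --priority "$priority"
                  --no-pager --quiet --output short-iso)
    # One --unit flag per unit; journalctl matches any of them (OR logic).
    for unit in "$@"; do
        JOURNAL_ARGS+=(--unit "$unit")
    done
}

build_journal_args 2h err nginx.service php-fpm.service
printf '%s\n' "${JOURNAL_ARGS[*]}"
# journalctl "${JOURNAL_ARGS[@]}"   # run this on a host with journald
```

Keeping the arguments in an array rather than a string means a value such as "2h ago" stays a single argument when passed to journalctl.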
License
MIT