
Tools: My Nginx Died at 2 AM and Nobody Noticed for 6 Hours. Now I Have a Watchdog Script. (2026)

2026-05-21, admin

Nginx crashed on a Saturday night. An OOM kill, probably: I was running a Node app that leaked memory like a broken faucet. The service went down at 2:14 AM. I found out at 8:30 AM when I opened my laptop and saw Slack messages from six hours earlier asking why the site was down.

The fix took 10 seconds: sudo systemctl start nginx. The downtime cost me a weekend of credibility.

The thing is, systemd already knows when a service dies. I just wasn't asking it to check. So I wrote a script that asks every 60 seconds and restarts the service if it's down. Took less time to write than it did to explain the outage to my team.

The Script

```bash
#!/bin/bash
CHECK="✓"
CROSS="✗"

# --- Configuration ---
SERVICE="nginx"                          # Change to your service name
LOG_FILE="/var/log/service-watchdog.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')
NOTIFY_EMAIL=""                          # Optional: address for alerts; empty disables email

# --- Check if service is running ---
if systemctl is-active --quiet "$SERVICE"; then
    echo "$CHECK [$DATE] $SERVICE is running"
else
    echo "$CROSS [$DATE] $SERVICE is NOT running, attempting restart..."
    echo "$CROSS [$DATE] $SERVICE DOWN, restarting" >> "$LOG_FILE"

    # --- Attempt restart ---
    if sudo systemctl start "$SERVICE"; then
        echo "$CHECK [$DATE] $SERVICE restarted successfully" | tee -a "$LOG_FILE"

        # --- Optional: send email notification ---
        if [ -n "$NOTIFY_EMAIL" ]; then
            echo "$SERVICE was down and has been restarted on $(hostname) at $DATE" \
                | mail -s "[RECOVERED] $SERVICE restarted" "$NOTIFY_EMAIL"
        fi
    else
        echo "$CROSS [$DATE] $SERVICE FAILED to restart, manual intervention needed" \
            | tee -a "$LOG_FILE"

        if [ -n "$NOTIFY_EMAIL" ]; then
            echo "$SERVICE failed to restart on $(hostname) at $DATE. Check: journalctl -u $SERVICE" \
                | mail -s "[CRITICAL] $SERVICE restart failed" "$NOTIFY_EMAIL"
        fi
    fi
fi
```

Why systemctl is-active --quiet and Not Something Else

I've seen people use ps aux | grep nginx for this. Don't. Here's why:

ps aux | grep nginx has a classic gotcha: the grep command itself shows up in the results because the word "nginx" is in the grep command line. People "fix" this with grep -v grep, which works but is fragile and ugly. You're parsing process tables to answer a question that systemd already tracks natively.

systemctl is-active --quiet "$SERVICE" asks systemd directly: "is this unit in the active state?" The --quiet flag suppresses output and just returns an exit code. 0 means active. Anything else means it's not running. Clean, reliable, no string parsing.

The Two-Level Failure Check

This isn't just "is it down → restart it." There are two separate failure modes:

Level 1: Is the service running? If yes, print the check mark and exit. Nothing is written to $LOG_FILE, so the watchdog log only records incidents.

Level 2: If the service is down and we try to restart it, did the restart actually work? systemctl start can fail for a dozen reasons: masked unit, broken config file, dependency that's also down, port already in use by something else. The script checks the exit code of the start command and sends a different email depending on whether recovery succeeded or failed.

The [RECOVERED] email means the script fixed it and you can keep sleeping. The [CRITICAL] email means something is actually broken and you need to look at it. That distinction matters at 3 AM.

Setting It Up with Cron

```
* * * * * /home/user/service-watchdog.sh >> /var/log/watchdog-cron.log 2>&1
```

That runs every single minute. Is that overkill? Maybe. But the script finishes in under 100ms when the service is healthy, and the alternative is 6 hours of downtime on a Saturday night. I'll take the overkill. One thing to know: the redirect also captures the "is running" line, so watchdog-cron.log grows by one line per minute even when everything is fine.

One gotcha with cron and sudo: cron runs with a minimal environment and no terminal. If sudo systemctl start needs a password, there is nobody to type it and the restart never happens. You need a sudoers rule:

```
sudo visudo
# Add this line (use the path that `command -v systemctl` prints on your system):
youruser ALL=(ALL) NOPASSWD: /bin/systemctl start nginx
```

Or just run the watchdog cron as root.

What Else I Watch With This

The SERVICE variable takes any systemd unit name. I run separate copies for:

- nginx: the web server
- mysql or mariadb: the database
- docker: the container daemon
- Custom services: my-node-app.service, redis-server, postgresql

If you want to watch multiple services in one script, loop through them:

```bash
SERVICES=("nginx" "mysql" "redis-server")
for SERVICE in "${SERVICES[@]}"; do
    # ... same check logic ...
done
```

But I prefer separate scripts per service because the logs stay clean and each one can have a different notification strategy.

Pairing This With Other Scripts

This watchdog handles the restart. But if you also want to know why the service died, pair it with:

- Monitor CPU & RAM Usage: catches the OOM conditions that kill services in the first place
- Send Email Alert from Bash: the email sending setup if you've never configured mail on Linux

Between these three scripts, you've got a basic monitoring stack that runs entirely on cron and costs nothing.

Full script, the line-by-line breakdown, cron setup walkthrough, and three more variations: bashsnippets.xyz/snippets/restart-service-if-stopped.html

If you're managing any Linux server with services that need to stay up, this takes 5 minutes to deploy and runs quietly forever.
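
The multi-service loop in the article leaves its body as "same check logic". One way to fill that in, as a rough sketch rather than the author's actual script, is to wrap the check in a function and call it once per unit. The names check_service and watch_services, and the file name multi-watchdog.sh, are made up for illustration. The sketch also swaps in sudo -n, which is not in the original: that flag makes sudo fail immediately instead of asking for a password, so a missing sudoers rule shows up as a failed restart. Email notification is left out to keep it short.

```shell
#!/bin/bash
# Sketch only: one watchdog pass over several systemd units.
# Function names and the default log path are illustrative assumptions.
LOG_FILE="${LOG_FILE:-/var/log/service-watchdog.log}"

# Check one unit and start it if it is not active.
# Returns 0 if the unit is active or was started, 1 if the start failed.
check_service() {
    local svc="$1"
    local now
    now=$(date '+%Y-%m-%d %H:%M:%S')

    if systemctl is-active --quiet "$svc"; then
        echo "✓ [$now] $svc is running"
        return 0
    fi

    echo "✗ [$now] $svc DOWN, restarting" >> "$LOG_FILE"

    # Assumption: sudo -n (non-interactive) replaces the plain sudo call,
    # so a password prompt becomes an immediate failure under cron.
    if sudo -n systemctl start "$svc"; then
        echo "✓ [$now] $svc restarted successfully" | tee -a "$LOG_FILE"
        return 0
    fi

    echo "✗ [$now] $svc FAILED to restart, manual intervention needed" | tee -a "$LOG_FILE"
    return 1
}

# Run the check for every unit name given; keep going after a failure.
# Returns 1 if any unit could not be started, 0 otherwise.
watch_services() {
    local failed=0 svc
    for svc in "$@"; do
        check_service "$svc" || failed=1
    done
    return "$failed"
}

# Only act when unit names are passed on the command line.
if [ "$#" -gt 0 ]; then
    watch_services "$@"
fi
```

With this shape the cron entry would pass the unit names as arguments, for example /home/user/multi-watchdog.sh nginx mysql redis-server. A failure on one unit does not stop the others from being checked, and the non-zero exit status tells the caller that at least one restart did not work.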

