Tools: What Is Crash Recovery? How Process Managers Keep Your App Online After Failures

2026-03-12 38 views admin

What Is Crash Recovery?

How Crash Recovery Works

What Determines Recovery Speed

1. The Manager's Own Runtime

2. Process Spawn Time

3. Health Check Configuration

What Happens If an App Keeps Crashing?

Crash Recovery vs. High Availability

Measuring Crash Recovery in Your Setup

Crash Recovery in Oxmgr

Your production app crashes. A bug slips through, memory spikes, or a network dependency times out and throws an unhandled exception. It doesn't matter why. What matters is what happens next.

What Is Crash Recovery?

Crash recovery is the automatic process of detecting that an application has died and restarting it as fast as possible, before your users have time to notice.

Without crash recovery, a process that crashes stays dead until a human intervenes. With it, the same crash can be invisible: the process restarts in milliseconds and keeps serving traffic.

Crash recovery is one of the core reasons you need a process manager in production. Without one, there's nothing watching your app to trigger a restart.

How Crash Recovery Works

Every operating system gives processes a way to signal their exit. When a process terminates, whether it crashes, runs out of memory, or is killed, the operating system reports the exit to its parent process along with a status code. A process manager listens for these events and responds in a fixed sequence: it checks whether the process is configured to restart, spawns a new process, waits for it to be ready (health check or port listen), and resumes serving traffic.

The critical variable is how long this takes. The gap between the exit event and the new process serving traffic is your downtime window.

What Determines Recovery Speed

Three factors control how fast a process manager can recover from a crash:

1. The Manager's Own Runtime

A process manager written in a scripting language (JavaScript, Python, Ruby) has to do real work to respond to an exit event: the VM needs to be scheduled, the garbage collector might pause, the event loop might be busy.

A compiled binary (Rust, Go, C) can respond in microseconds. There's no VM and no interpreter, and Rust and C have no garbage collector (Go does have one, though its pauses are typically short). The exit handler fires and the spawn call happens immediately.

This is the biggest factor. PM2 (Node.js daemon) recovers in ~400ms. Oxmgr (Rust binary) recovers in ~11ms.

2. Process Spawn Time

Spawning a new process takes time regardless of the manager. For a Node.js app, that time goes to OS process creation, Node.js startup, and application initialization (typical figures are listed under Code Block).

The process manager can't control how fast your app starts. But it can start the spawn immediately after detecting the crash, rather than waiting for a polling interval.

3. Health Check Configuration

After spawning, the manager needs to know when the process is ready.
Two approaches:

- Port listening: wait until the process binds to its port. Simple, but doesn't guarantee the app is actually serving valid responses.
- HTTP health check: poll an endpoint until it returns 200. Slower to confirm readiness, but more accurate.

For crash recovery, the key is not waiting longer than necessary. If your health check polls every 30 seconds but a crash recovers in 50ms, you can wait up to 30 seconds to confirm what already happened. A short polling interval, such as the 2-second interval_secs in the health check example under Code Block, keeps that confirmation delay small.

What Happens If an App Keeps Crashing?

Automatic restart can create a "crash loop": the app restarts, crashes immediately, restarts again, endlessly. In some ways this is worse than staying down: it makes logs unreadable and consumes CPU spinning up processes.

Most process managers handle this with restart limits and backoff. The simplest form is a cap on the number of restarts plus a fixed delay before each one (the max_restarts and restart_delay_ms example under Code Block).

Exponential backoff is more sophisticated: the delay doubles each time, for example 100ms, then 200ms, then 400ms.

This gives transient issues (network blips, temporary resource exhaustion) time to resolve while preventing runaway loops.

Crash Recovery vs. High Availability

These are related but different concepts:

- Crash recovery handles the period after a single process crashes. The goal is to minimize downtime for that process.
- High availability uses redundancy to eliminate downtime entirely: run 2+ instances so that when one crashes, the others continue serving traffic while the crashed one recovers.

With 3 instances and 11ms crash recovery, the only exposure is a user hitting the crashed instance during that window. In practice, a load balancer may already have stopped routing to the crashed process within a similar timeframe, depending on how it detects failed backends.

Measuring Crash Recovery in Your Setup

You can test your crash recovery speed manually: find the process PID, kill it with kill -9 so there is no graceful shutdown, and time how long the health endpoint takes to respond again (the commands are listed under Code Block).

For PM2 users, the same test will show you real-world recovery times rather than theoretical numbers.

Crash Recovery in Oxmgr

Oxmgr is built around the assumption that crash recovery should be invisible to users. The key settings are restart_on_exit, restart_delay_ms, max_restarts, instances, and the health_check block (full example under Code Block).

With this config, a crash on one instance triggers an immediate restart. The other instance handles traffic during the ~50ms window (11ms manager + ~40ms Node.js startup for a simple app).
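The restart policy behind settings such as max_restarts and restart_delay_ms can be illustrated outside any particular manager. The following is a minimal Python sketch, not Oxmgr's or PM2's implementation: the function name `supervise` and its parameters are hypothetical, chosen only to mirror those two settings, and the "app" is a stand-in command that always exits with status 1.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, restart_delay_ms=0):
    """Run `command`, restarting it after each non-zero exit.

    Hypothetical helper for illustration. Returns the list of exit codes
    observed, ending with a clean exit (0) or once `max_restarts`
    restarts have been used up.
    """
    exit_codes = []
    restarts = 0
    while True:
        # wait() blocks until the child exits, so the restart decision is
        # made as soon as the OS reports the exit, not on a polling interval.
        child = subprocess.Popen(command)
        code = child.wait()
        exit_codes.append(code)

        if code == 0:
            break  # clean shutdown: nothing to recover from
        if restarts >= max_restarts:
            break  # give up instead of crash-looping forever
        restarts += 1
        time.sleep(restart_delay_ms / 1000)
    return exit_codes


if __name__ == "__main__":
    # Stand-in "app" that always crashes with status 1.
    crashing_app = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing_app, max_restarts=2))  # [1, 1, 1]
```

With max_restarts=2 the command runs three times in total: the initial start plus two restarts. A real manager would also reset the counter after a period of stable uptime; that is left out here to keep the sketch short.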
See the docs for health check configuration and resource limit triggers.
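As a generic illustration of the "port listening" readiness approach described earlier, the sketch below polls a TCP port until something accepts a connection. It is an assumption-laden example rather than how Oxmgr checks readiness: `wait_for_port`, its parameters, and the 10ms polling interval are all hypothetical.

```python
import socket
import time


def wait_for_port(host, port, timeout_secs=5.0, interval_secs=0.01):
    """Poll until something accepts TCP connections on (host, port).

    Hypothetical helper for illustration. Returns the seconds waited,
    or None if `timeout_secs` elapsed without a successful connection.
    """
    start = time.monotonic()
    deadline = start + timeout_secs
    while time.monotonic() < deadline:
        try:
            # A successful connect only proves the port is bound; it says
            # nothing about whether the app returns valid responses.
            with socket.create_connection((host, port), timeout=0.25):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval_secs)  # refused or timed out: try again
    return None
```

Called immediately after killing a process, for example `wait_for_port("localhost", 3000)`, the returned value approximates the downtime window at roughly the resolution of the polling interval.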

Code Block

Restart flow (How Crash Recovery Works):

```
App process exits (status: 1 - error)
        ↓
Process manager receives exit event
        ↓
Check: is this process configured to restart?
        ↓
Yes → spawn new process
        ↓
Wait for process to be ready (health check or port listen)
        ↓
Resume serving traffic
```

Health check with a short polling interval (Health Check Configuration):

```toml
[processes.api.health_check]
endpoint = "http://localhost:3000/health"
interval_secs = 2
timeout_secs = 5
```

Restart limit and fixed delay (What Happens If an App Keeps Crashing?):

```toml
[processes.api]
max_restarts = 10       # stop trying after 10 crashes
restart_delay_ms = 500  # wait 500ms before each restart
```

Multiple instances (Crash Recovery vs. High Availability):

```toml
[processes.api]
instances = 3  # crash recovery on one instance doesn't affect the other 2
```

Manual recovery test (Measuring Crash Recovery in Your Setup):

```shell
# Find your process PID
oxmgr status

# Kill it hard (no graceful shutdown)
kill -9 <pid>

# Measure how long until it responds again.
# Note: curl treats --retry-delay 0 as "use the default backoff"
# (1 second, then doubling), so this measures with roughly
# one-second resolution, not millisecond resolution.
time curl --retry 100 --retry-delay 0 --retry-connrefused http://localhost:3000/health
```

Full example (Crash Recovery in Oxmgr):

```toml
[processes.api]
command = "node dist/server.js"
restart_on_exit = true
restart_delay_ms = 0   # restart immediately
max_restarts = 20      # allow 20 restarts before giving up
instances = 2          # run 2 instances for redundancy

[processes.api.health_check]
endpoint = "http://localhost:3000/health"
interval_secs = 10
timeout_secs = 3
```

Spawn time for a Node.js app (Process Spawn Time):

- OS process creation: ~1–5ms
- Node.js startup: ~50–200ms (depending on module load time)
- Application initialization: varies

Exponential backoff schedule (What Happens If an App Keeps Crashing?):

- Crash 1: restart after 100ms
- Crash 2: restart after 200ms
- Crash 3: restart after 400ms
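The doubling schedule in the backoff list (100ms, 200ms, 400ms) follows from multiplying a base delay by a power of two per consecutive crash. A small Python sketch of that arithmetic is shown here; `backoff_delays_ms` is a hypothetical helper, and the optional `cap_ms` ceiling is an assumption added for illustration, not something the article specifies.

```python
def backoff_delays_ms(base_ms=100, crashes=3, cap_ms=None):
    """Return the restart delay in ms applied after each consecutive crash.

    Hypothetical helper for illustration. The delay after crash n
    (counting from 1) is base_ms * 2 ** (n - 1), optionally limited
    to cap_ms.
    """
    delays = []
    for crash in range(crashes):
        delay = base_ms * 2 ** crash
        if cap_ms is not None:
            delay = min(delay, cap_ms)  # assumed ceiling, not from the article
        delays.append(delay)
    return delays


print(backoff_delays_ms())                        # [100, 200, 400]
print(backoff_delays_ms(crashes=6, cap_ms=1000))  # [100, 200, 400, 800, 1000, 1000]
```

Without a ceiling the delay grows without bound, which is why many backoff schemes cap it; whether a given manager does so depends on its configuration.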

InfinitSec - Latest Cybersecurity, Technology & Gaming News
