Why Linux Kills Your App Without Warning (The OOM Killer, Explained)

Contents

- The Setup
- The First Wrong Turn
- Finding the OOM Killer in the Act
- Why Does the Kernel Do This?
- The Short-Term Fix
- The Long-Term Fix
- Prevention Checklist
- What Interviewers Look For
- Practice It Interactively
There's a specific flavor of production incident that everyone encounters eventually: your app just... disappears. No stack trace. No crash log. No application errors. The process is running, and then it isn't.

If you've ever been on call at 2 a.m. staring at systemd telling you Main process exited, code=killed, signal=KILL with zero explanation, this post is for you. I turned this exact scenario into one of 18 debugging exercises at scenar.site - practice it interactively with an AI interviewer. Details at the end.

The Setup

A Java service on an 8GB server. It runs for hours, sometimes days, then dies. The developers swear there's no bug. You check the logs: nothing. Just normal operation right up until the end, then silence. All systemd has to say is:

$ systemctl status myapp
● myapp.service - My Java Application
     Loaded: loaded (/etc/systemd/system/myapp.service; enabled)
     Active: failed (Result: signal) since Tue 2026-01-20 14:23:15 UTC

Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Main process exited, code=killed, signal=KILL
Jan 20 14:23:15 app-server-01 systemd[1]: myapp.service: Failed with result 'signal'.

The First Wrong Turn

signal=KILL. SIGKILL. The nuclear option: the signal you can't catch, can't ignore, and can't clean up from. Something sent SIGKILL to your Java process.

Most people (past me included) start grepping application logs. You won't find anything. SIGKILL doesn't let the process finish its current syscall, let alone flush a log buffer. The app didn't crash - it was executed.

Finding the OOM Killer in the Act

The right first move is to ask: who could have killed this? Your options are:

- A human with sudo (ask around)
- A monitoring tool with an auto-remediation rule (check)
- The kernel itself

If it's #3, there's exactly one culprit: the OOM killer. The OOM killer writes to the kernel ring buffer. You read it with dmesg:

$ dmesg | grep -i oom
[482341.234] myapp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[482341.234] Out of memory: Killed process 8921 (java) total-vm:6291456kB, anon-rss:4194304kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0
[482341.235] oom_reaper: reaped process 8921 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

There it is. The kernel killed PID 8921 (your Java process) to free memory. anon-rss:4194304kB means your process was using 4GB of anonymous RSS when it got killed.

Why Does the Kernel Do This?

Linux overcommits memory. When you malloc(1GB), the kernel says "sure" without actually having 1GB free. It's a bet that most processes won't use all the memory they request. Usually that bet pays off. When it doesn't, the kernel has to pick a process to kill to free memory - because the alternative is the whole system locking up.

The victim is selected by oom_score (higher = more likely to be killed), which weights recent memory usage, process age, and oom_score_adj. Check it live:

$ ps aux --sort=-%mem | head
USER       PID %CPU %MEM     RSS COMMAND
mysql     2341 15.2 45.5 3728000 /usr/sbin/mysqld
root      3456  5.1 28.3 2320000 /usr/bin/prometheus
elastic   4567  8.4 18.2 1490000 /usr/share/elasticsearch/jdk/bin/java

$ cat /proc/2341/oom_score
456
$ cat /proc/8921/oom_score
512

The Java app had the highest score, so it got picked.

The Short-Term Fix

You have three options for right now.

1. Reduce the Java heap. If your systemd unit has -Xmx4g, the JVM will absolutely use 4GB. Drop it:

ExecStart=/usr/bin/java -Xmx2g -jar /opt/myapp/app.jar

2. Protect critical processes with oom_score_adj. The range is -1000 (never kill) to 1000 (kill first):

$ echo -500 > /proc/$(pidof java)/oom_score_adj

Don't set -1000 unless you're sure you want that process to be the last thing running before the kernel panics. I've seen "protected" apps keep a dying system from recovering.

3. Add swap. It buys you time, but swapping kills performance. Emergency use only.

The Long-Term Fix

The real problem is usually that the server is overcommitted. MySQL wants 4GB, Prometheus wants 2GB, Elasticsearch wants 1.5GB, Java wants 4GB - on an 8GB box. That's 11.5GB of ask on 8GB of capacity. Something has to give eventually.

- Use cgroups via systemd's MemoryMax= to enforce limits per service. This is the proper fix: each service gets a guaranteed ceiling, and a service that exceeds its own cgroup limit gets killed inside that cgroup without taking the whole box down.

  [Service]
  ExecStart=/usr/bin/java -Xmx2g -jar /opt/myapp/app.jar
  MemoryMax=2.5G
  MemoryHigh=2G

- Move workloads off shared hosts. Put the DB on its own box, the app servers on theirs. Stop co-locating memory-hungry services.

- Monitor memory pressure, not just memory usage. /proc/pressure/memory (PSI) tells you when processes are stalled waiting for memory, which is a much earlier signal than "out of memory" alerts.

Prevention Checklist

Before the next OOM kill:

- Every service has a MemoryMax= in its systemd unit
- Alert on available memory below 10% for 5 minutes, not just on events after death
- Alert on memory PSI (avg10 > 10) - it catches swapping and thrashing before the OOM kill
- Java apps run with -XX:+HeapDumpOnOutOfMemoryError so you get something when the JVM itself runs out of heap (different from OS-level OOM)
- Document which processes are "protected" (oom_score_adj < 0) and why

What Interviewers Look For

If this comes up in an SRE interview, they're testing for:

- Do you know SIGKILL can't be caught, so the absence of logs is a clue, not a failure?
- Do you go to dmesg without being told to?
- Can you explain why the kernel kills processes (overcommit, not "a bug")?
- Do you talk about cgroups as the structural fix, not just tuning oom_score_adj?
- Can you distinguish OS-level OOM from JVM-level OutOfMemoryError?

The last one catches a surprising number of people: they conflate the two and assume bumping -Xmx fixes everything, when a bigger heap actually makes OS-level OOM more likely.

Practice It Interactively

This scenario is one of 18 on scenar.site. You describe your debugging approach in plain English, and an AI simulates the broken server with realistic command output while playing interviewer, pushing back on your reasoning. Free tier gets you started, no credit card.
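The claim that SIGKILL can't be caught is easy to verify yourself. A minimal sketch in plain bash, assuming nothing beyond a POSIX-ish system:

```shell
# SIGTERM is catchable: the trap fires and the script carries on.
bash -c 'trap "echo caught SIGTERM" TERM; kill -TERM $$; echo still alive'

# SIGKILL is not: the shell dies mid-script with no chance to run a handler
# or flush a log. Exit status 137 = 128 + 9 (SIGKILL's signal number).
bash -c 'kill -KILL $$; echo never printed' || echo "killed, exit status $?"
```

This is exactly why the absence of application logs is a clue rather than a failure: the process never gets a turn after SIGKILL is delivered.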
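The per-PID oom_score reads shown earlier generalize to a one-shot survey of the whole box. A sketch that assumes a Linux /proc; nothing here is specific to this incident:

```shell
# Rank processes by oom_score, the number the OOM killer sorts victims by.
# Processes can exit between the glob and the read, hence the error guard.
for pid in /proc/[0-9]*; do
  score=$(cat "$pid/oom_score" 2>/dev/null) || continue
  comm=$(cat "$pid/comm" 2>/dev/null)
  printf '%6s %7s  %s\n' "$score" "${pid#/proc/}" "$comm"
done | sort -rn | head
```

The top of this list is the kernel's current kill order, which is worth knowing before the OOM killer decides for you.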
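The overcommit arithmetic ("11.5GB of ask on 8GB of capacity") is visible directly in /proc/meminfo: Committed_AS is what the kernel has promised, MemTotal what it physically has. A Linux-only sketch:

```shell
# Compare promised memory (Committed_AS) with physical memory (MemTotal).
# /proc/meminfo reports both in kB; 1048576 kB = 1 GB.
awk '/^(MemTotal|Committed_AS):/ { printf "%-13s %7.1f GB\n", $1, $2 / 1048576 }' /proc/meminfo
```

Committed_AS well above MemTotal means the kernel's bet is live: if processes touch what they asked for, something gets killed.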
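For the PSI alert in the checklist, the avg10 field of /proc/pressure/memory can be pulled out with awk. A sketch that assumes a PSI-enabled kernel (CONFIG_PSI) and degrades gracefully without one:

```shell
# Extract the "some avg10" value: the percentage of the last 10 seconds in
# which at least one task was stalled waiting for memory.
psi=/proc/pressure/memory
if [ -r "$psi" ]; then
  avg10=$(awk '/^some/ { for (i = 1; i <= NF; i++) if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i } }' "$psi")
  echo "memory PSI avg10: $avg10"
else
  echo "PSI not available on this kernel"
fi
```

Alerting when avg10 stays above 10 catches swapping and thrashing well before an OOM kill, which is the point of monitoring pressure rather than raw usage.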