Tools: The Hidden Life of a Container: A Complete Lifecycle - Full Analysis

Before the Container: The Image as a Promise

Stage 1: docker create — Assembling the Environment

Stage 2: docker start — Crossing the Threshold

Stage 3: Running — What the Commands Actually Do

Stage 4: docker stop — The Two-Phase Shutdown

The First Danger Zone: Zombie Processes

The Second Danger Zone: The OOM Killer

Stage 5: docker rm — Dismantling the Environment

The Full Picture

The anatomy of a container tells you what the walls are made of. This article tells you when they go up, what happens inside them, and what the kernel does the moment they come down.

In the previous article, we dissected a running container at a single moment in time: the namespaces building the illusion of isolation, the cgroups enforcing the resource constitution, OverlayFS merging layers into a coherent filesystem, the veth pair wiring the container into the network. A cross-section. A photograph.

But a photograph doesn't tell you how the subject got there, or what happens after the shutter closes. That's what this piece is for. We'll follow a single container from before it exists to after it's gone, and at each transition we'll look at what actually changes at the kernel level. The machinery is the same as before. Now we watch it move.

We'll also cover the three places where production systems quietly break in ways that monitoring won't catch until it's too late: signal handling, zombie processes, and the OOM killer.

Before the Container: The Image as a Promise

Before anything runs, the runtime needs something to run from. An image isn't a virtual machine disk or a binary blob. It's a stack of read-only filesystem layers, each one a delta on the last, stored under /var/lib/docker/overlay2/ and identified by SHA256 digest rather than name.

When you run docker pull, the daemon fetches a manifest first: a JSON document listing every layer by digest. It then checks which of those digests already exist locally and downloads only the missing ones. This is why pulling a new version of an image that shares a base layer with one you already have takes seconds: the common layers are already there. The registry protocol is content-addressed, which means layers are inherently deduplicated across every image on the host that uses them.

Each downloaded layer is verified against its digest before being committed to the store.
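
The pull logic described above can be sketched in a few lines. This is a toy model, not Docker's implementation: `LayerStore`, `pull`, and the in-memory `registry` dictionary are hypothetical names invented for illustration, and real layers are compressed tarballs rather than short byte strings. It only shows the two properties the text relies on: layers keyed by digest are fetched once, and a blob is verified before it is committed.

```python
import hashlib


def digest_of(blob):
    """Content address of a blob, in the familiar sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


class LayerStore:
    """Toy content-addressed store: layers are keyed by their digest."""

    def __init__(self):
        self.layers = {}  # digest -> bytes

    def missing(self, manifest):
        """Digests listed in the manifest that are not stored locally."""
        return [d for d in manifest if d not in self.layers]

    def commit(self, digest, blob):
        """Verify the blob against its digest before storing it."""
        actual = digest_of(blob)
        if actual != digest:
            raise ValueError(f"digest mismatch: expected {digest}, got {actual}")
        self.layers[digest] = blob


def pull(store, manifest, registry):
    """Download only the layers the store lacks; return how many were fetched."""
    to_fetch = store.missing(manifest)
    for digest in to_fetch:
        store.commit(digest, registry[digest])
    return len(to_fetch)


# Two image versions that share a base layer (hypothetical contents).
base, app_v1, app_v2 = b"base os files", b"app v1", b"app v2"
registry = {digest_of(b): b for b in (base, app_v1, app_v2)}

store = LayerStore()
first = pull(store, [digest_of(base), digest_of(app_v1)], registry)
second = pull(store, [digest_of(base), digest_of(app_v2)], registry)
print(first, second)  # 2 1: the second pull reuses the shared base layer
```

Calling `store.commit()` with a digest that does not match the bytes raises `ValueError`, which is the toy equivalent of a tampered layer being rejected.
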
A tampered or corrupted layer fails this check before it touches anything.

The image at rest is inert. It's a promise of what the container's filesystem will look like when it starts, nothing more. The OverlayFS that merges those layers into a live, writable filesystem doesn't exist yet. That comes with create.

Stage 1: docker create — Assembling the Environment

Here is something most engineers don't know: docker run is not a single operation. It is docker create followed immediately by docker start. Knowing this distinction is practically useful, but more importantly, it clarifies when each piece of kernel infrastructure comes into existence.

docker create builds the complete environment the container will inhabit. Nothing executes. No process runs. It is pure allocation.

The OverlayFS mount. The runtime creates a new upperdir: an empty, writable directory unique to this container instance. The image's read-only layers become the lowerdir. The OverlayFS is configured to present both as a single unified filesystem: reads fall through to the lower layers, writes land in the upper layer, and the container will see a seamless root filesystem when it starts. The layers you read about in the anatomy piece are present here. The writable layer that captures everything your container does is created here.

The namespace allocation. The namespaces (PID, NET, MNT, IPC, UTS, and USER when user-namespace remapping is enabled) are prepared. At this point they're reserved but not yet active. There's nothing to isolate, so isolation hasn't started.

The cgroup hierarchy. A cgroup is created for this container under /sys/fs/cgroup/. The memory limit from --memory 512m and the CPU quota from --cpus 0.5 are written into the cgroup's control files right now. The cgroup exists. The limits are set. But with no processes inside it, it enforces nothing: a constitution with no citizens.

The network configuration. The veth pair is created: one end destined for the container's network namespace, one end connecting to the docker0 bridge on the host. An IP is assigned.
NAT rules are written to iptables for port mappings. The virtual cable exists, but nothing is plugged in on the container end yet.

After docker create, the container is a complete kernel data structure (every wall built, every rule written, every wire run) with no tenant. Its state is "created". Nothing has been charged to the CPU. Nothing has touched the writable layer.

The practical implication: you can pre-warm containers during off-peak hours and start them in milliseconds when demand arrives. The expensive work of building the environment is already done.

Stage 2: docker start — Crossing the Threshold

This is the moment the environment becomes inhabited.

Three components work in sequence: dockerd, containerd, and containerd-shim. dockerd instructs containerd to start the container. containerd forks a containerd-shim (the small, dedicated supervisor we covered in the daemon article), which then calls runc.

runc is where the kernel call happens. It takes the container's prepared configuration and calls clone() with the namespace flags: CLONE_NEWPID, CLONE_NEWNET, CLONE_NEWNS, and the rest. The namespaces activate. The OverlayFS is mounted. The cgroup begins accounting. The veth pair connects.

Then runc calls execve() with whatever you specified as ENTRYPOINT or CMD. That process, your application, becomes PID 1 inside the container. A successful execve() never returns: it replaces the calling process image with your application. With that handoff complete, the parent runc process exits. It has done its job.

The containerd-shim stays behind, holding the container's stdin/stdout file descriptors and watching PID 1. If dockerd crashes, the shim keeps the container alive, which is the architecture we walked through in detail previously.

The moment docker start completes, everything the anatomy article described is live and active: the isolated process tree, the private network stack, the merged filesystem, the cgroup enforcing limits. The photograph is now a film.

And this is the first place things go quietly wrong. PID 1 carries obligations that your application almost certainly doesn't know about.
We'll come back to that shortly.

Stage 3: Running — What the Commands Actually Do

A running container looks opaque from the outside. Under the hood, every docker command in this state is a thin wrapper over kernel operations you already understand.

docker exec is the most commonly misunderstood. When you run docker exec -it myapp-prod /bin/sh, Docker doesn't create a new container. It calls setns(), a system call that lets a process join existing namespaces. Your shell enters the container's existing namespaces and sees the same environment the application sees. Nothing new is created. You're walking into an existing apartment, not building a new one.

docker logs replays what the container wrote to stdout and stderr. The containerd-shim has held the container's end of those pipes since start, and the daemon's logging driver (json-file by default) persists the stream to disk. docker logs reads it back from there. No daemon magic: the shim is a persistent pipe holder and the log is a file.

docker stats reads cgroup control files directly: memory.current in the container's cgroup directory under /sys/fs/cgroup/ for memory usage, and the CPU accounting files for CPU usage. The numbers you see in the terminal are the same numbers the kernel is maintaining to enforce your limits.

docker pause is the most elegant operation in this set. It uses the cgroup freezer to suspend every process in the container at once, as a group rather than one at a time. Unlike SIGSTOP, the freeze is not delivered as a signal, so the processes have no opportunity to observe it or react. docker unpause thaws the cgroup through the same path and they resume exactly where they stopped. In-flight operations, open sockets, memory contents: all preserved.

Stage 4: docker stop — The Two-Phase Shutdown

This is where most production problems originate, and where the PID 1 decision made at start time comes due.

docker stop is a two-phase protocol.

Phase one: SIGTERM is sent to PID 1. The application has a grace period (ten seconds by default, adjustable with -t) to finish what it's doing and exit cleanly. Close database connections. Drain in-flight requests. Flush logs.
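
What phase one asks of the application can be sketched as follows. This is a minimal illustration under stated assumptions, not a production server: `ShutdownFlag`, `run`, and the job functions are hypothetical names, the cleanup steps are only listed rather than performed, and `signal.raise_signal()` (Python 3.8+) stands in for the SIGTERM that docker stop would deliver from outside.

```python
import signal


class ShutdownFlag:
    """Records that SIGTERM arrived so the main loop can wind down."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Keep the handler tiny: note the request and return.
        self.requested = True


def run(flag, jobs):
    """Run jobs in order; after SIGTERM, finish the current one and stop."""
    done = []
    for job in jobs:
        if flag.requested:
            break               # stop accepting new work
        done.append(job())      # the job already in hand still completes
    # Cleanup steps the grace period is meant to cover (names only).
    cleanup = ["close database connections", "drain in-flight requests", "flush logs"]
    return done, cleanup


def job_interrupted_by_stop():
    # Stand-in for `docker stop`: SIGTERM arrives while this job is running.
    signal.raise_signal(signal.SIGTERM)
    return "job-2"


flag = ShutdownFlag()
done, cleanup = run(flag, [lambda: "job-1", job_interrupted_by_stop, lambda: "job-3"])
print(done)     # job-3 is never started
print(cleanup)
```

The design choice worth noting is that the handler only sets a flag. The loop decides when it is safe to stop, so work already in progress is never abandoned halfway.
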
Phase two: if PID 1 is still alive when the timer expires, SIGKILL is sent. There is no negotiation with SIGKILL. The kernel terminates the process immediately and unconditionally.

The failure mode that bites teams most often: the shell wrapper. Many containers are launched like this: CMD ["sh", "-c", "node server.js"]. The shell becomes PID 1. When Docker sends SIGTERM to PID 1, the shell receives it, and by default shells do not forward signals to their children. Your Node.js process never sees SIGTERM. Ten seconds pass. SIGKILL arrives. The server dies mid-request, connections drop, and the exit code (137, that is 128 + 9 for SIGKILL) suggests a crash rather than a clean shutdown.

The fix is to make your application PID 1 directly. Use the exec form: CMD ["node", "server.js"]. Or, if you need a shell script for setup, end it with exec node server.js. The exec replaces the shell process rather than forking a child, so your application inherits PID 1.

Ten seconds is also almost always too short. An application with database connection pools to drain and requests to finish needs more. Set -t 30 as a floor. For services handling long-lived connections, -t 60 or higher is appropriate. The grace period is your only window for a clean shutdown, so size it honestly.

After a successful stop, the container's state becomes "exited". Crucially, the writable OverlayFS layer is preserved: every file the application created or modified is still there. The container can be restarted with docker start and will resume from exactly the filesystem state it stopped in.

The First Danger Zone: Zombie Processes

On a normal Linux host, PID 1 is the init system. One of init's core responsibilities, one so fundamental it's baked into the kernel's design, is reaping dead child processes.

When a process exits, it doesn't fully disappear. Its kernel entry stays open, holding its exit code, until its parent calls wait() to collect that code. Until then, it sits in state Z: a zombie. Dead, but not yet buried.
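
The zombie state is easy to observe directly. The sketch below assumes Linux, since it reads the process state from /proc. `process_state` and `wait_for_state` are helper names made up for this example. The parent forks a child that exits at once, watches it sit in state Z, and then reaps it with waitpid().

```python
import os
import time


def process_state(pid):
    """One-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        # Layout is "pid (comm) state ..."; comm may contain spaces,
        # so split on the last ')' before reading the state field.
        return f.read().rsplit(")", 1)[1].split()[0]


def wait_for_state(pid, wanted, timeout=2.0):
    """Poll until the process reaches the wanted state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if process_state(pid) == wanted:
            return True
        time.sleep(0.01)
    return False


pid = os.fork()
if pid == 0:
    os._exit(7)  # child: exit immediately with status 7

# The parent has not called wait() yet, so the dead child lingers as a zombie.
became_zombie = wait_for_state(pid, "Z")

# Reaping: collect the exit status, which lets the kernel free the entry.
_, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)
still_listed = os.path.exists(f"/proc/{pid}")

print(became_zombie, exit_code, still_listed)
```

An application running as PID 1 that never makes the waitpid() call leaves every exited child in the first state indefinitely.
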
Init's job is to call wait() on every process that loses its parent, so the kernel slot gets freed. Without init doing this, zombie entries accumulate indefinitely.

In a container, your application is PID 1. And your application (a web server, an API, a worker) was written to serve requests, not to supervise processes. It almost certainly never calls wait().

The scenario: your web server forks a worker to handle a long request. The worker finishes and calls exit(). It waits for its parent to acknowledge the exit. The parent never does. The worker's kernel entry sits in Z state. Do this at scale, over days of production traffic, and the PID table fills. Eventually fork() fails, the container cannot spawn new processes, and the application breaks in a way that looks nothing like its actual cause.

docker stats will show the container as healthy throughout. CPU and memory are fine. The zombie count isn't surfaced by any standard metric.

The fix is a proper init process sitting in front of your application. tini is the canonical choice: purpose-built for containers, less than a thousand lines of C, and focused on one job, reaping zombies and forwarding signals correctly. If you don't want to touch the image, docker run --init injects a bundled init binary automatically. Use one or the other on every production container, as a default, without exception.

To check for zombies in a running container, run docker exec myapp-prod ps aux | grep ' Z '. Any output means you have a problem. The fix is tini. The time to add it is before the deployment, not during the incident.

The Second Danger Zone: The OOM Killer

The cgroup memory limit (set at create time, written to memory.max in cgroup v2) is not a suggestion. When the container's memory usage reaches that ceiling and the kernel cannot reclaim enough pages to stay under it, the kernel's Out-Of-Memory killer activates.

The OOM killer scores every process in the cgroup by a "badness" calculation: how much memory the process uses relative to the memory available, adjusted by that process's oom_score_adj value. The highest scorer is sent SIGKILL.
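
The selection rule can be illustrated with a simplified model. This is a sketch, not the kernel's code: it assumes the score is resident pages plus oom_score_adj scaled by one thousandth of the available pages, ignores swap and page-table memory, and treats an oom_score_adj of -1000 as an exemption. `Proc`, `badness`, and `pick_victim` are hypothetical names, and the process figures are invented.

```python
from dataclasses import dataclass

OOM_SCORE_ADJ_MIN = -1000  # assumed here to mean "never select this process"


@dataclass
class Proc:
    pid: int
    name: str
    rss_pages: int
    oom_score_adj: int = 0


def badness(proc, total_pages):
    """Simplified score: memory footprint shifted by oom_score_adj."""
    if proc.oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None  # exempt from selection
    return proc.rss_pages + proc.oom_score_adj * (total_pages // 1000)


def pick_victim(procs, total_pages):
    """Return the process with the highest score, or None if all are exempt."""
    scored = [(badness(p, total_pages), p) for p in procs]
    candidates = [(score, p) for score, p in scored if score is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# A 512 MiB limit expressed in 4 KiB pages.
limit_pages = 512 * 1024 * 1024 // 4096

procs = [
    Proc(pid=1, name="app", rss_pages=100_000),
    Proc(pid=12, name="sidecar", rss_pages=20_000, oom_score_adj=500),
    Proc(pid=20, name="helper", rss_pages=120_000, oom_score_adj=-1000),
]

victim = pick_victim(procs, limit_pages)
print(victim.name)  # app: the largest non-exempt score

only_app = pick_victim(procs[:1], limit_pages)
print(only_app.pid)  # 1: with a single candidate, PID 1 is chosen
```

In this model the helper uses the most memory but is exempt, and the sidecar's positive adjustment is not enough to outscore the application.
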
In a single-process container, there is exactly one candidate. PID 1 always loses.

The critical distinction from docker stop: the OOM killer sends no SIGTERM. There is no grace period. No warning, no chance to flush state, close connections, or write a final log line. The process simply stops existing.

Docker records this in the container's inspect output: docker inspect myapp-prod --format '{{.State.OOMKilled}}' prints true. The kernel also logs the kill, which you can find with dmesg | grep -i "oom\|killed process".

Mitigation works at two levels.

The first is setting a soft limit below the hard limit, for example docker run --memory 512m --memory-reservation 400m myapp. The reservation matters when the host is short on memory: under that contention, the kernel tries to reclaim pages from the container to pull it back toward the reservation before the hard ceiling is reached. It's an early warning system rather than a wall. The container can still exceed the reservation temporarily; only --memory is the hard stop.

The second level is treating OOM kills as bugs, not tuning parameters. A container that is repeatedly OOM-killed has either a memory leak or limits set below its actual working set. Raising the limit is a short-term fix. Finding the leak is the real work.

Stage 5: docker rm — Dismantling the Environment

docker rm is the exact inverse of docker create. Every kernel structure that create allocated is released in reverse.

The OverlayFS upper layer is deleted. Every file your application created, modified, or deleted since the container started, the entire writable history, is gone. If your application wrote logs or state directly to the container filesystem, it's unrecoverable. This is the reason containerised applications should write persistent data to mounted volumes, not the container filesystem: docker rm treats the writable layer as ephemeral by design.

The veth pair is removed. The IP address returns to Docker's pool. The iptables NAT rules for port mappings are deleted. The network configuration from create is fully unwound.

The cgroup hierarchy is destroyed. The entries under /sys/fs/cgroup/ that tracked this container's memory and CPU are removed.
The kernel stops accounting for this container entirely.

What survives: the image layers. The read-only lowerdir that formed the container's base filesystem is untouched and available immediately for the next container. Named volumes survive as well: passing --volumes to docker rm removes only the anonymous volumes associated with the container. Persistent data stored in named volumes outlives the container that created it, which is the intended design.

The Full Picture

The anatomy article showed you the ingredients. This one showed you the recipe: the same namespaces, cgroups, OverlayFS, and veth pairs assembled in sequence, operated during a container's life, and dismantled at its end.

The mental model to carry forward:

Create is allocation, start is execution. Every kernel structure (namespaces, cgroups, OverlayFS, veth pairs) exists before your application runs a single instruction. This is why pre-warming works and why restart is fast.

PID 1 is a contract, not just a position. The kernel gives PID 1 responsibilities that init normally handles: signal forwarding and zombie reaping. If your application holds that position, it inherits those responsibilities. Use tini or --init by default.

SIGTERM is your only warning. The grace period between SIGTERM and SIGKILL is the only window you get for a clean shutdown. Your application must handle the signal, and the window must be sized for your actual workload. Ten seconds is almost always too short.

The OOM killer gives no warning at all. There is no SIGTERM, no grace period, no log line from your application. If OOMKilled: true appears in production, treat it as a bug, because it is one.

The writable layer is temporary. docker rm destroys it completely. Write persistent state to volumes. Treat the container filesystem as scratch space that disappears when the container does.

docker run looks like a single command. It is actually the sequential execution of eight distinct kernel operations, each with observable state and specific failure modes.
The engineers who debug containers fastest aren't the ones who know the most flags. They're the ones who can trace a problem back to the operation where the kernel did something unexpected.

Part of an ongoing series on container internals. Previous: A Deep Dive into Docker: What Ticks Under the Hood?

Commands used in this article

Dockerfile lines that install tini (Debian or Ubuntu base image) and run the application under it:

    RUN apt-get update && apt-get install -y tini
    ENTRYPOINT ["/usr/bin/tini", "--", "/usr/local/bin/myapp"]

Check a running container for zombie processes:

    docker exec myapp-prod ps aux | grep ' Z '

Check whether the container was OOM-killed (the second line is the output):

    docker inspect myapp-prod --format '{{.State.OOMKilled}}'
    true

Find the kernel's record of the kill:

    dmesg | grep -i "oom\|killed process"

Run with a hard memory limit and a lower soft limit:

    docker run --memory 512m --memory-reservation 400m myapp