Tools: microVM networking from the ground up: virtio, TAP devices, guest kernels, and why your containers can't reach the internet


You spent three days on a Firecracker CI runner. Containers are starting, Docker is running, but nothing can reach the network. You've restarted dockerd four times. You've googled the error messages. You've found Stack Overflow posts that describe your exact situation and then go silent.

The problem is that every networking tutorial treats the stack as a single flat thing. In a microVM running containers, there are actually four discrete networking layers stacked on top of each other, each with its own configuration surface and its own failure modes. The moment you confuse one for another, debugging turns into guessing.

This post maps those layers. It uses a real failure sequence (a Firecracker CI runner with Docker service containers) as an anchor throughout. By the end, you should be able to audit your own setup layer by layer and know exactly where something went wrong and why.

The four layers you're actually dealing with

Here they are, briefly, because each gets a full section:

- Host networking is how the hypervisor exposes a virtual NIC to the guest: TAP devices, virtio-net, macvlan. This is the physical link of the virtual world.
- Guest kernel networking is the guest's own network stack, routing table, and interface configuration. Critically, it also means the kernel must have been compiled with the right feature flags. More on this shortly.
- Container overlay networking is the bridges, veth pairs, and NAT rules that Docker builds inside the guest. This is the layer most developers think of as "Docker networking."
- Packet filtering is netfilter, iptables, nftables, conntrack. The policy layer that sits over everything. It is also, in practice, the layer that blows up in the most confusing ways when the guest kernel is missing pieces.

Each layer has its own commands, its own config files, and its own way of failing silently. The goal of this post is to give you a clear enough mental model that a failure at layer 2 doesn't look like a layer 4 mystery.

How the host connects to the guest: TAP devices and virtio-net

The host side: TAP devices

A TAP device is a virtual network interface that looks like a real NIC to the kernel but is backed by a file descriptor that a userspace process reads and writes. Firecracker holds that file descriptor. Every packet the guest sends exits through that fd into Firecracker; every packet Firecracker writes into that fd arrives at the guest's NIC. ip tuntap add creates this interface in the kernel. TAP (as opposed to TUN) operates at Layer 2, meaning it passes full ethernet frames, not just IP packets. VMs need TAP because the guest needs to see a full NIC, not just an IP tunnel.

Once the TAP device exists, you need to tell the host how to route traffic to and from it. There are three approaches: a static route pointing at the TAP IP, bridging the TAP to the host LAN, or masquerade NAT on the host uplink. For the typical case where you just want the guest to reach the internet, masquerade is the right choice. That means two things on the host: IP forwarding must be enabled, and a MASQUERADE rule must exist on the uplink interface (the exact commands are in the reference section at the end). The first allows the host kernel to forward packets between interfaces. The second rewrites the source IP on packets leaving the host to the host's own IP, so replies know how to get back. Without both of these, the guest has no path out.

Firecracker's network-interfaces API call creates the attachment between the TAP device and the VM. You pass the host_dev_name (the TAP interface name), and optionally a static MAC address. That's the only network configuration Firecracker itself handles. Everything else is your problem. Inside the guest, the TAP device appears as a virtio-net NIC.
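Before moving inside the guest, here is what the host-side setup looks like end to end: create the TAP device, give it a host-side IP, and tell Firecracker to use it. This is a minimal sketch; the interface name, addresses, and API socket path are placeholders I've chosen for the example, not values the post prescribes.

```bash
# Create a TAP device for the VM (name and addresses are arbitrary).
ip tuntap add dev tap0 mode tap
ip addr add 172.16.0.1/24 dev tap0    # host-side IP; the guest would use 172.16.0.2
ip link set tap0 up

# Attach it to the (not yet started) VM via Firecracker's API socket.
# Assumes Firecracker was launched with --api-sock /tmp/firecracker.socket.
curl --unix-socket /tmp/firecracker.socket -X PUT \
  http://localhost/network-interfaces/eth0 \
  -H 'Content-Type: application/json' \
  -d '{"iface_id": "eth0", "host_dev_name": "tap0", "guest_mac": "AA:FC:00:00:00:01"}'
```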
The guest side: virtio-net

Unlike emulated hardware NICs such as e1000, virtio-net has no pretense of being real hardware. It uses a shared-memory ring buffer between the guest and the VMM for packet transfer. The ring buffer has three parts: a descriptor table that describes memory regions, an available ring that the guest uses to hand buffers to the device, and a used ring that the device fills when it's done with them. The guest writes a packet, places a descriptor in the available ring, and the VMM reads it directly from guest memory. No traps, no copies through the hypervisor, no emulated register reads. That's why virtio-net is fast.

For this to work, the guest kernel needs to know about virtio-net. That means CONFIG_VIRTIO_NET. If the option isn't compiled in (or loadable as a module that actually exists in the rootfs), the guest boots with no network interface at all. Not a broken interface. No interface. You can spend a long time debugging Docker before you notice ip link inside the guest shows only loopback.

vhost_net is worth knowing about separately. It moves the packet processing thread from Firecracker userspace into a kernel thread on the host, which cuts one context switch per packet. It's a performance optimization, not required for correctness. Firecracker enables it automatically when the host kernel supports it.

Guest network configuration is either static (assign an IP to eth0, set a default route pointing at the TAP's host-side IP) or via MMDS. MMDS is Firecracker's metadata service, which lets the VMM inject configuration into the guest without a DHCP server. The guest queries a special IP (169.254.169.254 by default) and gets JSON back. For CI runners where the network config is known at VM creation time, MMDS is cleaner than running a DHCP daemon.

What breaks at this layer

Test layer 1 by pinging the guest's IP from the host. If that fails, you haven't left layer 1 yet. The usual failures:

- Guest has no default route. Ping 8.8.8.8 and nothing happens. ip route show inside the guest shows no default.
- TAP device is down on the host. ip link show tap0 shows DOWN. Bring it up with ip link set tap0 up.
- ip_forward is disabled. Packets from the guest arrive at the host's TAP interface and go nowhere. The guest can ping the host's TAP IP but not anything beyond it.
- Missing MASQUERADE rule. Guest can reach the host but not the internet. iptables -t nat -L POSTROUTING shows nothing relevant.

Inside the guest kernel: what CONFIG options actually control

Everything above userspace (Docker, your CI runner, your application) depends on features being present in the guest kernel.

The kernel as the real configuration surface

Firecracker's reference kernels are deliberately minimal. They're built for short-lived, single-process serverless functions, not for running Docker with service containers. When you use them for a different workload, you inherit the consequences of those build choices.

This matters more than most people expect. A missing CONFIG_BRIDGE doesn't produce a clear error message saying "bridge support not compiled in." It produces confusing Docker startup failures or silent connectivity loss that looks like a routing problem.

How to check what your kernel has

Three useful commands: zcat /proc/config.gz (if the kernel was built with CONFIG_IKCONFIG_PROC), grep against /boot/config-$(uname -r) on a distro guest, and lsmod to see what's loaded right now. The exact invocations are in the reference section at the end.

Note that modules are only half the picture. A missing kernel feature isn't always "not a module." It might be a module that would work if the module file existed and modprobe could load it. On minimal microVM rootfs images, neither of those things is guaranteed. If lsmod doesn't show a module and modprobe can't find it, the feature doesn't exist in that environment regardless of how the kernel was compiled.

The container networking config map

These are the options that matter for running Docker inside a Firecracker guest, each with what it enables and what breaks without it:

- CONFIG_VIRTIO_NET → guest NIC driver. Without it, the VM has no network at all.
- CONFIG_BRIDGE → Linux bridge (docker0). Without it, Docker cannot create a bridge interface.
- CONFIG_VETH → virtual ethernet pairs. Without it, containers have no host-side interface.
- CONFIG_NETFILTER → the entire packet filtering framework. Without it, no iptables, no NAT.
- CONFIG_NF_TABLES → nf_tables subsystem (modern iptables backend). Missing means iptables-nft fails with EPROTONOSUPPORT.
- CONFIG_IP_NF_IPTABLES → x_tables subsystem (legacy iptables backend). Missing means iptables-legacy fails.
- CONFIG_NF_NAT → NAT support (MASQUERADE, DNAT). Without it, no port publishing.
- CONFIG_NF_CONNTRACK → stateful connection tracking. Without it, NAT only works for the first packet of a connection.
- CONFIG_BRIDGE_NETFILTER → lets iptables see bridged traffic. Without it, bridged containers bypass NAT entirely.
- CONFIG_CGROUPS → control group framework. Docker needs this to exist before it will start.
- CONFIG_CGROUP_DEVICE → device access control per container.
- CONFIG_CGROUP_NET_PRIO → network priority per cgroup.
- CONFIG_INET → basic IPv4 support. Catastrophic if missing.
- CONFIG_IPV6 → IPv6. Some tooling breaks without it even if you're not using IPv6 addresses.

Rebuilding the Firecracker guest kernel

You can't add these options without a recompile. The process:

- Clone the Linux source at the version Firecracker targets. Firecracker's repo documents this under resources/.
- Start from Firecracker's reference config (resources/guest_configs/microvm-kernel-x86_64-*.config).
- Enable the missing options via make menuconfig. Each one shows as y (built-in), m (loadable module), or not set.
- Compile: make vmlinux -j$(nproc).
- Replace the vmlinux in your boot config.

Each option you add increases the kernel binary size and extends the attack surface slightly. Add what you need, not everything. The reference config is minimal for a reason; your additions should be deliberate.
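Menuconfig works, but for a repeatable CI build it can be easier to flip the options non-interactively with the kernel tree's own scripts/config helper. This is a sketch of that approach; the reference-config path and version in the copy step are placeholders, and the option list is the one discussed above, not an exhaustive set.

```bash
# In the kernel source tree, starting from Firecracker's reference config
# (path below is a placeholder for wherever you cloned the Firecracker repo).
cp /path/to/firecracker/resources/guest_configs/microvm-kernel-x86_64-5.10.config .config

# Flip on the Docker-related options without opening menuconfig.
for opt in VIRTIO_NET BRIDGE VETH NETFILTER NF_TABLES IP_NF_IPTABLES \
           NF_NAT NF_CONNTRACK BRIDGE_NETFILTER CGROUPS CGROUP_DEVICE; do
    ./scripts/config --file .config --enable "CONFIG_${opt}"
done

# Resolve newly exposed dependencies with defaults, then build.
# Note: olddefconfig silently drops options whose dependencies aren't met,
# so re-check the resulting .config before building.
make olddefconfig
make vmlinux -j"$(nproc)"
```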
The guest boot contract: what PID 1 must do before anything else

On a full distro, systemd mounts pseudo-filesystems silently before any userspace process sees them: /proc, /sys, /sys/fs/cgroup, /dev, /run. You've never had to think about this because systemd handles it. In a microVM with a minimal shell script as PID 1, none of that happens. Not because of a bug. Because nobody asked it to. If your init script doesn't mount these paths, Docker will fail with errors that make no sense until you understand what's actually missing.

What each mount provides

- /proc gives the kernel parameter interface. Without it, sysctl doesn't work, and /proc/modules doesn't exist, so you can't inspect what the kernel has loaded.
- /sys exposes kernel subsystem knobs and the device tree. Without it, many kernel features are technically present but unreachable from userspace.
- /sys/fs/cgroup is where Docker creates per-container resource limits. The failure message is "Devices cgroup isn't mounted". When you see this, it doesn't mean Docker is broken. It means /sys/fs/cgroup doesn't exist as a mounted filesystem yet.
- /dev needs devtmpfs. Without it, device files don't exist and nothing that needs a device (including the GPU if you're going that route) will work.
- /run is where docker.sock lives. It needs to be a tmpfs. Without it, the socket path doesn't exist and Docker clients can't connect.
- /tmp is used by containerd for staging. It also needs to be a writable tmpfs.

A minimal correct init for Docker workloads

A minimal init script that does all of this is reproduced in the reference section at the end. Each mount in it has a specific purpose. The cgroup2 line assumes a modern kernel (5.8+) and a Docker version that supports cgroup v2. On older setups, you'd use -t cgroup -o cpuset,cpu,cpuacct,blkio,memory,devices,freezer to mount specific v1 hierarchies. The v1/v2 distinction matters: if your rootfs has software with hardcoded paths like /sys/fs/cgroup/memory/, it expects the v1 layout. Modern Docker and container runtimes work fine with v2.

If dockerd fails with cgroup errors after you've added the mounts above, check the kernel config first. CONFIG_CGROUPS must be present. CONFIG_CGROUP_DEVICE is what backs the Devices cgroup specifically.

Container networking inside the guest: bridges, veth, NAT

With the kernel options present and pseudo-filesystems mounted, Docker can start. But there are still several more layers between a container and the internet. The complete packet path from a container process to the outside world runs: container process → veth pair → docker0 bridge → guest routing, FORWARD, and MASQUERADE → virtio-net eth0 → TAP device on the host → host routing and MASQUERADE → physical NIC (the full diagram is in the reference section at the end). Each arrow is a different mechanism. Most debugging mistakes come from not knowing which mechanism you're looking at.

Network namespaces

A network namespace is not just "a different set of interfaces." It's a complete isolated copy of the kernel networking subsystem: its own routing table, its own iptables rules, its own conntrack table, its own socket table. When you run ip route inside a container, you're reading a routing table that is entirely separate from the guest's routing table, which is entirely separate from the host's. Docker creates one per container via clone(CLONE_NEWNET).

To inspect namespaces on a running system: ip netns list, or look at /proc/[pid]/ns/net for a specific process. nsenter -t [pid] -n drops you into a container's network namespace so you can run ip commands there directly.

veth pairs

A veth pair is a two-ended pipe in kernel memory. Packets written to one end come out the other with no copying. Docker creates a pair for each container: one end goes inside the container's network namespace, the other stays in the guest's namespace attached to docker0. This requires CONFIG_VETH. On minimal kernels, it's frequently absent. Without it, Docker will fail to create containers with an obscure error about failing to create a network endpoint.
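You can reproduce what Docker does here by hand, which is a useful way to convince yourself the kernel pieces exist before blaming Docker. This is a sketch with made-up names (demo, veth-host, veth-ctr) and addresses, not something Docker itself runs.

```bash
# Create a scratch namespace and a veth pair, then split the pair across namespaces.
ip netns add demo
ip link add veth-host type veth peer name veth-ctr
ip link set veth-ctr netns demo

# Address both ends and bring them up.
ip addr add 10.200.0.1/24 dev veth-host
ip link set veth-host up
ip netns exec demo ip addr add 10.200.0.2/24 dev veth-ctr
ip netns exec demo ip link set veth-ctr up
ip netns exec demo ip link set lo up

# If this ping works, namespaces and veth are functional in this kernel.
ip netns exec demo ping -c1 10.200.0.1

# Clean up.
ip netns del demo
```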
The docker0 bridge

docker0 is a Linux bridge, a Layer-2 virtual switch that maintains a MAC address table and forwards frames between attached interfaces. It also has an IP address (172.17.0.1 by default) that makes it the default gateway for containers.

CONFIG_BRIDGE is required. But CONFIG_BRIDGE_NETFILTER is the one that's easy to miss and hard to debug. This option lets iptables rules see traffic that is being bridged rather than routed. Without it, container-to-container traffic on the same bridge bypasses the FORWARD chain entirely. NAT doesn't fire. Inter-container connectivity breaks in confusing ways depending on whether the kernel happens to route a particular packet or bridge it.

NAT and port publishing

Containers have private IPs (172.17.x.x) that aren't routable on the internet. MASQUERADE rewrites the source IP on outbound packets to the guest's eth0 IP, and conntrack (covered in the next section) reverses the rewrite for replies. For port publishing (-p 5432:5432), Docker adds a DNAT rule in PREROUTING that rewrites the destination IP and port before the routing decision is made, plus a rule in the FORWARD chain that lets the rewritten packet through.

When things are working, Docker's NAT table shows a MASQUERADE rule for 172.17.0.0/16 in POSTROUTING and per-container DNAT rules in the DOCKER chain; the full listing is in the reference section at the end. If iptables -t nat -L DOCKER is empty or throws an error, you haven't solved the kernel config problem yet.

Netfilter internals: the framework behind all packet policy

![Netfilter internals](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zkd8o956pnfa0zkpdnng.png)

Framework vs tools

The most important thing to understand about Linux packet filtering: netfilter is the kernel hook system. iptables and nftables are userspace configuration tools. The actual packet interception and rule matching happens entirely in kernel code. Removing the iptables binary from your system does not disable packet filtering. It just means you can't configure it from userspace. Conversely, having the iptables binary present does nothing if the kernel doesn't have the right modules compiled in.

The 5 hooks

Netfilter places hooks at 5 points in the kernel's packet processing path:

- PREROUTING fires on every incoming packet, before routing decisions are made. This is where DNAT lives (rewrite the destination before the kernel decides where to send it).
- INPUT fires on packets destined for the local machine.
- FORWARD fires on packets being forwarded between interfaces. This is where Docker's container isolation rules live.
- OUTPUT fires on locally generated packets, after routing.
- POSTROUTING fires on all outgoing packets, after routing. This is where MASQUERADE lives.

Take a concrete packet: a container process connects to api.github.com:443. The packet starts in the container's net namespace, crosses the veth pair into the guest namespace, hits FORWARD (where Docker's rules say "this is allowed"), gets routed to eth0, hits POSTROUTING (where MASQUERADE rewrites the source IP), and exits through the virtio-net device to the host.

Tables and chains

Rules are organized into tables by purpose. filter handles allow/deny. nat handles address translation. mangle handles packet modification. Each table has its own set of chains for specific hooks.

Docker's DOCKER chain is a user-defined chain. Docker creates it, adds it as a jump target from FORWARD, and inserts per-container rules. When you add a container with a published port, Docker appends a rule to the DOCKER chain. This is why iptables -L FORWARD shows a DOCKER target even though you never created one.
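To make the chain structure concrete, here is approximately what Docker sets up for a single container published with -p 5432:5432. This is an illustrative sketch, not Docker's exact rule set: the container IP is assumed, and real Docker rules carry additional matches plus the isolation chains shown in the reference listing at the end.

```bash
# User-defined chain that Docker jumps to for published ports.
iptables -t nat -N DOCKER

# Outbound: rewrite container source IPs to the guest's IP when leaving docker0.
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

# Published port: send traffic aimed at local addresses through the DOCKER chain,
# then DNAT port 5432 to the container.
iptables -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 5432 \
         -j DNAT --to-destination 172.17.0.2:5432

# Let the rewritten packet through the FORWARD chain.
iptables -A FORWARD -d 172.17.0.2 -o docker0 -p tcp --dport 5432 -j ACCEPT
```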
conntrack

Stateless packet filtering can't handle NAT alone. If you MASQUERADE an outbound packet, the reply comes back with the host's IP as its destination. Something needs to know that this reply belongs to a connection that started from a container at 172.17.0.2. That something is conntrack.

Conntrack tracks both directions of a connection in a state machine (NEW, ESTABLISHED, RELATED, INVALID). When MASQUERADE fires on an outbound packet, conntrack records the original source IP and port alongside the translated version. When the reply arrives, conntrack recognizes it as ESTABLISHED, reverses the translation automatically, and delivers it to the container.

CONFIG_NF_CONNTRACK must be compiled in. Without it, NAT fires on the first packet but replies get dropped because the reverse mapping doesn't exist.

One production failure worth knowing: nf_conntrack: table full, dropping packet. This happens when a high-throughput workload (a database CI suite, a port scanner, whatever) fills the conntrack table. The table size is tunable: sysctl -w net.netfilter.nf_conntrack_max=131072. Check the current count with cat /proc/net/nf_conntrack | wc -l.

The x_tables / nf_tables split: where the Firecracker failure actually lives

This section is the root cause of the debugging case this post has been building toward. The Linux kernel has two separate packet filtering subsystems: x_tables (the traditional iptables backend, CONFIG_IP_NF_IPTABLES) and nf_tables (the modern successor, CONFIG_NF_TABLES). Rules written to one are invisible to the other. They have separate kernel data structures, separate Netlink socket types, and separate userspace tools.

/usr/sbin/iptables is just a symlink. On modern distros it points to iptables-nft, which talks to the nf_tables subsystem. On older distros, it points to iptables-legacy, which talks to x_tables. You can check: ls -la /usr/sbin/iptables, or iptables --version (the nft variant says "nf_tables", the legacy variant says "legacy").

Firecracker's reference kernels don't include CONFIG_NF_TABLES. The serverless function use case they were built for never needed it. So when Docker runs on a Firecracker guest with a stock reference kernel, it calls the iptables binary, which calls iptables-nft (because that's what modern distros default to), which opens a Netlink socket and sends a request to the nf_tables subsystem, which doesn't exist in the kernel. The kernel returns EPROTONOSUPPORT. Docker doesn't log this clearly. It logs something about failing to create a network or failing to set up iptables rules, which sounds like a permissions problem or a configuration problem. It's neither.

The fix: redirect to iptables-legacy (the update-alternatives commands are in the reference section at the end). This makes the iptables binary talk to x_tables instead, which Firecracker's kernel does have (CONFIG_IP_NF_IPTABLES). Docker starts creating rules. That part works now.

But this is a shim, not a solution. All the other missing kernel options (CONFIG_BRIDGE, CONFIG_VETH, CONFIG_NF_NAT, CONFIG_NF_CONNTRACK) are still missing. The iptables fix unblocks Docker startup, and then you start seeing the next layer of failures. This is the dependency chain mentioned at the start: fix one layer, the next problem surfaces. That's not bad luck. That's how the stack works.
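A quick way to see which side of the split you're on from inside the guest. This is a diagnostic sketch that assumes both the iptables-nft and iptables-legacy binaries exist in the rootfs; on an image that ships only one of them, the corresponding check simply fails.

```bash
# Which backend does the iptables symlink currently use?
iptables --version    # prints "(nf_tables)" or "(legacy)" after the version number

# Can each backend actually talk to its kernel subsystem?
if iptables-nft -L -n >/dev/null 2>&1; then
    echo "nf_tables backend works (kernel has CONFIG_NF_TABLES)"
else
    echo "nf_tables backend fails (kernel likely missing CONFIG_NF_TABLES)"
fi

if iptables-legacy -L -n >/dev/null 2>&1; then
    echo "x_tables backend works (kernel has CONFIG_IP_NF_IPTABLES)"
else
    echo "x_tables backend fails (kernel likely missing CONFIG_IP_NF_IPTABLES)"
fi
```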
Alternatives at each layer

Host-to-guest connectivity

TAP + masquerade (what we've covered) is the simplest option. The guest gets internet access but no presence on the host LAN.

TAP + bridge attaches the TAP interface to a host bridge, which means the guest and host share the same broadcast domain. The guest gets an IP on the host's LAN and can be reached by other machines. More complex, but necessary for some CI setups.

macvlan gives the guest its own MAC address on the physical NIC, so it appears as a separate machine on the network. This is useful when you want the guest to have a real LAN IP without bridge complexity.

SR-IOV bypasses the kernel network stack almost entirely using hardware-level virtual functions. Each guest gets direct access to a slice of the physical NIC. Latency is much lower, but it requires hardware support (NIC + BIOS + kernel driver) and isn't available on most cloud VMs.

Container networking inside the guest

--network host puts the container in the guest's network namespace directly. No bridge, no veth pair, no NAT. Whatever is listening on port 5432 in the container is immediately visible at 0.0.0.0:5432 on the guest's eth0. Good for testing and single-container setups. Bad for isolation.

CNI plugins are the Kubernetes approach. Flannel wraps container traffic in a UDP/VXLAN overlay. Calico routes using BGP and Linux kernel routing, no overlay needed. Cilium replaces the entire iptables stack with eBPF and gives you per-flow observability as a side effect. Cilium requires kernel 5.10+ and takes a while to understand, but it's the right long-term choice for any setup at scale.

Kata Containers runs one microVM per container, using containerd as the shim and a VMM (QEMU or Firecracker) as the backend. This is exactly what a manually configured fc-demo project builds, just automated. The performance and security tradeoff compared to runc containers is manageable on modern hardware.

Packet filtering

nftables native (not via the iptables compat shim) has cleaner syntax, set-based matching, and better performance at high rule counts. The syntax is more approachable once you get over the initial learning curve. It's the right choice for new deployments on kernels 5.2+.

eBPF/XDP attaches programmable bytecode directly to kernel hooks. The bytecode runs in a sandboxed VM inside the kernel. It can fully replace iptables for packet filtering and gives you programmable packet processing at line rate. It's more complex to write and debug than iptables rules, but the performance ceiling is much higher.

Debugging methodology: testing each layer in isolation

The most useful mental shift when debugging VM networking: each layer is independently verifiable. You don't need to guess whether the problem is Docker or the bridge or the kernel. You can test each one separately.

The layer-by-layer audit

Work top-down (start at the physical link, move up) or bottom-up (start at the app failure, work toward the hardware). Top-down finds root causes faster. Bottom-up is where most people start because that's where the error appears. The commands for each layer are collected in the reference section at the end.

- Layer 1: host-to-guest link.
- Layer 2: guest kernel network. If ping 8.8.8.8 works from the guest but containers can't reach the internet, the problem is in layer 4 or 5, not here.
- Layer 3: pseudo-filesystems and cgroups.
- Layer 4: Docker bridge and veth.
- Layer 5: netfilter rules. If Docker is running but the DOCKER and POSTROUTING chains are empty, that's the iptables-nft vs iptables-legacy problem. Switch to legacy, restart Docker.
- Layer 6: connection tracking. If conntrack isn't tracking, check CONFIG_NF_CONNTRACK. If the table is close to the limit, increase it.

The kernel config audit

Pull the running kernel's config and check for the specific options listed earlier (the grep one-liner is in the reference section at the end). Any "# CONFIG_X is not set" line means you need to rebuild the kernel to get that feature. There's no workaround. The feature doesn't exist.
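For repeated use in CI, the audit is easy to script. A minimal sketch, assuming the guest kernel was built with CONFIG_IKCONFIG_PROC so /proc/config.gz exists; the option list mirrors the config map above and can be trimmed or extended.

```bash
#!/bin/sh
# Report which Docker-related options are missing from the running kernel.
for opt in CONFIG_VIRTIO_NET CONFIG_BRIDGE CONFIG_VETH CONFIG_NETFILTER \
           CONFIG_NF_TABLES CONFIG_IP_NF_IPTABLES CONFIG_NF_NAT \
           CONFIG_NF_CONNTRACK CONFIG_BRIDGE_NETFILTER \
           CONFIG_CGROUPS CONFIG_CGROUP_DEVICE; do
    if zcat /proc/config.gz | grep -q "^${opt}=[ym]"; then
        echo "present: ${opt}"
    else
        echo "MISSING: ${opt}"
    fi
done
```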
Closing: the tradeoff made explicit

Firecracker's stripped kernel is not broken. It's correct for the workload it was designed for: short-lived, single-process, serverless functions that need fast startup and a minimal attack surface. Those functions don't run Docker. They don't need bridges or NAT or conntrack. The reference kernel reflects that.

When you run a different workload, you own the kernel configuration. The host OS, the VMM, and the guest OS are all your configuration surface now. This is more configuration surface than you're used to if you've only worked with containers on full VMs or bare metal. But it's not fundamentally harder. It's just a wider map.

The failure mode isn't "I don't know enough about networking." It's "I don't have a map of which layer I'm looking at." Hopefully you do now.


Reference: commands, configs, and listings

Host setup — IP forwarding and masquerade on the uplink:

```bash
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```

Checking what the guest kernel has:

```bash
# If the kernel was compiled with CONFIG_IKCONFIG_PROC:
zcat /proc/config.gz | grep CONFIG_VIRTIO_NET

# On a Debian/Ubuntu guest that has the config in /boot:
grep CONFIG_BRIDGE /boot/config-$(uname -r)

# Check what modules are loaded right now:
lsmod
```
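If the config isn't exposed at all, you can still probe whether a feature could be loaded as a module on this particular rootfs. A small sketch; modprobe's dry-run flag only tells you the module file is resolvable, not that the feature is built into the kernel.

```bash
# Dry-run: would modprobe be able to load these modules from this rootfs?
for mod in veth bridge nf_nat nf_conntrack br_netfilter; do
    if modprobe -n "$mod" 2>/dev/null; then
        echo "loadable:     $mod"
    else
        echo "not loadable: $mod (may still be built into the kernel as =y)"
    fi
done
```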
A minimal correct init for Docker workloads:

```sh
#!/bin/sh
set -e

mount -t proc none /proc
mount -t sysfs none /sys
mount -t devtmpfs none /dev
mount -t tmpfs none /run
mount -t tmpfs none /tmp

# cgroup v2 unified hierarchy (for kernels 5.8+ and modern Docker)
mount -t cgroup2 none /sys/fs/cgroup

# IP forwarding — needed if containers want to reach the internet
echo 1 > /proc/sys/net/ipv4/ip_forward

# configure the guest NIC
ip addr add 192.168.0.2/24 dev eth0
ip link set eth0 up
ip route add default via 192.168.0.1

exec /usr/bin/dockerd --host unix:///run/docker.sock
```

The packet path from a container process to the internet:

```
container process (172.17.0.2)
└─ eth0 (veth1, inside container net namespace)
   ↕ veth pair — kernel memory copy
vethXXXXXX (veth0, host end, in guest net namespace)
└─ docker0 bridge (172.17.0.1, Layer-2 switch)
   └─ routing + netfilter FORWARD + MASQUERADE
      └─ eth0 (virtio-net, guest uplink)
         ↕ virtio ring buffer
TAP device (host kernel)
└─ host routing + host MASQUERADE
   └─ host physical NIC → internet
```
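If you want to watch a packet actually take this path, tcpdump at each hop shows where it disappears. A sketch assuming tcpdump is installed and the interface names used elsewhere in this post (docker0 and eth0 in the guest, tap0 on the host); run each capture in its own terminal while the container makes a request.

```bash
# Inside the guest: does the traffic leave the bridge?
tcpdump -ni docker0 'tcp port 443'

# Inside the guest: does it reach the uplink with the guest's source IP (MASQUERADE applied)?
tcpdump -ni eth0 'tcp port 443'

# On the host: does it arrive from the VM at all?
tcpdump -ni tap0 'tcp port 443'
```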
Docker's iptables rules when container networking is working:

```
# NAT table
$ iptables -t nat -L POSTROUTING --line-numbers
Chain POSTROUTING (policy ACCEPT)
num  target      prot opt source           destination
1    MASQUERADE  all  --  172.17.0.0/16   !172.17.0.0/16

$ iptables -t nat -L DOCKER --line-numbers
Chain DOCKER (2 references)
num  target  prot opt source    destination
1    RETURN  all  --  anywhere  anywhere
2    DNAT    tcp  --  anywhere  anywhere     tcp dpt:5432 to:172.17.0.2:5432

# Filter table
$ iptables -L FORWARD --line-numbers
Chain FORWARD (policy DROP)
num  target                    prot opt source    destination
1    DOCKER-USER               all  --  anywhere  anywhere
2    DOCKER-ISOLATION-STAGE-1  all  --  anywhere  anywhere
3    ACCEPT                    all  --  anywhere  anywhere       ctstate RELATED,ESTABLISHED
4    DOCKER                    all  --  anywhere  anywhere
5    ACCEPT                    all  --  anywhere  172.17.0.0/16
```

Switching the iptables binary to the legacy (x_tables) backend:

```bash
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
```

Layer 1 — host-to-guest link:

```bash
# On the host
ip link show tap0        # is the TAP up?
ip addr show tap0        # does it have an IP?
ping [guest IP] -c3      # can the host reach the guest?
```

Layer 2 — guest kernel network:

```bash
# Inside the guest
ip link show eth0                    # is virtio-net up?
ip addr show eth0                    # does it have an IP?
ip route show                        # is there a default route?
ping 8.8.8.8 -c3                     # can the guest reach the internet?
cat /proc/sys/net/ipv4/ip_forward    # should be 1
```
Layer 3 — pseudo-filesystems and cgroups:

```bash
mount | grep cgroup            # is the cgroup filesystem mounted?
ls /sys/fs/cgroup              # can Docker see the cgroup hierarchy?
mount | grep "tmpfs on /run"   # is /run a tmpfs?
ls /run/docker.sock            # does the socket exist?
```

Layer 4 — Docker bridge and veth:

```bash
ip link show docker0             # does the bridge exist?
bridge link show                 # are any veth peers attached?
ip addr show docker0             # is 172.17.0.1 assigned?
docker network inspect bridge    # what does Docker think is happening?
```

Layer 5 — netfilter rules:

```bash
iptables --version                   # nf_tables or legacy?
iptables -t nat -L DOCKER            # are DNAT rules present?
iptables -t nat -L POSTROUTING       # is MASQUERADE present?
iptables -L FORWARD | grep DOCKER    # are forward rules present?
```

Layer 6 — connection tracking:

```bash
conntrack -L 2>/dev/null | head -20      # are connections being tracked?
cat /proc/net/nf_conntrack | wc -l       # how many entries in the table?
sysctl net.netfilter.nf_conntrack_max    # what's the limit?
```
The kernel config audit:

```bash
# If /proc/config.gz exists:
zcat /proc/config.gz | grep -E "CONFIG_(BRIDGE|VETH|NF_TABLES|NF_NAT|NF_CONNTRACK|BRIDGE_NETFILTER|VIRTIO_NET)"

# Expected output for a Docker-capable kernel:
CONFIG_BRIDGE=y
CONFIG_VETH=y
CONFIG_NF_TABLES=y    # or m, if the module is loadable
CONFIG_NF_NAT=y
CONFIG_NF_CONNTRACK=y
CONFIG_BRIDGE_NETFILTER=y
CONFIG_VIRTIO_NET=y
```