Network Troubleshooting Runbook for DigitalOcean VPCs
Source: DigitalOcean

Network incidents on a DigitalOcean VPC get resolved faster when every responder follows the same diagnostic sequence instead of improvising. This tutorial walks through building a reusable runbook for common VPC failures using ping, traceroute, tcpdump, nmap, dig, nslookup, nmcli, ss, and curl, then shows how to turn the most repeatable checks into automated DigitalOcean Functions.

A runbook removes the guesswork during incidents. Every responder starts at the same layer, runs the same commands, and produces the same evidence. That consistency matters most during critical incidents, when a responder needs to isolate a Cloud Firewall block from an in-Droplet firewall block before escalating.

In this tutorial, you will learn how to diagnose VPC connectivity failures, distinguish between Cloud Firewall and in-Droplet firewall blocks, and build a structured network troubleshooting runbook.

Before following this tutorial, you need at least two Droplets in the same DigitalOcean VPC, sudo access on each Droplet, doctl installed and authenticated, and tcpdump and nmap installed. Confirm the install with tcpdump --version and nmap --version. Both should print a version string rather than command not found.

DigitalOcean VPC troubleshooting gets faster when you can separate packet path controls, interface scope, and health-check flow before inspecting any logs. Cloud Firewalls filter traffic at the hypervisor layer before packets reach the Droplet interface. In-Droplet tools such as ufw, iptables, or nftables run inside the operating system after the packet arrives. This distinction determines your entire diagnostic path: traffic blocked by a Cloud Firewall never appears in a tcpdump capture inside the Droplet, while traffic blocked in-Droplet reaches the interface but gets no service response. Both layers can be active simultaneously, and each enforces its rules independently.

Use the following steps to check and modify Cloud Firewall rules. First, list all inbound rules on a Cloud Firewall using doctl and copy the ID from the first column of the firewall you want to inspect. The next command uses this UUID to fetch the complete rule set. To add a missing inbound rule that allows TCP traffic on port 5432 from a specific Droplet, substitute <firewall-id> with the ID returned by doctl compute firewall list and <droplet-id> with the source Droplet's ID.
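A minimal sketch of that doctl sequence follows, using <firewall-id> and <droplet-id> as placeholders; verify the rule-string format against doctl compute firewall add-rules --help for your doctl version:

```bash
# List all Cloud Firewalls; copy the ID from the first column
doctl compute firewall list

# Fetch the complete rule set for one firewall
doctl compute firewall get <firewall-id>

# Add an inbound rule allowing TCP 5432 from a specific source Droplet
doctl compute firewall add-rules <firewall-id> \
  --inbound-rules "protocol:tcp,ports:5432,droplet_id:<droplet-id>"
```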
You can also manage rules in the DigitalOcean control panel under Networking > Firewalls. Select the firewall, go to the Inbound Rules tab, and add the required protocol, port range, and source.

Note: Cloud Firewall rule changes take effect without requiring a Droplet or service restart.

Use the following steps to check and modify in-Droplet firewall rules. If tcpdump confirms packets are arriving but the service is not responding, check ufw status and active rules. In the output, check the To and Action columns: any port not listed there is blocked by default, so if the port your service uses is absent, add it. If ufw is not in use, check the raw iptables rules instead. In a typical listing, ACCEPT rules for ports 22 and 80 let those services through, while a DROP rule on port 5432 blocks PostgreSQL traffic before it reaches the application. A DROP rule with no preceding ACCEPT for your service port means the in-Droplet firewall is blocking traffic before it reaches the application. Both checks are sketched below.
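A minimal sketch of both in-Droplet checks, using the PostgreSQL port from this section as the example:

```bash
# Show ufw state and active rules; check the To and Action columns
sudo ufw status verbose

# Allow the service port if it is absent from the rule list
sudo ufw allow 5432/tcp

# If ufw is not in use, inspect raw iptables rules on the INPUT chain
sudo iptables -L INPUT -n -v --line-numbers
```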
VPC traffic traverses a private interface, typically eth1 or, on newer networkd-based images, ens4. Capturing on the wrong interface produces false negatives, so always confirm interface naming before running any packet capture. Run ip link show to list all network interfaces, identify the VPC private interface by name, and then confirm the private IP assigned to it.
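A sketch of the interface checks, assuming the common eth1 naming and this tutorial's sample private IP (10.116.0.22); substitute ens4 on networkd-based images:

```bash
# List all interfaces; the VPC private interface is typically eth1 (or ens4)
ip link show

# Confirm the private IP assigned to the VPC interface
ip -4 addr show eth1

# Alternative: ask the routing table which interface reaches another Droplet's private IP
ip route get 10.116.0.22
```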
On standard Ubuntu Droplet images, eth0 is the public interface and eth1 is the VPC private interface; eth1 is the name to use for all subsequent tcpdump, nmap, and ss commands in this tutorial. If your Droplet uses networkd-based naming, you will see ens3 and ens4 instead.

Load Balancer availability depends on successful health checks to backend Droplets. A backend fails health checks when the process is bound to the wrong interface, when local firewall rules block probe traffic, or when the health check path returns a non-2xx status code. The diagnostic steps for this failure mode are covered in Step 5.

Map each tool to a troubleshooting layer and use the tools in a fixed order, from the bottom of the network stack upward. Start by confirming path existence and loss profile, because reachability failures make later service-level checks irrelevant until routing is fixed. If ping fails to a VPC-private IP, verify that your Cloud Firewall inbound rules permit ICMP traffic from the source Droplet's IP or tag: ICMP is not allowed by default unless explicitly added as an inbound rule.

If packet loss is high or complete, run traceroute to identify where traffic diverges. In a healthy VPC, traffic between Droplets arrives in one hop, though traceroute output can vary depending on ICMP handling along the path. Multiple hops or repeated * * * responses usually mean ICMP is being dropped along the path, which inside a VPC points to Cloud Firewall ICMP rules more often than to an actual routing fault.

For hosts using NetworkManager, verify the active profile, interface state, and DNS settings in one command. If eth1 shows a state other than connected, the VPC private interface is not active. This typically means the Droplet was not created with a VPC assigned or the interface was manually disabled inside the OS.

When ping passes but a service still fails, tcpdump is the next stop. Packet capture answers two questions cleanly: did the traffic reach the destination interface, and did the TCP handshake actually complete. Start with a single-interface capture on the VPC interface; the -ni flag combination disables reverse DNS lookups and sets the interface. TCP flags to recognize in the output: S (SYN), S. (SYN-ACK), . (ACK), F (FIN), and R (RST).

Use narrower filters to reduce noise during incidents. Filter by source IP when you want to isolate traffic from one specific Droplet. Filter by port when you want to see all traffic on a specific service, regardless of source. Combine both filters to narrow the capture to a specific source and port pair. Save captures to a PCAP file for later comparison or Wireshark analysis.

Measure SYN to SYN-ACK handshake latency directly from capture output. The -tt flag outputs raw Unix timestamps, which the awk script uses to calculate the time between a SYN packet and its matching SYN-ACK. To avoid mixing up concurrent connections to the same host, the script uses both TCP endpoints as the lookup key: $3 and $5 for the SYN, and the reversed pair for the SYN-ACK. The gsub calls strip trailing colons from the address fields, which older tcpdump versions (4.9.x) append and newer versions (4.99.x) do not, so the same script works across both. The reachability checks, capture filters, and latency measurement are sketched in the three blocks that follow.
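First, a minimal sketch of the path checks, again using 10.116.0.22 as a placeholder destination:

```bash
# Baseline reachability and loss profile to the destination's private IP
ping -c 4 10.116.0.22

# Sustained loss profile for intermittent failures (the count is illustrative)
ping -c 100 10.116.0.22

# Identify where traffic diverges when loss is high or complete
traceroute 10.116.0.22

# NetworkManager hosts: profile, interface state, and DNS in one command
nmcli device show eth1
```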
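Next, the capture variants described above, assuming eth1 as the VPC interface and 10.116.0.12 as the source Droplet from this tutorial's examples:

```bash
# Base capture on the VPC interface: -n disables reverse DNS lookups, -i sets the interface
sudo tcpdump -ni eth1

# Isolate traffic from one specific source Droplet
sudo tcpdump -ni eth1 src 10.116.0.12

# All traffic on a specific service port, regardless of source
sudo tcpdump -ni eth1 port 443

# Combine both filters: one source, one port
sudo tcpdump -ni eth1 src 10.116.0.12 and port 443

# Save the capture to a PCAP file for later comparison or Wireshark analysis
sudo tcpdump -ni eth1 -w incident.pcap port 443
```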
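Finally, a sketch of the handshake-latency measurement, reconstructed from the description above and assuming port 443; treat it as a starting point rather than a drop-in script:

```bash
# -tt prints raw Unix timestamps, -n disables DNS lookups, -l line-buffers for the pipe.
# The BPF filter matches any packet with the SYN bit set, covering SYN and SYN-ACK.
sudo tcpdump -ttnl -i eth1 'tcp port 443 and tcp[tcpflags] & tcp-syn != 0' |
awk '
{
    src = $3; dst = $5
    # Strip trailing colons so keys match across tcpdump 4.9.x and 4.99.x output
    gsub(/:$/, "", src); gsub(/:$/, "", dst)
    if ($0 ~ /Flags \[S\],/) {
        # SYN: key on the src|dst endpoint pair
        syn[src "|" dst] = $1
    } else if ($0 ~ /Flags \[S\.\],/) {
        # SYN-ACK: the endpoints are reversed relative to the SYN
        key = dst "|" src
        if (key in syn) {
            printf "%.3f ms  %s\n", ($1 - syn[key]) * 1000, key
            delete syn[key]
        }
    }
}'
```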
Warning: PCAP files can contain sensitive data, including internal IPs, hostnames, and application payload metadata. Store PCAP files in restricted locations and delete them after incident closure according to your data retention policy.

nmap answers one question that matters during incidents: is this port open, filtered, or closed. Each answer points to a different fix, so the state matters more than the raw output. Run a targeted TCP port check first. The -Pn flag skips host discovery, which you always want inside a VPC because Cloud Firewalls block ICMP by default, and ICMP-based host discovery would falsely mark every Droplet as down. Interpret port states consistently and use each state to drive the next action: open means the service answered, so move the investigation up the stack; filtered means a firewall dropped the probe, so inspect Cloud Firewall and in-Droplet rules; closed means the host answered but nothing is listening, so check service startup and bind address configuration.

Run a UDP scan when DNS or other UDP-based protocols are involved. UDP ports return open|filtered when nmap cannot confirm the state because no ICMP unreachable response was received. This is expected behavior for UDP. Treat open|filtered as potentially open and verify at the application layer.

Run service version detection to confirm the expected daemon is answering. Version detection catches two common production incidents: a deploy that failed to roll forward (the running version will not match the expected release), and a Droplet that was reverted to a prior image after a rollback. If the version shown does not match what you deployed, the wrong daemon is answering the port. If a port shows closed here but appeared open in an earlier scan, the service stopped between the two scans.

Run a subnet ping scan to discover all live hosts in the VPC. If a Droplet you expect to find does not appear in the results, verify it is powered on and assigned to the same VPC. A missing host here points to a Droplet state or VPC assignment issue, not a firewall rule. All four scans are sketched below.
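A sketch of the four scans, using this tutorial's sample addresses; the /20 subnet is an assumption, so substitute your VPC's range:

```bash
# Targeted TCP check; -Pn skips ICMP host discovery, which Cloud Firewalls block by default
nmap -Pn -p 5432 10.116.0.22

# UDP scan for DNS or other UDP-based protocols (root required for -sU)
sudo nmap -Pn -sU -p 53 10.116.0.22

# Service version detection: confirm the expected daemon is answering
nmap -Pn -sV -p 5432 10.116.0.22

# Subnet ping scan to discover all live hosts in the VPC
nmap -sn 10.116.0.0/20
```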
Warning: Only run nmap against infrastructure you own and have explicit permission to scan. Running aggressive scans against external hosts may violate DigitalOcean's Terms of Service and applicable laws.

A service that answers on its IP address but fails by hostname is a DNS issue, not a network issue, and chasing it as a network problem wastes time. The sequence below isolates whether the resolver is wrong, whether the query is leaving the Droplet, and whether the two DigitalOcean resolvers agree on the answer.

Check resolver configuration first. If both DigitalOcean resolver IPs (67.207.67.2 and 67.207.67.3) appear in /etc/resolv.conf, resolver configuration is likely correct for a default DigitalOcean setup. On Ubuntu 24.04 Droplets using systemd-resolved, /etc/resolv.conf points to the local stub resolver at 127.0.0.53 instead, so confirm the actual upstream resolvers with resolvectl. If the upstream resolvers shown are not the DigitalOcean IPs, the Droplet is using a custom or overridden DNS configuration.

Query the primary resolver directly; the +noall +answer flags suppress the header and footer sections so only the answer record is shown, which is easier to compare across resolvers during drift detection. Compare against the secondary resolver to detect resolver drift. Run an iterative trace when delegation issues or stale records are suspected, and use nslookup for a quick second-opinion query.

Cross-check DNS egress by capturing port 53 traffic while running a dig query in a second terminal. Each packet line encodes a timestamp, source IP and port, destination IP and port, query ID, query type, and payload size. In this section's worked example, 10.116.0.12 sends query ID 61545 for an A record on api.internal.example to 67.207.67.2 on port 53, and the resolver replies 0.5 ms later with one answer (1/0/0) resolving to 10.116.0.22, confirming both outbound reachability to the resolver and a successful response. If query packets are leaving the interface and responses are returning, the resolver is reachable and responding. If no packets appear at all, first verify you are capturing on the correct interface; if the interface is correct and queries still never appear, the query is not leaving the host, and the issue lies in interface state, routing, or resolver reachability rather than the DNS record itself.

If one Droplet resolves and another fails, compare cat /etc/resolv.conf, nmcli device show eth1, and direct dig @resolver queries side by side to isolate whether the issue is per-Droplet configuration or upstream. The full sequence is sketched below.
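A sketch of the DNS sequence, using api.internal.example from this section's worked example and example.com as a stand-in public domain for the trace:

```bash
# Inspect configured resolvers (may show the 127.0.0.53 stub on systemd-resolved images)
cat /etc/resolv.conf

# Show the actual upstream resolvers behind the stub
resolvectl status

# Query the primary and secondary DigitalOcean resolvers directly;
# +noall +answer prints only the answer records for easy comparison
dig @67.207.67.2 api.internal.example +noall +answer
dig @67.207.67.3 api.internal.example +noall +answer

# Iterative trace for delegation issues on public domains
dig +trace example.com

# Quick second-opinion query
nslookup api.internal.example

# In a second terminal: confirm queries leave the host and replies return.
# Capturing on "any" avoids guessing the egress interface.
sudo tcpdump -ni any port 53
```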
A Load Balancer returning 502 or 503 almost always points to one of three things: the backend process is not listening on the health check port, the health path is returning a non-200 status code, or a Cloud Firewall rule is dropping probe traffic from the Load Balancer's private IP. Work through the checks below in that order before changing anything, because changing configuration without knowing which of the three is at fault usually makes the problem harder to pin down.

Note: To find your Load Balancer's private IP, navigate to Networking > Load Balancers in the DigitalOcean control panel, select the Load Balancer, and check the Settings tab for the private IP address assigned to your VPC.

First, test frontend reachability from within the VPC. Second, validate backend listener state on each Droplet: if the port does not appear in the LISTEN state, the application is not running or is bound to 127.0.0.1 only, which is invisible to the Load Balancer probing over the VPC private interface. Third, inspect the health endpoint response on the backend directly. Fourth, capture health check probes arriving from the Load Balancer. All four checks are sketched below.
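A sketch of the four checks; the health check port (8080) and path (/health) are assumptions, so substitute your configured values, along with your Load Balancer's private IP (10.116.0.50 here) and backend IP:

```bash
# 1. Frontend reachability from within the VPC
curl -sS -o /dev/null -w '%{http_code}\n' http://10.116.0.50/

# 2. Backend listener state; the health check port must be LISTEN
#    and bound to 0.0.0.0 or the VPC IP, not 127.0.0.1
sudo ss -tlnp | grep ':8080'

# 3. Health endpoint response directly on the backend
curl -i http://10.116.0.22:8080/health

# 4. Capture health check probes arriving from the Load Balancer's private IP
sudo tcpdump -ni eth1 src 10.116.0.50 and port 8080
```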
A healthy probe capture shows a complete TCP three-way handshake: 10.116.0.50 (the Load Balancer) sends SYN (flag S), the backend 10.116.0.22 responds with SYN-ACK (flag S.), and the Load Balancer acknowledges with ACK (flag .). Seeing all three in sequence confirms the probe reaches the backend and the backend responds at the TCP layer. If you see the S flag but no S., the backend is not accepting the connection despite packets arriving. If probe packets are absent, inspect Cloud Firewall rules for the Load Balancer source IP. If probes are present but the backend is still marked unhealthy, inspect in-Droplet firewall policy and the health path response code. Two additional edge cases are worth checking when probes are present and the health endpoint returns 200 locally but the backend remains unhealthy in the control panel.

The following runbook sequences the diagnostic steps above into a structured, repeatable incident response procedure. Each entry is organized by observable symptom so operators can locate the correct procedure quickly during active incidents.

Symptom: Droplet A cannot connect to Droplet B's private IP on a known service port. 1. Confirm the interface and private IP on the destination Droplet. 2. Test baseline reachability from the source Droplet. 3. Confirm a listener exists on the destination for the expected port. 4. Capture expected traffic on the destination interface while retrying the connection from the source Droplet. 5. Classify the root cause based on the capture evidence.

Symptom: Client requests to the Load Balancer return 502 or 503, or the target pool reports unhealthy backends. 1. Confirm the Load Balancer frontend path is responding. 2. Confirm the backend process is listening on the health check port. 3. Verify the health path returns a 2xx response on the backend directly. 4. Capture health check probes from the Load Balancer IP. 5. Classify the root cause based on the capture evidence.

Symptom: Application reports host resolution failures for internal or external names. Reference: for detailed output interpretation and resolver drift diagnosis, see Step 4 - Diagnose DNS with dig and nslookup. 1. Inspect configured resolvers on the failing Droplet. 2. Query the DigitalOcean resolvers directly to confirm they are reachable and returning answers. 3. Run a trace for delegation issues on public domains. 4. Validate that DNS packets leave the host and receive replies on the active egress path. 5. Classify the root cause based on the results.

Symptom: A service is reachable most of the time but shows intermittent connection failures, sporadic timeouts, or latency spikes that fall outside the normal baseline. 1. Measure sustained reachability and loss profile over a longer interval than a default ping run. 2. Capture TCP handshake traffic for the affected service while the incident is active. 3. Measure handshake latency from the live capture to quantify the spike. 4. Check socket-level drop counters and retransmit counts on the destination. 5. Classify the root cause based on the measurements.

Runbook steps that have explicit inputs, deterministic commands, and structured outputs are candidates for automation with DigitalOcean Functions. DigitalOcean Functions is the serverless layer used by function routing in DigitalOcean AI Platform agents. Routed functions must be web functions, return output in a body key, and use an input schema aligned with OpenAPI 3.0 when registered through the control panel or API. Review the official reference before implementation: Route Functions in Agents. This pattern mirrors the approach described in Modal's autoscaling autoresearch post, where deterministic, repeatable operations become callable units that an agent can orchestrate. The runbook entries in this tutorial are that foundation: each step is scoped, produces structured output, and maps directly to a function an agent can call.

An example function following this pattern accepts a target IP and port, runs an nmap scan, and returns the port state in the format required by AI Platform function routing; a representative response might carry the state under the body key, for example {"body": {"target": "10.116.0.22", "port": 443, "state": "filtered"}}. Once deployed and registered as a function route in an AI Platform agent, the agent can call this diagnostic automatically in response to a message such as "port 443 on 10.116.0.22 is not responding," then use the structured output to route the response, escalate to a human, or trigger a follow-up check. If you are integrating with agent logic in Python, you can use the AI Python SDK.

Note: For DigitalOcean AI Platform function routing, the function must be a DigitalOcean web function and must return output in a body key. Your route definition must follow the schema requirements in the Route Functions in Agents documentation.

What is a network troubleshooting runbook? A network troubleshooting runbook is a documented sequence of diagnostic steps for specific symptoms, including commands, expected outputs, interpretation rules, and escalation paths. It standardizes incident response and ensures consistent diagnostic evidence regardless of which team member responds.

How do I use tcpdump to debug network issues inside a DigitalOcean VPC? Run sudo tcpdump -ni eth1 on the target Droplet, substituting eth1 for your VPC private interface name. Use filters such as port 443 or src 10.116.0.12 and port 443 to narrow the capture to relevant traffic. If no packets from the expected source appear in the capture, the block is upstream at the Cloud Firewall level. If packets are present but sessions fail, investigate in-Droplet firewall policy, service binding, and application behavior.

What is the difference between DigitalOcean Cloud Firewalls and in-Droplet firewalls like ufw? Cloud Firewalls filter traffic at the hypervisor level before it reaches the Droplet's network interface, so traffic blocked by a Cloud Firewall will not appear in tcpdump captures inside the Droplet at all. ufw and iptables run inside the OS after the packet has already been delivered to the interface. Both layers can be active simultaneously, and each enforces its rules independently. Absent packets in tcpdump point to a Cloud Firewall block; present packets without a service response point to an in-Droplet firewall or application issue.

How can nmap help troubleshoot DigitalOcean Load Balancer and VPC connectivity problems? nmap classifies ports as open, filtered, or closed, which maps directly to the next action during incident response. Run nmap -Pn -p <port> <target-ip> from a Droplet inside the same VPC to test reachability toward a backend Droplet or the Load Balancer IP. A filtered result moves investigation to Cloud Firewall rules; a closed result moves investigation to service startup and bind address configuration.

Can these runbook steps be automated? Yes. Steps that accept structured inputs such as an IP address, port, or hostname, and return structured outputs such as port state, DNS result, or packet count, are candidates for DigitalOcean Functions. Those functions can then be registered as function routes in an AI Platform agent, enabling the agent to execute diagnostic steps automatically in response to a symptom description. For the function requirements and schema format, see the Route Functions in Agents documentation.

How do I know which interface is my VPC interface without guessing between eth1 and ens4? Run ip -4 addr show to list all interfaces with their private IP addresses, and match your VPC IP to the interface name shown above it (for example, eth1 or ens4). That interface is your VPC interface and should be used for tcpdump, ss, and other diagnostics. Alternatively, ip route get <droplet-private-ip> shows which interface the system uses to reach another Droplet in the VPC.

What do I do if my Droplet was created before the current VPC was configured? Droplets created before a VPC was assigned will not automatically join that VPC, so reattach the Droplet to the VPC. After reattachment, verify the interface appears with a private IP using ip -4 addr show. If no private interface appears, the Droplet was not successfully attached to the VPC.

How do I correlate a tcpdump timestamp with an application log? Use tcpdump with high-precision timestamps and align them with your application logs. The -tttt flag prints human-readable timestamps with date and time. Compare these timestamps directly with your application logs to match events such as incoming requests, connection attempts, or timeouts; during incidents, this comparison confirms whether delays occur at the network layer (packet arrival) or inside the application (processing latency).

This tutorial covered building an incident-oriented workflow for troubleshooting DigitalOcean VPCs, from reachability and packet capture to DNS validation, Load Balancer diagnostics, and symptom-based runbook entries organized for fast incident response. You can now troubleshoot Droplet connectivity failures, identify Cloud Firewall versus in-Droplet firewall behavior, diagnose DNS resolution issues, and automate repeatable checks as DigitalOcean Functions for DigitalOcean AI Platform agent routes.

For next steps, deepen routing analysis with the VPC peering technical deep dive, then extend your validation workflow with the guide on testing firewall configurations with nmap and tcpdump.