Building a Real-Time DDoS Detection Engine from Scratch: HNG DevOps Stage 3 (2026)
A Quick Recap
The Task
What the Project Does and Why It Matters
Step 1: Setting Up the Server
Step 2: Setting Up the DuckDNS Subdomain
Step 3: Writing the Code
Step 4: Deploying on the Server
Step 5: The First Problem — psutil Failed to Build
Step 6: The Second Problem — Nextcloud Architecture Mismatch
Step 7: Verifying the Stack
Step 8: The Third Problem — Slack Webhook Returning 404
Step 9: The Fourth Problem — iptables Inside Docker
Step 10: Testing Everything End to End
How the Detection Works (For Beginners)
The Sliding Window
The Rolling Baseline
How Detection Makes a Decision
iptables Blocking
The Live Dashboard
Final Verification
The Big Picture

This is part of my HNG DevOps internship series. Follow along as I document every stage.
Previous articles:
Stage 0: How I Secured a Linux Server from Scratch
Stage 1: Build, Deploy and Reverse Proxy a Rust API
Stage 2: Containerizing a Microservices App with Docker and CI/CD

A Quick Recap
Stage 0 was server hardening. Stage 1 was deploying an API. Stage 2 was containerization and CI/CD. Stage 3 is something different entirely.

The Task
This time the task was to build a security tool from scratch. The scenario: I have been hired as a DevSecOps engineer at a cloud storage company running Nextcloud. After a wave of suspicious traffic, my job is to build a daemon that watches every HTTP request in real time, learns what normal looks like, and automatically blocks attackers when traffic goes abnormal. No Fail2Ban. No rate-limiting libraries. Build it yourself.

The repository is here: https://github.com/GideonBature/hng-stage3

Here is a summary of what needed to be built:

What the Project Does and Why It Matters
Every public web server on the internet gets attacked. Sometimes it is a single IP sending thousands of requests per second trying to overwhelm your server. Sometimes it is a distributed flood from many IPs at once. Both are called Denial of Service attacks, and they can take a real service offline completely.

Traditional tools scan log files periodically, which introduces a delay. What I built here reads the log file line by line in real time as Nginx writes it, makes a decision within one second, and blocks the offending IP at the firewall level before the attack can do serious damage.

The full data flow looks like this:

Step 1: Setting Up the Server
I reused the same Oracle Cloud server from Stages 0, 1, and 2. It runs Ubuntu 24.04 on ARM64 (Ampere A1) with 4 OCPUs and 23GB RAM, well above the 2 vCPU and 2GB minimum required.

Docker was already installed from Stage 2. I confirmed it was working:

I also needed to open port 8080 for the detector dashboard, since the previous stages only had 80 and 443 open:

I also added port 8080 in Oracle Cloud's Security List, the same way I added ports 80 and 443 back in Stage 0.

Step 2: Setting Up the DuckDNS Subdomain
The task required the dashboard to be served at a domain or subdomain. I already had gideonbature.duckdns.org from Stage 1, pointing to my server IP 92.5.80.18. I went to duckdns.org, logged in, and created a second entry:

After saving, I verified the DNS was resolving correctly:

Step 3: Writing the Code
I built the detector locally on my Mac and organised it into seven separate modules, each with a single responsibility:

I also wrote the Nginx configuration and Docker Compose file to wire everything together. Once everything was ready, I pushed it to GitHub:

One issue came up immediately: GitHub's secret scanning blocked the push because my Slack webhook URL was hardcoded in config.yaml. The fix was to use an environment variable placeholder instead:

And resolve it at runtime in main.py:
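Conceptually, that resolution is just a few lines: read config.yaml as text, expand the ${VAR} placeholder from the environment, then parse the YAML. A simplified sketch of the idea (the placeholder name and config keys here are illustrative, not the exact ones from the repo):

```python
# Sketch: expand ${SLACK_WEBHOOK_URL}-style placeholders when loading config.yaml.
# Key names below are illustrative.
import os
import yaml

def load_config(path="config.yaml"):
    with open(path) as f:
        raw = f.read()
    # Replace ${VAR} placeholders with values from the environment
    expanded = os.path.expandvars(raw)
    return yaml.safe_load(expanded)

config = load_config()
webhook_url = config["alerts"]["slack_webhook_url"]
```

With this in place, the real webhook URL is supplied through the environment rather than committed to the repository.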
After removing the hardcoded secret, the push went through.

Step 4: Deploying on the Server
I cloned the repository on the server:

Then created the .env file with my real values:

Then brought the stack up:

Step 5: The First Problem — psutil Failed to Build
The first build attempt failed with:

psutil is a Python library for reading CPU and memory usage. It contains native C code that needs to be compiled. My Dockerfile was using python:3.11-alpine, and Alpine is a minimal Linux image that does not include build tools by default. The fix was to add the required build dependencies to the Dockerfile:

After adding those four packages, the build succeeded.

Step 6: The Second Problem — Nextcloud Architecture Mismatch
After the build succeeded, I ran docker compose up -d and saw this warning:

The task specified using the image kefaslungu/hng-nextcloud, but that image was built only for AMD64. My Oracle Cloud server runs on ARM64. Docker warned about the mismatch, and then Nextcloud started crashing repeatedly with:

This is a binary incompatibility. The AMD64 binary simply cannot execute on an ARM64 processor. Because Nextcloud was crashing, Nginx could not resolve the nextcloud hostname in its config and also crashed:

And because Nginx was down, the detector dashboard was also unreachable, even though the detector itself was running fine. The fix was to replace the image with the official Nextcloud image, which supports multiple architectures including ARM64:

After this change, Nextcloud started correctly, Nginx resolved nextcloud successfully, and the dashboard became accessible.

Step 7: Verifying the Stack
With all three containers running, I verified each piece:

The Nextcloud check returned a 200 OK with Nextcloud headers. The dashboard check returned 200 OK showing the live metrics page. I triggered a test flood to confirm bans were working:

The dashboard showed a ban fired correctly. But I never received a Slack notification.

Step 8: The Third Problem — Slack Webhook Returning 404
Checking the detector logs revealed:

The webhook URL was being rejected by Slack. This happened because I had initially hardcoded the URL in config.yaml, generated a new webhook URL after the GitHub push was blocked, but only updated the .env file. The config.yaml inside the container still had the old, expired URL hardcoded. The fix was to update config.yaml to use the environment variable placeholder:

And rebuild the detector container:

After this, the next ban produced a proper Slack notification immediately.

Step 9: The Fourth Problem — iptables Inside Docker
I ran another flood and saw the ban fire in the logs:

But when I checked the host iptables:

Nothing appeared. The DROP rule was being added inside the Docker container's network namespace, not the host machine's. This meant the attacker's traffic was still reaching the server untouched. The fix was to run the detector with network_mode: host, which makes the container share the host's network stack directly:

When using network_mode: host, the container cannot be on a named network, so I also removed the networks and ports entries from the detector service. This introduced a new problem: Nginx could no longer reach the detector at detector:8080, since the detector was no longer on the Docker bridge network. The fix for the dashboard proxy was to use the Docker bridge gateway IP, which I found with:

I updated nginx.conf to proxy the dashboard to 172.19.0.1:8080 instead of detector:8080:

After restarting Nginx, the dashboard came back up and iptables DROP rules now appeared on the host.

Step 10: Testing Everything End to End
With all fixes applied, I ran a final full test. I opened three terminals:

Terminal 1 (server) — watch iptables live:
Terminal 2 (server) — watch detector logs live:
Terminal 3 (Mac) — run the flood:

Within about 10 seconds of the flood starting, the detector fired:

Terminal 1 immediately showed:

Slack received both the ban notification and the global anomaly alert. Ten minutes later, the unbanner thread fired and Slack received the unban notification.

How the Detection Works (For Beginners)
Now that you have seen the process, let me explain the key concepts clearly.

The Sliding Window
Here is the core concept: to know whether traffic is abnormal, you first need to know the current rate of traffic. A sliding window is how you do that efficiently.

Think of it like a conveyor belt. Each request puts a timestamp on the belt. The belt is exactly 60 seconds long. Old timestamps fall off the left end automatically. At any moment, the number of timestamps on the belt divided by 60 gives you the current requests per second.
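In code, the window is only a few lines. Here is a simplified sketch (the real detector keeps one of these per IP plus a global one, as explained next):

```python
# Sketch of a 60-second sliding window of request timestamps.
import time
from collections import deque

WINDOW_SECONDS = 60

class SlidingWindow:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.timestamps = deque()  # oldest timestamp sits at the left end of the "belt"

    def add(self, ts=None):
        self.timestamps.append(time.time() if ts is None else ts)

    def rate(self, now=None):
        now = time.time() if now is None else now
        # Evict timestamps older than the window: they "fall off the left end"
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window  # requests per second

window = SlidingWindow()
window.add()
print(window.rate())
```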
In Python, a collections.deque is the perfect data structure for this because removing items from the left (eviction) is O(1), meaning it takes the same time regardless of how many items are in the queue. Every time a new log entry arrives, its timestamp goes into both the per-IP deque and the global deque. Every second, the evaluator evicts old timestamps and counts what remains. If the count is too high, it fires an alert.

The key thing to understand: this window is exact. It is not an approximation or a counter that resets every minute. It literally counts every request that arrived in the last 60 seconds.

The Rolling Baseline
Knowing the current rate is only half the problem. You also need to know what normal looks like. Is 10 requests per second high? It depends. At 3am with one user, yes. At 2pm with many users, maybe not. This is where the rolling baseline comes in. Instead of hardcoding a threshold like "block anything above 100 req/s", the daemon learns from actual traffic. The baseline works like this:

Step 1: Count requests per second in buckets. Each second gets its own bucket with a count. These buckets cover the last 30 minutes.

Step 2: Every 60 seconds, recalculate the mean and standard deviation. The mean tells you the average rate. The standard deviation tells you how much the rate varies normally.

Step 3: Keep per-hour slots. Traffic at 3am is different from traffic at 3pm. The baseline maintains a separate record for each clock hour. When the current hour has enough data, it is preferred over the global average. This means the detector adapts to time-of-day patterns automatically.

During very quiet periods, the computed mean might be nearly zero. If the mean is 0.001 req/s and one request arrives, the rate is suddenly thousands of times the mean, which would trigger a false alarm. To prevent this, a minimum floor is enforced:

How Detection Makes a Decision
With the current rate and the baseline established, detection comes down to a single calculation, the z-score: the current rate minus the mean, divided by the standard deviation.

The z-score measures how many standard deviations the current rate is above the mean. In a normal distribution, almost everything falls within about three standard deviations of the mean, so a rate far above that is very unlikely to be ordinary traffic.

Two checks run together: is the current rate more than 5x the mean, and is the z-score above the threshold? The 5x multiplier catches sudden bursts even when the stddev is large. The z-score threshold catches sustained elevated rates even when the burst is not huge but is statistically abnormal.

Error surge detection adds another layer. If an IP is sending a lot of 4xx or 5xx errors (typical of scanners probing for vulnerabilities), the thresholds are automatically tightened by 50%, making it easier to ban that IP.
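Putting the window, the baseline, and the thresholds together, the per-second decision can be sketched like this (the threshold values and the error-ratio cut-off are illustrative, not the exact numbers from my config):

```python
# Simplified sketch of the detection decision.
import statistics

Z_THRESHOLD = 3.0        # "statistically abnormal" cut-off
BURST_MULTIPLIER = 5.0   # sudden-burst cut-off relative to the mean
MIN_MEAN = 1.0           # floor so very quiet periods do not cause false alarms

def is_anomalous(current_rate, baseline_rates, error_ratio=0.0):
    mean = max(statistics.mean(baseline_rates), MIN_MEAN)
    stddev = statistics.pstdev(baseline_rates) or 0.1  # avoid division by zero

    z_threshold, burst_multiplier = Z_THRESHOLD, BURST_MULTIPLIER
    if error_ratio > 0.5:            # mostly 4xx/5xx responses: likely a scanner
        z_threshold *= 0.5           # tighten both thresholds by 50%
        burst_multiplier *= 0.5

    z_score = (current_rate - mean) / stddev
    return current_rate > burst_multiplier * mean or z_score > z_threshold

# Baseline hovering around 2 req/s, current rate 40 req/s -> anomalous
print(is_anomalous(40, [2, 3, 2, 1, 2, 3]))
```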
iptables Blocking
When an IP is flagged as anomalous, the blocker adds a DROP rule to the Linux firewall. iptables is the kernel-level packet filter in Linux. Adding a DROP rule means the kernel silently discards all packets from that IP before they even reach Nginx. The attacker gets no response at all.

The -I INPUT flag inserts the rule at position 1, which means it is evaluated before any other rules. This is important because iptables processes rules in order and stops at the first match.

The backoff schedule means repeat offenders get banned for longer:

When the ban expires, the unbanner thread removes the rule:

And sends a Slack notification so the operator knows the IP has been released.

The Live Dashboard
The dashboard is a Flask web application that serves a single HTML page. It auto-refreshes every 3 seconds using an HTML meta refresh tag and shows:

There is also a /api/metrics JSON endpoint so the data can be consumed programmatically:

Final Verification
Everything passed. The daemon was left running for the required 12 continuous hours and responded correctly to the test attack traffic sent by the graders.

The Big Picture
Stage 3 introduced concepts that real security engineers work with every day:

The hardest bugs were not the obvious ones. The architecture mismatch, the iptables namespace issue, and the stale webhook URL were all invisible until the system was running under real conditions. That is what makes security tooling hard and interesting.

Stage 4 is next. Follow along as I keep documenting the journey.

Find me on Dev.to | GitHub