How I Built a Real-Time DDoS Detection Engine From Scratch (And What I Learned)

A few weeks ago I was given a task that sounded simple on paper: watch incoming HTTP traffic, figure out what "normal" looks like, and automatically block anything that looks like an attack. No third-party tools. No Fail2Ban. Just raw Python, some math, and a Linux firewall. This post is my attempt to explain how I built it, in plain English, for someone who has never touched security tooling before.

WHY THIS PROBLEM IS HARDER THAN IT SOUNDS

The obvious first instinct is to say: "just block any IP that sends more than X requests per minute." Simple, right? The problem is that X is different at 2am than it is at 2pm. A server that handles 500 requests per minute at peak hours would be flagged as under attack by a system that only expects 50. And a real attacker sending 60 requests per minute during a quiet period would slip right through.

What you actually need is a system that watches traffic over time, builds up a picture of what "normal" looks like right now, and flags anything that deviates significantly from that picture. That is exactly what I built.

The stack has three pieces running in Docker:

Nginx sits in front of everything, receives all incoming HTTP requests, and writes a log entry for every single one. These logs are in JSON format and go into a shared folder that the other containers can read.

Nextcloud is the actual application being protected. It never sees the internet directly. All traffic comes through Nginx.

The Detector is a Python daemon that reads Nginx's log file in real time, keeps track of request patterns, does some math, and fires off iptables rules and Slack alerts when it detects something suspicious.

The detector is the interesting part. Let me break down how it works.

PART 1: READING LOGS IN REAL TIME

The first challenge is reading a log file that is actively being written to. This is called "tailing" a file, the same thing the Unix tail -f command does. The implementation is simple but clever: we open the file, skip to the end so we don't replay old entries, and then sit in a loop reading one line at a time. When there is nothing new, we wait 100 milliseconds and try again. When Nginx writes a new entry, we pick it up almost immediately and pass it to the rest of the system.
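In simplified form, the loop looks something like this (a sketch: the log path and the print placeholder are illustrative, not the real code):

```python
import json
import time

def follow(path):
    """Yield parsed JSON log entries as Nginx appends them, like `tail -f`."""
    with open(path, "r") as f:
        f.seek(0, 2)  # skip to the end so old entries are not replayed
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.1)  # nothing new yet; wait 100 ms and retry
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore partial or malformed lines

if __name__ == "__main__":
    for entry in follow("/var/log/nginx/access.json"):
        print(entry)  # in the real detector this feeds the sliding windows
```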
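Each log entry is a JSON object. The exact field names depend on the log_format block in the Nginx config, so treat this as a representative example rather than the precise schema:

```json
{
  "time": "2024-05-12T14:03:21+00:00",
  "remote_addr": "203.0.113.42",
  "request": "GET /login HTTP/1.1",
  "status": 401,
  "body_bytes_sent": 612,
  "http_user_agent": "Mozilla/5.0 ..."
}
```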
PART 2: THE SLIDING WINDOW

For every IP address that makes a request, I need to know: how many requests has this IP made in the last 60 seconds?

The naive approach would be to store a counter per IP and reset it every minute. The problem with that is you lose precision. If an IP sends 100 requests in the last 5 seconds of a minute and 100 more in the first 5 seconds of the next minute, your per-minute counter sees two calm minutes when the reality was a burst of 200 requests in 10 seconds.

The correct approach is a sliding window. Instead of a counter that resets, you store the exact timestamp of every request and evict the ones that are older than 60 seconds. I use Python's deque (double-ended queue) because removing from the left is O(1), extremely fast. Since timestamps are always added in chronological order, the oldest ones are always on the left. The eviction loop just keeps popping from the left until the oldest remaining entry is within the 60-second window.

The same structure runs in parallel for all traffic combined (a global window), and also separately for tracking error responses (4xx and 5xx status codes) per IP.
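Stripped down to its essentials, the per-IP window looks roughly like this (a sketch; the real code also maintains the global and error windows the same way):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60

# One deque of timestamps per IP, created on first sight.
requests_by_ip = defaultdict(deque)

def record_request(ip):
    now = time.time()
    window = requests_by_ip[ip]
    window.append(now)  # timestamps arrive in chronological order
    _evict(window, now)

def current_rate(ip):
    """How many requests this IP has made in the last 60 seconds."""
    window = requests_by_ip[ip]
    _evict(window, time.time())
    return len(window)

def _evict(window, now):
    # Pop from the left until the oldest entry is inside the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
```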
PART 3: HOW THE BASELINE LEARNS

The sliding window tells me the current rate. But I need something to compare it against. That is what the baseline does. It maintains two data structures.

First, a rolling 30-minute window. Every second, I record how many total requests came in during that second. The rolling window holds the last 1800 of these per-second counts (1800 seconds = 30 minutes). When a new count is added, the oldest one automatically falls off the end.

Second, hourly slots. Traffic patterns are different at different times of day; morning rush hour looks different from 3am. So I keep a separate list of per-second counts for each hour of the day (0 through 23). If I have enough data for the current hour (more than 60 data points), I use that instead of the 30-minute rolling window, because it reflects the traffic pattern right now more accurately.

Every 60 seconds, I recalculate the mean and standard deviation from whichever data source I am using (a condensed sketch appears after Part 4). I floor both values at 1.0 to avoid division by zero later. This also means the system starts in a slightly paranoid state: it assumes traffic should be low until it has seen enough data to know better.

PART 4: HOW THE DETECTION LOGIC FIRES

Now I have two numbers: the current request rate for a specific IP (from the sliding window) and the baseline (mean and standard deviation of what normal looks like). The detection logic compares them. I use the z-score formula:

z = (current_rate - baseline_mean) / baseline_std

The z-score tells you how many standard deviations away from the mean a value is. Under a normal distribution, a z-score above 3.0 happens less than 0.3% of the time by chance. So if an IP's rate has a z-score above 3.0, it is almost certainly not normal traffic.

There is also a second check that catches fast burst attacks before the z-score has time to react: if the raw rate exceeds 5 times the mean, the IP is flagged regardless of the z-score. This is important because the z-score is calculated from historical data, and if an attacker ramps up from zero to hundreds of requests per second in a few moments, the standard deviation may not have widened fast enough to capture it.

There is a third condition that makes the system more sensitive to scanning attacks. If an IP's error rate (the proportion of 4xx and 5xx responses it receives) is more than 3 times the global error rate, the detector tightens the z-score threshold from 3.0 to 2.0 for that IP. An IP that is getting a lot of errors is probably trying endpoints that do not exist, which is a classic sign of a scanner or a credential-stuffing bot.
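Here is the baseline logic from Part 3 boiled down (a sketch: names are illustrative, and the real version would also need to cap the hourly lists):

```python
import statistics
import time
from collections import deque

rolling = deque(maxlen=1800)          # last 30 minutes of per-second counts
hourly = {h: [] for h in range(24)}   # per-second counts, bucketed by hour

def record_second(count):
    rolling.append(count)             # oldest count falls off automatically
    hourly[time.localtime().tm_hour].append(count)

def recalc_baseline():
    """Runs every 60 seconds. Returns (mean, std), both floored at 1.0."""
    samples = hourly[time.localtime().tm_hour]
    if len(samples) <= 60:            # not enough data for this hour yet
        samples = list(rolling)       # fall back to the 30-minute window
    if len(samples) < 2:
        return 1.0, 1.0               # paranoid defaults until we learn more
    mean = max(statistics.mean(samples), 1.0)
    std = max(statistics.stdev(samples), 1.0)
    return mean, std
```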
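And the three conditions combine into a single check, roughly like this (again a sketch, using the thresholds described above):

```python
Z_THRESHOLD = 3.0          # normal sensitivity
TIGHT_Z_THRESHOLD = 2.0    # for IPs with suspicious error rates
BURST_MULTIPLIER = 5       # raw-rate check, independent of the z-score
ERROR_RATIO_TRIGGER = 3    # IP error rate vs. global error rate

def is_anomalous(rate, mean, std, ip_error_rate, global_error_rate):
    # Check 2: fast bursts, caught before the z-score has time to react.
    if rate > BURST_MULTIPLIER * mean:
        return True

    # Check 3: likely scanners get a tighter threshold.
    threshold = Z_THRESHOLD
    if global_error_rate > 0 and ip_error_rate > ERROR_RATIO_TRIGGER * global_error_rate:
        threshold = TIGHT_Z_THRESHOLD

    # Check 1: statistical deviation from the learned baseline.
    z = (rate - mean) / std
    return z > threshold
```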
PART 5: BLOCKING WITH IPTABLES

When an IP is flagged as anomalous, the detector calls out to the Linux kernel's built-in packet filter to drop all future traffic from that address. iptables is the firewall that sits inside the Linux kernel and decides what to do with every network packet that arrives at the machine. Rules can be written to accept, reject, or silently drop packets based on source IP, destination port, protocol, and many other factors.

The command to block an IP looks like this (shown against the INPUT chain for illustration; as I explain at the end, in a Docker setup the rule actually needs to live in the DOCKER-USER chain):

iptables -I INPUT -s <attacker-ip> -j DROP

From Python, this is just a subprocess call:

subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

Bans do not last forever. The system uses an escalating backoff schedule: the first offence gets a 10-minute ban, the second gets 30 minutes, the third gets 2 hours, and any subsequent offence results in a permanent ban. A background thread checks every 30 seconds whether any active ban has expired and removes the iptables rule when it has. To remove a rule, you use the -D flag instead of -I:

iptables -D INPUT -s <attacker-ip> -j DROP

ALERTS AND THE DASHBOARD

Every time a ban is applied or lifted, a Slack message goes out with the IP, the reason, the current rate, the baseline mean, and the ban duration. This gives the on-call team an instant signal that something is happening, even if they are not watching the dashboard.

The dashboard itself is a plain HTML page served by Python's built-in http.server module. It refreshes automatically every 3 seconds and shows banned IPs, the global request rate, the baseline mean and standard deviation, the top 10 source IPs, CPU usage, memory usage, and uptime: everything a responder needs to understand the situation at a glance.

The hardest part of this project was not the code. It was understanding the mental model. A detection system is only as good as the question it is asking. "Is this traffic high?" is a useless question without knowing what "high" means in context. The baseline is the thing that gives the question meaning.

The second thing I learned is that blocking at the right level matters. An iptables rule inside a Docker container's own network namespace does nothing for traffic going to a different container. You need to either put the container in the host's network namespace or use the DOCKER-USER chain from the host, which is what I ended up doing.

If you are building something similar, start with the simplest possible thing (a counter and a fixed threshold) and get it working end to end before adding the statistics. Once you can see the numbers in a real log file, the math becomes much less abstract.

The full source code is on GitHub at https://github.com/PrincewillDev/anomaly_detection_engine. If you have questions or spot a bug, feel free to open an issue.