Complete Guide: How I Built a Real-Time Anomaly Detector
GitHub repo: Codebase
02 - From Log Line to Signal
03 - Teaching the System What "Normal" Means
Part 1: Sliding Window (what is happening right now)
Part 2: Rolling Baseline (what is usually happening)
Why this pair works
04 - How the Detector Decides "This Is an Attack"
Rule 1: Z-score check
Rule 2: Multiplier check
Global vs Per-IP decisions
Error-surge tightening (adaptive sensitivity)
Cooldown to avoid alert spam
What this looked like in my real test
05 - When the Detector Acts (Not Just Watches)
1) Per-IP anomaly response
2) Global anomaly response
3) Slack notifications (operational visibility)
4) Auto-unban backoff policy
5) Audit trail (compliance + investigation)
What I observed in live testing
06 - Dashboard and Testing
1) Live metrics dashboard
2) Testing strategy
3) Practical lessons that mattered most
Closing note

One quiet evening, a cloud server looked normal on the surface.
Users logged in, files moved, and everything felt stable. But hidden inside the traffic were strange patterns: sudden spikes, repeated requests, and behavior that did not look human. My task was simple to say but serious to build: teach the server what "normal" traffic looks like, then react fast when traffic becomes dangerous.

Here is the world of this project, in one clear flow:

Request arrives -> Request is logged -> Behavior is measured -> Anomaly is detected -> Response is applied

In short: this is not just monitoring. It is a live defense loop.

Why this matters: fixed limits fail in real systems. Day traffic and night traffic are different, and normal today may look abnormal tomorrow. So instead of hardcoding guesses, this system learns from recent traffic and makes decisions from evidence. I will show how one log line becomes a usable event, field by field, and why that data quality decides whether your detector is smart or blind.

Every defense system begins with one question: what exactly happened? In this project, the answer comes from Nginx JSON access logs. Each request is written as one structured line in /var/log/nginx/hng-access.log. A typical line includes fields such as the client IP, the HTTP status, and the request timestamp. This is important because attackers leave patterns in exactly these fields.

The daemon continuously tails this file, line by line, in real time. For each line, it does three clear steps: read the raw line, parse the JSON, and convert it into a typed event. In the implementation, this is handled by NginxLogMonitor.parse_line() and NginxLogMonitor.follow() in detector/monitor.py, where malformed JSON is ignored safely and valid lines are converted to typed events. If a line is malformed, it is skipped and logged, not allowed to crash the daemon. That keeps the detector resilient during noisy production traffic.

Detector daemon processing live log lines

At this stage, the system is not "guessing attacks" yet. It is building trusted inputs. And trusted inputs are everything: if your input data is wrong, your baseline is wrong, and if your baseline is wrong, your alerting is noise.
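A minimal sketch of that parse-and-skip step (field names like remote_addr and request_uri are assumptions about the log format, not taken from the repo's actual NginxLogMonitor):

```python
import json


def parse_line(line: str):
    """Parse one JSON access-log line into a dict of typed fields.

    Returns None for malformed lines so the caller can skip them
    instead of letting one bad line crash the daemon.
    """
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        return None  # not valid JSON: skip, never raise
    try:
        return {
            "ip": str(raw["remote_addr"]),
            "status": int(raw["status"]),
            "path": str(raw["request_uri"]),
        }
    except (KeyError, ValueError, TypeError):
        return None  # missing or mistyped fields: also skip


# Malformed input is ignored, valid input becomes a typed event:
print(parse_line("not json"))  # None
print(parse_line('{"remote_addr": "1.2.3.4", "status": "404", "request_uri": "/x"}'))
# {'ip': '1.2.3.4', 'status': 404, 'path': '/x'}
```

Returning None instead of raising is the property that matters here: the tail loop can simply `continue` past garbage lines during noisy production traffic.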
Next, we move from clean events to live behavior modeling using 60-second deque windows and a rolling 30-minute baseline.

Now the detector can read traffic. The next question is: how does it know what is normal? I used two ideas together: a sliding window for what is happening right now, and a rolling baseline for what usually happens.

Think of a moving glass box that always holds only the last 60 seconds of requests. I keep two window views: one global window over all traffic, and one window per source IP. This gives immediate answers: how busy the server is right now, and how busy each individual client is right now.

Why a deque? Because it adds and removes from both ends in constant time, which is perfect for real-time eviction. You can see this directly in detector/detector.py, where SlidingWindowEngine keeps global_window and ip_windows as deque objects and evicts old timestamps inside _evict_old().

A spike is only suspicious if you compare it to history. So every second, the system stores request counts and keeps only the last 30 minutes (1800 seconds). Every 60 seconds, it recalculates the baseline mean and standard deviation. It also keeps hourly slots and prefers the current hour once enough data exists. That prevents midnight traffic patterns from being judged with daytime expectations. This behavior is implemented in RollingBaselineEngine.recalculate() in detector/baseline.py, where counts are grouped by hour key and the current hour is preferred when min_current_hour_samples is satisfied.

Baseline trend over time with hourly difference

Together, these two engines answer the key security question: is this traffic high, or just high compared to what is normal right now?

Next, I will show how the detector makes a final anomaly decision using z-score, multiplier checks, and error-surge tightening.

At this point, the detector has two things: a live view of current traffic, and a learned baseline of normal traffic. Now it must decide: alert or ignore. I use two decision checks. If either one fires, the traffic is marked anomalous.

The z-score asks: how far is current traffic from the average, measured in standard deviations? In code, this check lives in AnomalyEvaluator.evaluate() (detector/detector.py), where global_z and per-IP ip_z are computed against effective_mean and effective_stddev.
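The window rate and z-score pieces above can be sketched together. This is a simplified sketch, not the repo's SlidingWindowEngine; the class and helper names here are mine, and only the deque-eviction idea is taken from the article:

```python
from collections import deque


class SlidingWindow:
    """Holds only the event timestamps from the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.timestamps = deque()  # O(1) append/popleft at both ends

    def add(self, ts: float) -> None:
        self.timestamps.append(ts)

    def _evict_old(self, now: float) -> None:
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] < cutoff:
            self.timestamps.popleft()  # constant-time eviction of stale events

    def rate(self, now: float) -> float:
        """Requests per second across the live window."""
        self._evict_old(now)
        return len(self.timestamps) / self.window_seconds


def z_score(rate: float, mean: float, stddev: float) -> float:
    """How many standard deviations the current rate sits above the mean."""
    return 0.0 if stddev == 0 else (rate - mean) / stddev


# 120 requests spread evenly over the last 60 seconds -> 2 req/s
w = SlidingWindow(60.0)
for i in range(120):
    w.add(1000.0 - 60.0 + i * 0.5)
print(w.rate(1000.0))                                   # 2.0
print(z_score(w.rate(1000.0), mean=0.5, stddev=0.25))   # 6.0
```

Because timestamps arrive in order, eviction only ever touches the left end of the deque, which is why a deque beats a list here: `popleft()` is O(1) while `list.pop(0)` is O(n).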
The multiplier check asks: is current traffic many times bigger than the normal mean? The same method (AnomalyEvaluator.evaluate()) also applies the multiplier branch when the z-score is not the first trigger. So if one check misses, the other often catches.

The same logic is applied in two scopes: globally across all traffic, and per source IP. This matters because some attacks are distributed (global spike), while others are noisy from one source (single-IP flood).

Attack traffic often produces lots of 4xx/5xx responses. So if an IP's error rate is much higher than normal, the detector becomes stricter for that IP. This is implemented by comparing the per-IP error window rate with the baseline error_mean, then switching to tighter thresholds (tightened_z_threshold, tightened_multiplier_threshold) for that source. This helps catch abusive behavior earlier, even if the total request rate is not yet extreme.

Without control, the same spike could trigger repeated alerts every second. So I added a short cooldown window between repeated alerts for the same scope/IP. That cooldown is enforced in detector/detector.py using monotonic timestamps (_last_global_alert_at and _last_ip_alert_at). That keeps alerts actionable instead of noisy.

During burst testing, I observed both global and per-IP detections, and I saw the detector switch between the z-score trigger and the multiplier trigger. That confirmed the decision engine is not relying on one fragile rule.

Global anomaly alert in Slack

Next, I will show the response side: iptables blocking, Slack notifications, and timed unban logic.

Detection alone is not defense. A real system must respond. In this project, once an anomaly is confirmed, the flow becomes:

Anomaly -> Action -> Alert -> Audit

If one IP behaves abnormally, the detector blocks it, schedules its unban, sends a notification, and writes an audit entry. The action chain is orchestrated in run() inside detector/main.py, calling IptablesBlocker.block_ip() (detector/blocker.py), then UnbanScheduler.register_ban() (detector/unbanner.py), then the notifier and audit handlers. This is the direct containment path.
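A minimal sketch of the two-rule decision plus cooldown. The function names and threshold defaults here are illustrative, not the repo's actual AnomalyEvaluator values:

```python
def is_anomalous(rate: float, mean: float, stddev: float,
                 z_threshold: float = 3.0,
                 mult_threshold: float = 5.0) -> bool:
    """Flag traffic when EITHER rule fires."""
    if stddev > 0 and (rate - mean) / stddev >= z_threshold:
        return True   # Rule 1: statistically far from the mean
    if mean > 0 and rate / mean >= mult_threshold:
        return True   # Rule 2: many times the mean (survives a noisy stddev)
    return False


class Cooldown:
    """Suppress repeat alerts for the same scope/IP within `seconds`."""

    def __init__(self, seconds: float = 60.0):
        self.seconds = seconds
        self._last = {}  # scope key -> monotonic time of last alert

    def allow(self, key: str, now: float) -> bool:
        last = self._last.get(key)
        if last is not None and now - last < self.seconds:
            return False           # still cooling down: drop the alert
        self._last[key] = now
        return True


print(is_anomalous(50, mean=5, stddev=2))    # True  (z = 22.5)
print(is_anomalous(30, mean=5, stddev=20))   # True  (z only 1.25, but 6x mean)
print(is_anomalous(8,  mean=5, stddev=10))   # False (neither rule fires)

cd = Cooldown(60.0)
print(cd.allow("global", now=0.0))    # True  -> alert sent
print(cd.allow("global", now=10.0))   # False -> suppressed
print(cd.allow("global", now=100.0))  # True  -> cooldown elapsed
```

The second example shows why the multiplier branch exists: a large stddev mutes the z-score, but a 6x jump over the mean is still clearly abnormal.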
iptables output showing blocked IP

If the whole traffic pattern spikes abnormally but no single attacker can be isolated, the detector raises a global alert without banning individual sources. This avoids over-blocking during distributed spikes.

Every critical event is pushed to Slack with context. Those Slack payloads are assembled in detector/notifier.py via send_global_alert(), send_ban_alert(), and send_unban_alert().

Ban notification in Slack

Unban notification in Slack

So the security response is visible in real time, not hidden in terminal logs.

A banned IP is not always permanent. The scheduler uses progressive durations, so repeat offenders stay banned for longer each time. This balances two goals: containing abuse quickly, and not punishing a briefly misbehaving client forever. The backoff sequence is configured in detector/config.yaml (blocking.ban_durations_seconds) and applied by UnbanScheduler.register_ban() in detector/unbanner.py.

For every key action, a structured line is written:

[timestamp] ACTION ip | condition | rate | baseline | duration

This exact format is written by AuditLogger.write() in detector/audit.py and is called from detector/main.py for BASELINE_RECALC, GLOBAL_ALERT, BAN, and UNBAN. This made troubleshooting much easier during testing because I could reconstruct exactly what happened and when.

Audit log with BAN, UNBAN, and BASELINE_RECALC events

That confirmed the project had moved from "monitoring script" to a real response loop. Next, I will cover the dashboard and the testing strategy.

At this stage, the core detector could already identify and respond to bad traffic. I added a lightweight dashboard that refreshes every 3 seconds and shows the live state: global_rps, the current baseline, banned IPs, and uptime. It is implemented in detector/dashboard.py and started from detector/main.py when dashboard.enabled is true in detector/config.yaml. The dashboard exposes these values as live JSON metrics.

Live dashboard metrics view

That gave me instant feedback while testing and operational visibility without tailing raw logs.

I tested in layers, not all at once, checking one expectation per layer.

First, log parsing. Expected: a new parsed event appears in detector output.

Then window counting. Expected: global_rps and per-IP counts increase, then decay after traffic stops.

Then the baseline. Expected: BASELINE_RECALC entries appear every configured interval with mean/stddev values.
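The hour-preference behavior those BASELINE_RECALC entries verify can be sketched roughly like this. This is a simplified stand-in for RollingBaselineEngine.recalculate(); the function signature, sample shape, and default of 30 samples are my assumptions:

```python
import statistics
from collections import defaultdict


def recalc_baseline(samples, current_hour, min_current_hour_samples=30):
    """samples: list of (hour, requests_per_second) pairs from recent history.

    Prefer the current hour's data once it has enough samples, so night
    traffic is never judged against daytime expectations.
    """
    by_hour = defaultdict(list)
    for hour, rps in samples:
        by_hour[hour].append(rps)
    pool = by_hour[current_hour]
    if len(pool) < min_current_hour_samples:
        pool = [rps for _, rps in samples]   # not enough: use all recent data
    return statistics.fmean(pool), statistics.pstdev(pool)


# 40 quiet samples from the current hour (2), 60 busy ones from hour 1:
samples = [(1, 20.0)] * 60 + [(2, 2.0)] * 40
print(recalc_baseline(samples, current_hour=2))   # (2.0, 0.0)
```

With the hour preference, the quiet current hour sets the baseline (mean 2.0), so the old busy samples cannot make 2 req/s look "normal-high" or a real night-time spike look safe.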
Then detection. Expected: anomaly events for both global and per-IP scopes.

Then response. Expected: the blocked IP rule appears, audit logs show BAN/UNBAN, and Slack receives matching alerts.

Finally, the dashboard. Expected: live JSON metrics update continuously (global_rps, baseline, banned IPs, uptime).

This step-by-step test order made debugging faster because each stage had a single purpose.

One caveat from real testing: I once locked myself out of my own server because the detector correctly flagged and blocked the source IP I was using for SSH. That incident taught me to always protect administrator IP ranges in blocking.protected_cidrs, keep a recovery path (AWS Console/SSM), and run the detector under systemd so it can recover safely after interruptions.

This project started as "watch traffic."
and ended as a full loop:

observe -> learn -> detect -> respond -> explain

And that last part, explain, is what makes the system defensible in both engineering reviews and security operations.