Tools: Complete Guide to How I Built a Real-Time Anomaly Detector

Source code: GitHub repo

02 - From Log Line to Signal

03 - Teaching the System What "Normal" Means

Part 1: Sliding Window (what is happening right now)

Part 2: Rolling Baseline (what is usually happening)

Why this pair works

04 - How the Detector Decides "This Is an Attack"

Rule 1: Z-score check

Rule 2: Multiplier check

Global vs Per-IP decisions

Error-surge tightening (adaptive sensitivity)

Cooldown to avoid alert spam

What this looked like in my real test

05 - When the Detector Acts (Not Just Watches)

1) Per-IP anomaly response

2) Global anomaly response

3) Slack notifications (operational visibility)

4) Auto-unban backoff policy

5) Audit trail (compliance + investigation)

What I observed in live testing

06 - Dashboard and Testing

1) Live metrics dashboard

2) Testing strategy

3) Practical lessons that mattered most

One quiet evening, a cloud server looked normal on the surface.

Users logged in, files moved, and everything felt stable. But hidden inside the traffic were strange patterns: sudden spikes, repeated requests, and behavior that did not look human. My task was simple to say, but serious to build: teach the server what "normal" traffic looks like, then react fast when traffic becomes dangerous.

Here is the world of this project, in one clear flow:

- Nginx stands at the gate and receives every request.
- Nextcloud serves the actual application.
- Nginx JSON logs record each request as structured data.
- Detector daemon reads those logs in real time, learns normal behavior, and flags anomalies.
- iptables blocks abusive IPs when needed.
- Slack alerts report important incidents immediately.
- Dashboard shows live health, rates, and bans.

Request arrives -> Request is logged -> Behavior is measured -> Anomaly is detected -> Response is applied

In short: this is not just monitoring. It is a live defense loop.

Why this matters: fixed limits fail in real systems. Day traffic and night traffic are different. Normal today may look abnormal tomorrow. So instead of hardcoding guesses, this system learns from recent traffic and makes decisions from evidence. I will show how one log line becomes a usable event, field by field, and why that data quality decides whether your detector is smart or blind.

02 - From Log Line to Signal

Every defense system begins with one question: what exactly happened? In this project, the answer comes from Nginx JSON access logs. Each request is written as one structured line in /var/log/nginx/hng-access.log. A typical line includes:

- source_ip: who sent the request
- timestamp: when it happened
- method: how it was sent (GET, POST, etc.)
- path: what endpoint was requested
- status: how the server responded (200, 404, 500, ...)
- response_size: how much data was returned

This is important because attackers leave patterns in these fields:

- same IP sending too many requests
- repeated hits on sensitive paths
- unusual spikes in 4xx/5xx errors

The daemon continuously tails this file, line by line, in real time. For each line, it does three clear steps:

- Parse the JSON safely.
- Validate required fields.
- Normalize values into one clean event object used by the detector.

In the implementation, this is handled by NginxLogMonitor.parse_line() and NginxLogMonitor.follow() in detector/monitor.py, where malformed JSON is ignored safely and valid lines are converted to typed events. If a line is malformed, it is skipped and logged, not allowed to crash the daemon. That keeps the detector resilient during noisy production traffic.

Detector daemon processing live log lines

At this stage, the system is not "guessing attacks" yet. It is building trusted inputs. And trusted inputs are everything. Because if your input data is wrong, your baseline is wrong. If your baseline is wrong, your alerting is noise.
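The parsing step above can be sketched as a small standalone function. This is a simplified illustration, not the repo's actual NginxLogMonitor.parse_line(); the field set and type normalization are assumptions based on the fields listed in this article.

```python
import json
from typing import Optional

# Fields this sketch assumes every valid event must carry.
REQUIRED_FIELDS = {"source_ip", "timestamp", "method", "path", "status"}

def parse_line(raw: str) -> Optional[dict]:
    """Parse one Nginx JSON access-log line into a clean event dict.

    Malformed JSON or missing fields return None so the caller can
    skip the line instead of letting it crash the daemon.
    """
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # skip malformed lines safely
    if not REQUIRED_FIELDS.issubset(event):
        return None  # skip lines missing required fields
    event["status"] = int(event["status"])  # normalize the status code type
    return event

# A valid line becomes a typed event; garbage is ignored.
good = parse_line('{"source_ip": "1.2.3.4", "timestamp": "2024-01-01T00:00:00Z", '
                  '"method": "GET", "path": "/", "status": "200"}')
bad = parse_line("not json at all")
```

The key design point is that failure is a return value, not an exception that escapes the tail loop.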
Next, we move from clean events to live behavior modeling using 60-second deque windows and a rolling 30-minute baseline.

03 - Teaching the System What "Normal" Means

Now the detector can read traffic. The next question is: how does it know what is normal? I used two ideas together:

- a 60-second sliding window for live speed
- a 30-minute rolling baseline for memory

Part 1: Sliding Window (what is happening right now)

Think of a moving glass box that always holds only the last 60 seconds of requests.

- Every new request timestamp is added.
- Any timestamp older than 60 seconds is removed.
- The number left in the box is the current traffic pressure.

I keep two window views:

- global window: all requests together
- per-IP windows: each IP gets its own 60-second queue

This gives immediate answers:

- global req/s
- top talkers by IP
- who is suddenly noisy

Why deque? Because it adds and removes from both ends in constant time, which is perfect for real-time eviction. You can see this directly in detector/detector.py, where SlidingWindowEngine keeps global_window and ip_windows as deque objects and evicts old timestamps inside _evict_old().

Part 2: Rolling Baseline (what is usually happening)

A spike is only suspicious if you compare it to history. So every second, the system stores request counts and keeps only the last 30 minutes (1800 seconds). Every 60 seconds, it recalculates:

- effective_mean (average normal load)
- effective_stddev (how much normal load fluctuates)
- error_mean (average 4xx/5xx pressure)

It also keeps hourly slots and prefers the current hour once enough data exists. That prevents midnight traffic patterns from being judged with daytime expectations. This behavior is implemented in RollingBaselineEngine.recalculate() in detector/baseline.py, where counts are grouped by hour key and the current hour is preferred when min_current_hour_samples is satisfied.

Baseline trend over time with hourly difference

Why this pair works

- Sliding window = fast eyes (present moment)
- Baseline = memory (recent behavior)

Together, they answer the key security question: is this traffic high, or just high compared to what is normal right now? Next, I will show how the detector makes a final anomaly decision using z-score, multiplier checks, and error-surge tightening.

04 - How the Detector Decides "This Is an Attack"

At this point, the detector has two things:

- live request rate (from the sliding window)
- normal behavior reference (from the baseline)

Now it must decide: alert or ignore. I use two decision checks. If either one fires, the traffic is marked anomalous:

- z-score > 3.0
- current_rate > 5 x baseline_mean

Rule 1: Z-score check

Z-score asks: how far is current traffic from the average, measured in standard deviations?

- small z-score = normal fluctuation
- high z-score = unusual surge

In code, this check is in AnomalyEvaluator.evaluate() (detector/detector.py), where global_z and per-IP ip_z are computed against effective_mean and effective_stddev.
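The window and baseline mechanics described above can be sketched in miniature. This is a simplified illustration under assumed interfaces, not the repo's actual SlidingWindowEngine or RollingBaselineEngine.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

class SlidingWindowEngine:
    """Keep only the last 60 seconds of request timestamps, globally and per IP."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self.global_window = deque()           # timestamps, oldest first
        self.ip_windows = defaultdict(deque)   # ip -> its own timestamp queue

    def add(self, ip: str, now: float) -> None:
        self.global_window.append(now)
        self.ip_windows[ip].append(now)
        self._evict_old(now)

    def _evict_old(self, now: float) -> None:
        # deque pops from the left in O(1), which keeps eviction cheap.
        cutoff = now - self.window_seconds
        for dq in (self.global_window, *self.ip_windows.values()):
            while dq and dq[0] < cutoff:
                dq.popleft()

    def global_rps(self) -> float:
        return len(self.global_window) / self.window_seconds

def recalculate_baseline(per_second_counts):
    """Recompute effective mean/stddev over at most 30 minutes of counts."""
    recent = list(per_second_counts)[-1800:]   # keep only 1800 seconds of history
    return mean(recent), pstdev(recent)

engine = SlidingWindowEngine()
for t in range(10):
    engine.add("1.2.3.4", float(t))
engine.add("5.6.7.8", 100.0)   # 100s later: the earlier timestamps are evicted
```

Feeding timestamps forward in time is all it takes; stale entries fall out of both the global and per-IP views on every add.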
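A minimal, self-contained sketch of such an evaluator follows, covering both the z-score rule and the 5x multiplier rule the article pairs with it. It is not the project's actual AnomalyEvaluator; the tightened thresholds used for error surges are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    anomalous: bool
    trigger: str  # "z-score", "multiplier", or "none"

def evaluate(current_rate: float, baseline_mean: float, baseline_stddev: float,
             z_threshold: float = 3.0, multiplier: float = 5.0) -> Verdict:
    """Mark traffic anomalous if EITHER decision check fires."""
    # Rule 1: statistical outlier relative to baseline variance.
    z = (current_rate - baseline_mean) / baseline_stddev if baseline_stddev > 0 else 0.0
    if z > z_threshold:
        return Verdict(True, "z-score")
    # Rule 2: blunt spike, robust even when variance is noisy.
    if baseline_mean > 0 and current_rate > multiplier * baseline_mean:
        return Verdict(True, "multiplier")
    return Verdict(False, "none")

def evaluate_ip(current_rate, baseline_mean, baseline_stddev,
                error_rate, error_mean):
    """Error-surge tightening: reuse the same checks with stricter
    (hypothetical) thresholds when an IP's error rate exceeds 3x baseline."""
    if error_mean > 0 and error_rate > 3 * error_mean:
        return evaluate(current_rate, baseline_mean, baseline_stddev,
                        z_threshold=2.0, multiplier=3.0)
    return evaluate(current_rate, baseline_mean, baseline_stddev)
```

Note how an IP that would pass the normal thresholds can still be flagged once its error rate pushes it onto the tightened branch.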
Rule 2: Multiplier check

This one asks: is current traffic many times bigger than the normal mean? The same method (AnomalyEvaluator.evaluate()) also applies the multiplier branch when z-score is not the first trigger.

- z-score catches statistical outliers
- multiplier catches blunt spikes even when variance is noisy

So if one misses, the other often catches.

Global vs Per-IP decisions

The same logic is applied in two scopes:

- global scope: total traffic rate
- per-IP scope: each source IP rate

This matters because some attacks are distributed (global spike), while others are noisy from one source (single-IP flood).

Error-surge tightening (adaptive sensitivity)

Attack traffic often produces lots of 4xx/5xx responses. So if an IP's error rate is much higher than normal, the detector becomes stricter for that IP:

- if IP 4xx/5xx rate > 3 x baseline error rate, use tighter thresholds

This is implemented by comparing the per-IP error window rate with the baseline error_mean, then switching to tighter thresholds (tightened_z_threshold, tightened_multiplier_threshold) for that source. This helps catch abusive behavior earlier, even if the total request rate is not yet extreme.

Cooldown to avoid alert spam

Without control, the same spike could trigger repeated alerts every second. So I added a short cooldown window between repeated alerts for the same scope/IP. That cooldown is enforced in detector/detector.py using monotonic timestamps (_last_global_alert_at and _last_ip_alert_at). That keeps alerts actionable instead of noisy.

What this looked like in my real test

During burst testing, I observed both:

- scope=global anomaly alerts
- scope=ip anomaly alerts with ban_candidate

I also saw the detector switch between:

- z-score trigger
- rate-multiplier trigger

That confirmed the decision engine is not relying on one fragile rule.

Global anomaly alert in Slack

Next, I will show the response side: iptables blocking, Slack notifications, and timed unban logic.

05 - When the Detector Acts (Not Just Watches)

Detection alone is not defense. A real system must respond. In this project, once an anomaly is confirmed, the flow becomes:

Anomaly -> Action -> Alert -> Audit

1) Per-IP anomaly response

If one IP behaves abnormally, the detector:

- inserts an iptables DROP rule for that source IP
- sends a Slack ban alert
- writes an audit log entry
- registers the ban in the unban scheduler

The action chain is orchestrated in run() inside detector/main.py, calling IptablesBlocker.block_ip() (detector/blocker.py), then UnbanScheduler.register_ban() (detector/unbanner.py), then the notifier and audit handlers. This is the direct containment path.
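The containment chain above can be sketched as follows. This is an illustrative mock, not the repo's IptablesBlocker, UnbanScheduler, or AuditLogger; the backoff values mirror the article's stated policy (10 min, 30 min, 2 h, then permanent), and the audit format follows the article's documented line layout.

```python
import subprocess
import time
from typing import Optional

# Backoff sequence mirroring blocking.ban_durations_seconds; None = permanent.
BAN_DURATIONS = [600, 1800, 7200, None]

class UnbanScheduler:
    """Track offenses per IP and schedule progressively longer bans."""

    def __init__(self):
        self.offenses = {}   # ip -> number of bans so far
        self.unban_at = {}   # ip -> monotonic deadline (absent if permanent)

    def register_ban(self, ip: str) -> Optional[int]:
        count = self.offenses.get(ip, 0)
        self.offenses[ip] = count + 1
        duration = BAN_DURATIONS[min(count, len(BAN_DURATIONS) - 1)]
        if duration is not None:
            self.unban_at[ip] = time.monotonic() + duration
        return duration

def block_ip(ip: str) -> None:
    """Insert a DROP rule at the top of the INPUT chain (requires root)."""
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)

def audit_line(action: str, ip: str, condition: str,
               rate: float, baseline: float, duration) -> str:
    """Render one structured audit entry in the article's documented format."""
    ts = time.strftime("%Y-%m-%dT%H:%M:%S")
    return f"[{ts}] {action} {ip} | {condition} | {rate} | {baseline} | {duration}"
```

Separating the firewall call, the scheduler, and the audit writer keeps each step testable on its own, which matches the article's "separate detection from action" lesson.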
iptables output showing blocked IP

2) Global anomaly response

If the whole traffic pattern spikes abnormally, but no single attacker is isolated, the detector:

- sends a Slack global alert only
- writes an audit event
- does not auto-ban all traffic

This avoids over-blocking during distributed spikes.

3) Slack notifications (operational visibility)

Every critical event is pushed to Slack with context:

- condition triggered
- current rate
- baseline value
- ban duration (for IP bans)

Those Slack payloads are assembled in detector/notifier.py via send_global_alert(), send_ban_alert(), and send_unban_alert().

Ban notification in Slack

Unban notification in Slack

So the security response is visible in real time, not hidden in terminal logs.

4) Auto-unban backoff policy

A banned IP is not always permanent. The scheduler uses progressive durations:

- 1st offense: 10 minutes
- 2nd offense: 30 minutes
- 3rd offense: 2 hours
- later offenses: permanent

This balances two goals:

- recover from possible false positives
- become stricter with repeat offenders

The backoff sequence is configured in detector/config.yaml (blocking.ban_durations_seconds) and applied by UnbanScheduler.register_ban() in detector/unbanner.py.

5) Audit trail (compliance + investigation)

For every key action, a structured line is written:

[timestamp] ACTION ip | condition | rate | baseline | duration

This exact format is written by AuditLogger.write() in detector/audit.py, and called from detector/main.py for BASELINE_RECALC, GLOBAL_ALERT, BAN, and UNBAN. This made troubleshooting much easier during testing because I could reconstruct exactly what happened and when.

Audit log with BAN, UNBAN, and BASELINE_RECALC events

What I observed in live testing

- anomaly detection trigger on both global and per-IP scopes
- iptables DROP rules added for abusive traffic
- matching structured audit entries written immediately
- detector continuing to run and recalculate baseline after enforcement

That confirmed the project had moved from "monitoring script" to a real response loop. Next, I will cover the dashboard and the testing strategy.

06 - Dashboard and Testing

At this stage, the core detector could already identify and respond to bad traffic.

1) Live metrics dashboard

I added a lightweight dashboard that refreshes every 3 seconds and shows:

- current global request rate
- top 10 source IPs
- currently banned IPs
- effective baseline mean/stddev
- CPU and memory usage
- detector uptime

This is implemented in detector/dashboard.py and started from detector/main.py when dashboard.enabled is true in detector/config.yaml. The dashboard exposes:

- / for human-readable UI
- /metrics for structured JSON output

Live dashboard metrics view

That gave me two benefits:

- instant operator visibility during attacks
- clear proof of live behavior

2) Testing strategy

I tested in layers, not all at once.

Ingestion test: confirm log parsing with real Nginx traffic.

curl -I http://<SERVER_IP>/

Expected: a new parsed event appears in detector output.

Window test: verify counts grow and decay correctly over 60 seconds.

for i in $(seq 1 40); do curl -s -o /dev/null -I http://<SERVER_IP>/; done

Expected: global_rps and per-IP counts increase, then decay after traffic stops.

Baseline test: verify periodic recalculation and floor behavior.

tail -f detection-engine/detector/audit.log

Expected: BASELINE_RECALC entries every configured interval with mean/stddev values.
Detection test: trigger both z-score and multiplier conditions.

for i in $(seq 1 300); do curl -s -o /dev/null -I http://<SERVER_IP>/; done

Expected: anomaly events for both global and per-IP scopes.

Response test: confirm ban, unban, Slack alert, and audit logging.

sudo iptables -L -n | rg DROP
tail -n 30 detection-engine/detector/audit.log

Expected: blocked IP rule appears, audit logs show BAN/UNBAN, and Slack receives matching alerts.

Dashboard test: confirm metrics update every <= 3 seconds.

curl http://<METRICS_SUBDOMAIN>/metrics

Expected: live JSON metrics update continuously (global_rps, baseline, banned IPs, uptime).

This step-by-step test order made debugging faster because each stage had a single purpose.

One caveat from real testing: I once got locked out of my own server because the detector correctly flagged and blocked the source IP I was using for SSH. That incident taught me to always protect administrator IP ranges in blocking.protected_cidrs, keep a recovery path (AWS Console/SSM), and run the detector under systemd so it can recover safely after interruptions.

3) Practical lessons that mattered most

- Separate detection from action. Decision logic stayed cleaner when firewall and notifier code lived in dedicated modules.
- Structured logs save time. During debugging, audit lines were faster to reason about than raw terminal noise.
- Baseline quality is everything. If baseline windows are weak, detector confidence collapses.
- Networking details matter. Container-to-host routing and bind addresses can break dashboards even when app logic is correct.
- Build first, polish later. Drafting documentation in batches prevented context loss while implementation evolved.
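Based on the option names the article cites (blocking.ban_durations_seconds, blocking.protected_cidrs, dashboard.enabled), the relevant slice of detector/config.yaml might look like this; the exact keys, values, and CIDR below are illustrative, not copied from the repo.

```yaml
blocking:
  # 10 min, 30 min, 2 h; a later offense becomes permanent
  ban_durations_seconds: [600, 1800, 7200]
  # Never auto-ban administrator/SSH source ranges (example CIDR)
  protected_cidrs:
    - "203.0.113.0/24"

dashboard:
  enabled: true
```

Keeping the protected ranges in config rather than code means an operator can widen them during an incident without redeploying the detector.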

Closing note

This project started as "watch traffic" and ended as a full loop: observe -> learn -> detect -> respond -> explain. And that last part, explain, is what makes the system defensible in both engineering reviews and security operations.
