
Building a Rolling-Baseline HTTP Anomaly Detector (No Fail2Ban)

2026-04-28, admin

Every VPS running a public web app gets hit with traffic it didn't ask for: scrapers, brute-force login attempts, or just someone's misconfigured bot hammering the same endpoint every second. Most tutorials say "install Fail2Ban and move on." But what if you want to understand the traffic before you block it? What if you need thresholds that adapt to your actual load instead of a hardcoded "5 failures in 10 minutes"?

That's what I built for the HNG DevOps track: a Python daemon that tails Nginx access logs, compares live request rates to a rolling 30-minute baseline, and reacts. Global spikes trigger Slack alerts, abusive individual IPs get an iptables DROP rule, and tiered auto-unban ensures a single bad minute doesn't permanently lock someone out.

Repository: github.com/Trojanhorse7/hng-anomaly-detector

The Stack

The whole system runs on a single Linux VPS with Docker Compose:

- Nextcloud: the upstream kefaslungu/hng-nextcloud image, unmodified.
- Nginx: reverse proxy in front of Nextcloud, configured to write JSON-formatted access logs (not the default combined format). This is critical: structured logs let the detector parse fields reliably instead of regex-guessing.
- Detector: a Python 3.12 container that tails the shared log volume, runs the detection logic, calls Slack, and executes iptables commands on the host.
- Shared volume: a named Docker volume (HNG-nginx-logs) that Nginx writes to and the detector reads from.

The detector container runs with network_mode: host and cap_add: NET_ADMIN so its iptables calls affect the actual host firewall, not an isolated container network.

How Detection Works

The detection pipeline has three layers: sliding windows, rolling baseline, and anomaly evaluation.

Layer 1: Sliding Windows (60 seconds)

Every parsed log line feeds into collections.deque structures: one global deque for all requests, and one per source IP. Timestamps older than 60 seconds are continuously evicted from the left side. At any moment, RPS = count / 60, where count is the number of timestamps still in the deque.

There's no "bucket per minute" approximation. Every request is tracked individually and aged out precisely. Parallel deques track 4xx/5xx errors separately for the error-surge path (described under Layer 3).

Layer 2: Rolling Baseline (30 minutes)

A background thread recomputes the baseline every 60 seconds. It builds a dense vector of per-second request counts over the last 1,800 seconds (30 minutes) and calculates:

- effective_mean: average requests per second
- effective_std: standard deviation of the per-second counts

There's an important twist: if enough samples exist in the current UTC hour, the baseline uses only that hour's data instead of the full 30-minute window. This matters because traffic patterns shift. 2 AM is different from 2 PM, and the baseline should reflect current conditions, not a blend of quiet and busy periods.

Floor values on the mean and standard deviation prevent divide-by-zero edge cases in z-score calculations. Every recompute is audited to a structured log file with the timestamp, source (hourly vs full window), and the computed mean/std.

Layer 3: Anomaly Evaluation

For each incoming request, the detector compares current RPS to the baseline. An anomaly fires if either condition is true:

- Z-score > threshold (default 3.0): the current rate is more than 3 standard deviations above the baseline mean
- Rate > multiplier × baseline mean (default 5×): the current rate is more than 5 times the average

Error surge tightening: if an IP's error RPS (4xx/5xx responses) exceeds 3× the baseline error mean, thresholds tighten automatically. The z-score threshold drops to 2.0 and the rate multiplier drops to 3×. This means an IP generating lots of failed requests gets scrutinized more aggressively, which is exactly what you want for brute-force login attempts.

    Normal:      z > 3.0 OR rate > 5 × mean → anomaly
    Error surge: z > 2.0 OR rate > 3 × mean → anomaly (tighter)

What Happens When an Anomaly Fires

The system distinguishes between global and per-IP anomalies, and they trigger different responses.

Global Anomaly → Slack Only

If the aggregate RPS across all IPs spikes above the baseline, the detector sends a Slack notification. It does not apply iptables rules, because blocking all traffic would take the service down. Global alerts are informational: "your server is seeing unusual load right now." A cooldown (default 120 seconds) prevents Slack spam if the global anomaly persists for minutes.

Per-IP Anomaly → iptables DROP + Slack + Audit

If a single IP is responsible for anomalous traffic, the detector:

- Adds an iptables -I INPUT -s <IP> -j DROP rule, so the IP is immediately blocked at the kernel level, before Nginx even sees the packets.
- Sends a Slack notification with the IP, the detection condition (z-score or rate multiplier), the current rate, and the baseline stats.
- Writes a structured audit log entry with all the same details plus the ban duration.

Tiered Auto-Unban

Permanently banning IPs from a single spike is too aggressive. The system uses escalating timeouts:

- 1st strike: 10 minutes
- 2nd strike: 30 minutes
- 3rd strike: 2 hours
- 4th strike: permanent

A background thread checks every 3 seconds for IPs whose ban has expired, removes the iptables rule, and sends an unban Slack notification. The strike counter persists across container restarts via a JSON file (ban_state.json).

This means a legitimate user who triggered a false positive gets unblocked in 10 minutes. A repeat offender escalates through the tiers. By the 4th strike, they're gone for good.

The Audit Trail

Every significant event is appended to a structured log file at data/audit.log:

- BASELINE_RECALC: every 60 seconds, with source (hourly vs full), mean, std
- BAN: IP, condition, rate, baseline stats, duration
- UNBAN: IP, reason, historical ban count

This file is the source of truth for debugging, compliance, and the baseline graph (more below).

The Dashboard

A FastAPI server on port 8080 serves a single-page dashboard with live metrics via WebSocket push (every 2.5 seconds). If WebSocket fails (e.g., behind a proxy without Upgrade support), the page falls back to HTTP polling automatically.

The /api/state JSON endpoint returns:

- Uptime, event count, CPU/memory
- Current global RPS and baseline effective_mean / effective_std
- List of currently banned IPs with tier info
- Top 10 source IPs by request count in the current window

Baseline Over Time

One of the requirements was demonstrating that the baseline actually adapts. By parsing BASELINE_RECALC lines from the audit log and plotting effective_mean over time, you can see the baseline shift as traffic patterns change between UTC hours.

During a busy period, effective_mean climbs. When traffic drops, it falls. The hourly-slice preference means the baseline reacts to the current hour's pattern rather than being dragged by stale data from 25 minutes ago.

Lessons Learned

1. JSON logs are non-negotiable. Running a regex over Nginx's default combined log format is fragile. One unusual user-agent string with spaces and quotes breaks your parser. JSON logs, written with escape=json on the log_format directive in the Nginx config, give you reliable field extraction every time.

2. Host networking in Docker is powerful but surprising. network_mode: host means the container shares the host's network stack, so iptables rules apply to the actual server, not a virtual bridge. This is exactly what you want for blocking IPs, but it also means port conflicts are your problem.

3. Hardcoded thresholds are the enemy. "Block after 100 requests per minute" sounds reasonable until your app legitimately serves 200 req/s during peak hours. A rolling baseline that adapts to actual traffic means your thresholds stay meaningful whether you're serving 2 req/s at 3 AM or 50 req/s at noon.

4. Tiered responses prevent self-inflicted outages. The first time I tested with aggressive thresholds, my own monitoring IP got permanently banned. Escalating tiers (10m → 30m → 2h → permanent) give false positives a way to recover while still catching persistent abuse.

5. Audit everything. When something goes wrong (a legitimate user gets blocked, or an attack slips through), the audit log tells you exactly what the baseline was, what the detector saw, and why it made the decision it did. Without that, you're guessing.

Running It Yourself

    git clone https://github.com/Trojanhorse7/hng-anomaly-detector
    cd hng-anomaly-detector
    cp .env.example .env
    # Set SLACK_WEBHOOK_URL in .env
    docker compose build && docker compose up -d

Once the stack is up, Nextcloud is at http://<VPS_IP>/ and the dashboard at http://<VPS_IP>:8080/. Thresholds, window sizes, and ban durations are all in detector/config.yaml, so no code changes are needed to tune the system.

What I'd Improve

- Per-IP baselines: currently all IPs are compared against the global baseline. High-traffic legitimate IPs (like a CDN edge) could benefit from their own rolling stats.
- HTTPS on the dashboard: right now it's plain HTTP on 8080. A reverse proxy with TLS would be better for production.
- Prometheus/Grafana: the audit log works, but a proper time-series database would make baseline visualization trivial.
- IPv6: the current implementation only handles IPv4 in iptables rules.

Built for the HNG DevOps track. The full source is at github.com/Trojanhorse7/hng-anomaly-detector.
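To make the JSON-log point from the stack description and Lessons Learned concrete, here is a minimal sketch of parsing one access-log line. It is not taken from the repository. The field names remote_addr, status, and msec are assumptions that mirror Nginx variable names; the real log_format may use different keys.

```python
import json


def parse_line(line):
    """Return a dict with ip, status, ts, is_error, or None for unusable lines.

    Field names (remote_addr, status, msec) are hypothetical; adjust them to
    whatever keys the Nginx log_format actually emits.
    """
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return None  # partial write or non-JSON line: skip it
    try:
        status = int(rec["status"])
        return {
            "ip": rec["remote_addr"],
            "status": status,
            "ts": float(rec["msec"]),      # epoch seconds with millis
            "is_error": status >= 400,     # feeds the 4xx/5xx error deques
        }
    except (KeyError, TypeError, ValueError):
        return None  # valid JSON, but not a record we understand
```

Returning None instead of raising keeps a tailing loop alive when a line is truncated mid-write, which is a common situation when reading a file another process is appending to.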
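The 60-second sliding window from Layer 1 can be sketched roughly as follows. This is an illustration of the deque-and-evict idea under my own naming (SlidingWindow, record), not the repository's code.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # window length stated in the article


class SlidingWindow:
    """Holds request timestamps and drops those older than the window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.hits = deque()

    def _evict(self, now):
        # Timestamps arrive in order, so the oldest is always on the left.
        cutoff = now - self.window
        while self.hits and self.hits[0] <= cutoff:
            self.hits.popleft()

    def add(self, ts):
        self.hits.append(ts)
        self._evict(ts)

    def rps(self, now):
        self._evict(now)
        return len(self.hits) / self.window


# One window for all traffic, one per source IP (hypothetical module state).
global_window = SlidingWindow()
per_ip = defaultdict(SlidingWindow)


def record(ip, ts):
    """Feed one parsed request into the global and per-IP windows."""
    global_window.add(ts)
    per_ip[ip].add(ts)
```

Both append and popleft on a deque are O(1), so each request costs constant amortized time however busy the window is. A long-running daemon would also want to drop per-IP windows that have gone empty, which this sketch leaves out.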
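The Layer 2 baseline recompute, including the current-UTC-hour preference and the floor values, might look something like this. The article gives neither the "enough samples" cutoff nor the floor values, so MIN_HOURLY_SAMPLES, MEAN_FLOOR, and STD_FLOOR below are placeholders, and the function name and input shape are my own.

```python
import statistics

BASELINE_SECONDS = 1800   # 30 minutes, from the article
MIN_HOURLY_SAMPLES = 300  # assumed cutoff for "enough samples"
MEAN_FLOOR = 0.1          # assumed floor values
STD_FLOOR = 0.1


def compute_baseline(counts_by_second, now):
    """counts_by_second maps an integer epoch second to its request count."""
    now = int(now)
    window_start = now - BASELINE_SECONDS + 1
    # Epoch seconds are UTC-based, so this is the start of the current UTC hour.
    hour_start = now - (now % 3600)
    hourly_start = max(window_start, hour_start)

    if now - hourly_start + 1 >= MIN_HOURLY_SAMPLES:
        source, start = "hourly", hourly_start
    else:
        source, start = "full", window_start

    # Dense vector: seconds with no requests count as zero.
    dense = [counts_by_second.get(s, 0) for s in range(start, now + 1)]
    return {
        "source": source,
        "samples": len(dense),
        "effective_mean": max(statistics.fmean(dense), MEAN_FLOOR),
        "effective_std": max(statistics.pstdev(dense), STD_FLOOR),
    }
```

Filling idle seconds with zeros matters: averaging only the seconds that saw traffic would overstate the mean on a quiet server and make real spikes look smaller than they are.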
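The Layer 3 decision rule, with the error-surge tightening, fits in a few lines. The thresholds (3.0 / 5× normally, 2.0 / 3× under an error surge, surge at 3× the baseline error mean) come from the article; the function signature and return values are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    z: float
    multiplier: float


NORMAL = Thresholds(z=3.0, multiplier=5.0)
TIGHT = Thresholds(z=2.0, multiplier=3.0)   # used during an error surge
ERROR_SURGE_FACTOR = 3.0


def evaluate(rate, mean, std, error_rate=0.0, error_mean=0.0):
    """Return the condition that fired ("zscore", "rate_multiplier") or None."""
    surge = error_mean > 0 and error_rate > ERROR_SURGE_FACTOR * error_mean
    limits = TIGHT if surge else NORMAL

    # The baseline floors std upstream; guard again so this never divides by zero.
    z = (rate - mean) / max(std, 1e-9)
    if z > limits.z:
        return "zscore"
    if rate > limits.multiplier * mean:
        return "rate_multiplier"
    return None
```

For example, with a baseline mean of 2.0 and std of 2.0, a rate of 7.0 gives z = 2.5. That passes under the normal thresholds, but the same IP with an error rate above 3× the error baseline trips the tightened z > 2.0 check.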
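The per-IP response and the tiered auto-unban can be sketched as a strike-to-duration lookup plus the matching iptables commands. The tier durations and the -I INPUT ... -j DROP rule are from the article; the helper names and the dry_run flag are hypothetical, and the -D form for removal is my assumption about how the unban is done. One caveat I'm sure of: an INPUT rule only filters packets addressed to the host itself, so whether it also covers a bridged Nginx container depends on the Compose networking, which the article doesn't spell out.

```python
import ipaddress
import subprocess

# 10 minutes, 30 minutes, 2 hours, then permanent (None), per the article.
BAN_TIERS = [600, 1800, 7200, None]


def ban_duration(strikes):
    """Seconds to ban for the given 1-based strike count; None means permanent."""
    strikes = max(1, strikes)
    return BAN_TIERS[min(strikes, len(BAN_TIERS)) - 1]


def iptables_cmd(action, ip):
    """Build the argv list for banning (-I) or unbanning (-D) an IPv4 address."""
    ipaddress.IPv4Address(ip)  # raises ValueError on anything but valid IPv4
    flag = {"ban": "-I", "unban": "-D"}[action]
    return ["iptables", flag, "INPUT", "-s", ip, "-j", "DROP"]


def apply_rule(action, ip, dry_run=True):
    """Run the command unless dry_run is set; always return what would run."""
    cmd = iptables_cmd(action, ip)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root / NET_ADMIN
    return cmd
```

Passing an argv list rather than a shell string, and validating the address first, keeps a malformed log field from ever reaching a shell. Expiry is then just the ban time plus ban_duration(strikes), with None skipped by the unban thread.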




InfinitSec - Latest Cybersecurity, Technology & Gaming News
