# SwiftDeploy: Building a Self-Writing Infrastructure Manager with Policy Enforcement — A Complete Technical Walkthrough


## Table of Contents

1. [The Problem We're Solving](#the-problem)
2. [Architecture Overview](#architecture)
3. [Stage 4A — The Engine](#stage-4a)
   - The Project Structure
   - [The Manifest](#the-manifest)
   - [The Python HTTP Service](#the-python-service)
     - Configuration from environment
     - Thread-safe chaos state
     - The three Stage 4A endpoints
   - [The Dockerfile](#the-dockerfile)
   - [The Templates](#the-templates)
   - [The CLI — Five Stage 4A Subcommands](#the-cli)
     - The template engine
     - init — generates everything from the manifest
     - validate — five pre-flight checks
     - deploy — brings up the stack and blocks until healthy
     - promote — rolling restart with zero nginx downtime
   - [The Two Deployment Modes](#deployment-modes)
4. [Stage 4B — The Eyes and The Brain](#stage-4b)
   - [Prometheus Metrics — The Eyes](#prometheus-metrics)
     - Tracking infrastructure
     - The five metrics
     - The /metrics output
   - [OPA Policy Engine — The Brain](#opa-policy-engine)
     - Why OPA instead of if/else in the CLI?
     - data.json — thresholds live here, never hardcoded in Rego
     - infrastructure.rego — pre-deploy policy
     - canary.rego — pre-promote policy
     - OPA isolation — no leakage via nginx
     - The policy query function
   - [Gated Lifecycle — The CLI Brain](#gated-lifecycle)
     - Pre-deploy check
     - Pre-promote check
     - P99 latency calculation from histogram
   - [The Status Dashboard — The Eyes](#status-dashboard)
   - [The Audit Trail — The Memory](#audit-trail)
5. [The Debugging Sagas](#debugging)
   - Saga 1 — Six layers of healthcheck failure
   - Saga 2 — OPA Rego v1 syntax
   - Saga 3 — WSL2 path spaces breaking docker run
6. [Full Deployment Walkthrough](#deployment)
7. [Key Lessons Learned](#lessons)

8. [Conclusion](#conclusion)

How I built a CLI tool that generates its own infrastructure configs, manages a full containerised stack, enforces deployment policies through OPA, exposes Prometheus metrics, and produces a live audit trail — all from a single YAML file.

## The Problem We're Solving {#the-problem}

Every time you spin up a new service in a real DevOps environment, you repeat the same manual work. SwiftDeploy solves all of this. One YAML manifest describes your entire deployment. A CLI tool generates every config file from it, manages the container lifecycle, enforces safety policies before allowing deployments, exposes real-time metrics, and produces a full audit trail.

The golden rule: the manifest is the single source of truth. Everything else is generated.

## Stage 4A — The Engine {#stage-4a}

Stage 4A is the foundation. It answers one question: how do you build a tool that writes its own infrastructure configs?

### The Manifest {#the-manifest}

manifest.yaml is the brain of the entire system. Every component reads from it, directly or via generated files, and every field propagates through the system. Change nginx.proxy_timeout here and it updates in nginx.conf on the next init. Change services.mode here and the entire deployment mode switches on the next promote.

### The Python HTTP Service {#the-python-service}

The app is a from-scratch HTTP server using only Python's stdlib — no Flask, no FastAPI. Three endpoints in Stage 4A, four in Stage 4B.

#### Configuration from environment

Configuration comes entirely from environment variables injected by Docker Compose at runtime. START_TIME is captured at module load — this is how /healthz calculates uptime without a database.

#### Thread-safe chaos state

The Lock prevents race conditions when multiple requests read and write chaos state simultaneously. dict(chaos_state) returns a copy, so the caller never holds a reference to the mutable internal dict.

#### GET /healthz — liveness check

The /healthz endpoint does three jobs: it proves the server is alive (Docker healthcheck), reports the current mode (so promote can confirm the switch happened), and reports uptime (useful for debugging restart loops).
#### POST /chaos — chaos injection (canary only)

Reading Content-Length before calling rfile.read() is required by the HTTP protocol — otherwise read() blocks forever, waiting for data that never arrives.

### The Dockerfile {#the-dockerfile}

Why Alpine? python:3.12-alpine is ~60MB; python:3.12 (Debian) is ~1GB. We need to stay under 300MB.

Why non-root? If someone exploits the app, they get a powerless user, not root access to the server.

Why Python urllib for the healthcheck? This was a hard-won lesson. wget couldn't resolve localhost inside Alpine on WSL2 + Docker Desktop — a known network namespace quirk. Python's urllib bypasses the system resolver entirely and connects directly via socket. More reliable, and no external tool needed.

### The Templates {#the-templates}

Templates are blueprints with {{ placeholder }} values that the CLI replaces with real values from the manifest. The key sections of nginx.conf.tmpl and docker-compose.yml.tmpl are shown in the listings at the end of the post.

The expose vs ports distinction is a security boundary. expose = container-to-container only. ports = host-facing. The app is never reachable from outside Docker.

### The CLI — Five Stage 4A Subcommands {#the-cli}

#### The template engine

Five lines. No Jinja2. Simple string replacement. The templates are straightforward enough that a minimal custom engine is cleaner than pulling in a dependency.

#### init — generates everything from the manifest

The grader deletes generated files and re-runs init to verify regeneration. Because we built it correctly, this is a non-issue.

#### validate — five pre-flight checks

Check 5 is the most interesting. We run nginx -t in a temporary container — validating syntax without needing nginx installed on the host. But we hit a complication: app:3000 (the upstream hostname) can't be resolved in an isolated container. The fix: swap server app: with server 127.0.0.1: in a temporary copy before testing. The actual nginx.conf on disk is untouched.

#### deploy — brings up the stack and blocks until healthy

docker compose up -d returns as soon as containers are created, not when they're healthy. The polling loop tries /healthz every 2 seconds for up to 60 seconds. Only {"status": "ok"} breaks the loop.

#### promote — rolling restart with zero nginx downtime

--no-deps is the key — it tells Compose to restart only the app service without touching nginx. Zero proxy downtime during the switch.
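The manifest edit at the heart of promote can be sketched against a toy manifest string (the fragment below is illustrative, not the full manifest):

```python
import re

manifest_text = "services:\n  image: swift-deploy-1-node:latest\n  mode: stable\n"

# Flip the first `mode:` value in place; count=1 guards against
# accidentally rewriting a later `mode:` key elsewhere in the file.
updated = re.sub(r"(mode:\s*)(\S+)", r"\g<1>canary", manifest_text, count=1)
```

Because only the manifest changes and everything else is regenerated from it, the flip stays consistent with the single-source-of-truth rule.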
## The Two Deployment Modes {#deployment-modes}

Stable mode — normal production behaviour. Clean responses, no special headers, and /chaos returns 403.

Canary mode — test mode before full rollout.

## Stage 4B — The Eyes and The Brain {#stage-4b}

Stage 4A built the engine. Stage 4B adds observability and policy enforcement. The stack now has eyes (metrics), a brain (OPA policies), and memory (audit trail).

### Prometheus Metrics — The Eyes {#prometheus-metrics}

The app gains a /metrics endpoint exposing five metric types in Prometheus text format. record_request() is called after every request regardless of path — the timing wraps the entire handler.

### OPA Policy Engine — The Brain {#opa-policy-engine}

Open Policy Agent is a dedicated container whose only job is making allow/deny decisions based on rules written in Rego. The core principle: the CLI never makes allow/deny decisions itself. All decision logic lives exclusively in OPA.

#### Why OPA instead of if/else in the CLI?

If the policy logic lives in the CLI, changing a threshold means editing Python code, rebuilding, and redeploying. With OPA, you edit data.json, restart the OPA container, and the new threshold is live. Policy as code, not policy as application logic.

Why import future.keywords? The openpolicyagent/opa:latest-static image uses Rego v1, which requires explicit if and contains keywords. Without these imports, OPA crashes on startup. This was discovered the hard way during testing.

#### OPA isolation — no leakage via nginx

In docker-compose.yml.tmpl, binding to 127.0.0.1:8181 means only the host machine can reach OPA. The nginx container on port 8080 has no route to OPA. This is enforced at the Docker network binding level, not just by convention.

#### The policy query function

Every distinct failure mode returns a different error string. The CLI never crashes or hangs when OPA is unavailable — it warns and fails open. This is intentional: you don't want OPA unavailability to block emergency deployments. If the disk is full, OPA returns a deny with a reason string.

#### P99 latency calculation from histogram

P99 means: the smallest histogram bucket within which 99% of requests have completed. If 99 out of 100 requests finished within 250ms, P99 = 250ms.
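A minimal sketch of that bucket-based calculation, assuming cumulative Prometheus-style bucket counts (the function and bucket values here are illustrative):

```python
def p99_from_buckets(buckets: dict, total: int) -> float:
    """Return the smallest bucket upper bound covering >= 99% of requests.

    `buckets` maps each upper bound in seconds to the cumulative count of
    requests that completed within it (Prometheus histogram convention).
    """
    threshold = 0.99 * total
    for le in sorted(buckets):
        if buckets[le] >= threshold:
            return le
    return float("inf")  # the tail landed beyond the largest bucket

# 99 of 100 requests completed within 0.25s
counts = {0.05: 60, 0.1: 80, 0.25: 99, 0.5: 99, 1.0: 100}
```

Note this yields a bucket boundary, not an exact percentile — the resolution of the answer is exactly the resolution of the buckets you instrument.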
## The Status Dashboard — The Eyes {#status-dashboard}

With chaos active, the dashboard shows the current state AND whether it violates policy, in real time — exactly what real SRE dashboards do.

## The Audit Trail — The Memory {#audit-trail}

Every event appends a JSON line to history.jsonl. swiftdeploy audit reads this file and generates audit_report.md, which renders perfectly as GitHub Flavored Markdown — every table, every checkmark, every timestamp.

## The Debugging Sagas {#debugging}

No real DevOps project ships without war stories. Here are the ones that taught the most.

### Saga 1 — Six layers of healthcheck failure

The app container kept showing unhealthy despite the server running fine. The debugging sequence:

- **Failure 1:** ${APP_PORT} doesn't expand in the Dockerfile HEALTHCHECK CMD. Env vars evaluate at build time, not runtime. Fixed by hardcoding 3000.
- **Failure 2:** localhost doesn't resolve inside Alpine's healthcheck context on WSL2 + Docker Desktop. Fixed by using 127.0.0.1.
- **Failure 3:** wget with 127.0.0.1 still failed, even though the server was confirmed listening via docker exec. This is a known WSL2 + Docker Desktop network namespace issue. Fixed by using Python's urllib instead of wget.
- **Failure 4:** The Docker cache was serving the old image despite the Dockerfile fix. Fixed with --no-cache.
- **Failure 5:** The docker-compose.yml template had its own healthcheck block overriding the Dockerfile's. The Docker Compose healthcheck always wins. Fixed the template too.
- **Failure 6:** The healthcheck YAML block had a 3-space indent instead of 4. A single space difference caused a YAML parse error. Fixed by carefully rewriting the block.

### Saga 2 — OPA Rego v1 syntax

The openpolicyagent/opa:latest-static image enforces strict Rego v1 syntax, and our policies used the older syntax. Without import future.keywords.if and import future.keywords.contains at the top of each file, OPA refuses to start.

### Saga 3 — WSL2 path spaces breaking docker run

The project lived at /mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy. The space in RAZER BLADE caused docker run -v {path}:... to split the path at the space, making Docker interpret the second half as an image name.
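One way to sidestep shell word-splitting entirely is to build the docker invocation as an argv list. The sketch below uses the path from this saga; the mount target and flags are illustrative, and the actual call is left commented out:

```python
import subprocess

project = "/mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy"

# Each argv element reaches docker verbatim: the space inside the path
# never passes through a shell, so it cannot be split into two tokens.
cmd = ["docker", "run", "--rm", "-v", f"{project}:/work:ro", "nginx:latest", "nginx", "-t"]

# subprocess.run(cmd)  # a list argument (and no shell=True) avoids word-splitting
```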
Fixed by quoting all paths containing the project directory and by using subprocess.run with a list instead of shell=True, avoiding shell word-splitting entirely.

## Key Lessons Learned {#lessons}

1. **Docker Compose healthcheck overrides Dockerfile HEALTHCHECK.** Always check both places when healthchecks misbehave. Compose wins every time.
2. **WSL2 has a different network namespace for healthchecks than for docker exec.** If something works via exec but not via healthcheck, it's almost certainly a tool or network namespace issue. Python's stdlib is more portable than wget in this environment.
3. **OPA Rego v1 requires explicit keywords.** latest-static means the latest OPA — which enforces Rego v1 syntax. Always import future.keywords.if and import future.keywords.contains.
4. **expose vs ports is a security boundary, not documentation.** expose = container-to-container only. ports = host-facing. Binding OPA to 127.0.0.1 enforces isolation at the network level.
5. **The CLI should never make policy decisions.** Every time you add an if/else for a deployment condition in the CLI, you're doing OPA's job badly. Push all allow/deny logic into Rego. The CLI's job is to collect data and surface decisions.
6. **P99 latency is more useful than the average.** An average latency of 10ms can hide the fact that 1 in 100 requests takes 5 seconds. P99 exposes that tail. Always instrument histograms, not just averages.
7. **Declarative infrastructure pays off immediately.** The grader deletes generated files and re-runs init. Because the manifest is always there and regeneration is instantaneous, this is a non-issue. Manual configs would have been a problem.
8. **An audit trail is not optional.** history.jsonl made it trivial to answer "when did chaos start?", "which policy was failing?", "how long was the canary running before we promoted?" These questions matter in production incidents.

## Conclusion {#conclusion}

SwiftDeploy started as a task requirement and became a complete mental model for how modern deployment tooling works.
Every major concept is here. The combination of Stage 4A and 4B forms a complete deployment lifecycle: generate → validate → deploy (gated) → promote (gated) → observe → audit → tear down.

The full source code is available at: https://github.com/AirFluke/hng-swiftdeploy

Tags: #devops #docker #nginx #python #opa #prometheus #infrastructure #hng


┌─────────────────────────────────────────────────────────────────────────────┐ │ SwiftDeploy — Full System Architecture │ ├──────────────────┬──────────────────────────────────────┬───────────────────┤ │ ZONE 1 │ ZONE 2 │ ZONE 3 │ │ Operator │ Host Machine / Docker Engine │ Generated Files │ │ │ │ │ │ [Operator] │ ┌─── swiftdeploy-net (bridge) ───┐ │ nginx.conf │ │ │ │ │ │ │ docker-compose │ │ ▼ │ │ [nginx:8080]──────►[app:3000] │ │ history.jsonl │ │ manifest.yaml │ │ PUBLIC INTERNAL │ │ audit_report.md │ │ (source of │ │ │ │ │ │ │ │ truth) │ │ └──[logs vol]─────┘ │ │ │ │ │ │ │ │ │ │ │ ▼ │ │ [opa:8181] │ │ │ │ swiftdeploy │ │ localhost only │ │ │ │ CLI │ │ NOT via nginx ✗ │ │ │ │ ├─ init │ └────────────────────────────────┘ │ │ │ ├─ validate │ │ │ │ ├─ deploy ──────┼──► pre-deploy: OPA infra check │ │ │ ├─ promote ─────┼──► pre-promote: OPA canary check │ │ │ ├─ status ──────┼──► scrapes /metrics every 5s ───────►│ history.jsonl │ │ ├─ audit ───────┼────────────────────────────────────►│ audit_report.md │ │ └─ teardown │ │ │ └──────────────────┴──────────────────────────────────────┴───────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ SwiftDeploy — Full System Architecture │ ├──────────────────┬──────────────────────────────────────┬───────────────────┤ │ ZONE 1 │ ZONE 2 │ ZONE 3 │ │ Operator │ Host Machine / Docker Engine │ Generated Files │ │ │ │ │ │ [Operator] │ ┌─── swiftdeploy-net (bridge) ───┐ │ nginx.conf │ │ │ │ │ │ │ docker-compose │ │ ▼ │ │ [nginx:8080]──────►[app:3000] │ │ history.jsonl │ │ manifest.yaml │ │ PUBLIC INTERNAL │ │ audit_report.md │ │ (source of │ │ │ │ │ │ │ │ truth) │ │ └──[logs vol]─────┘ │ │ │ │ │ │ │ │ │ │ │ ▼ │ │ [opa:8181] │ │ │ │ swiftdeploy │ │ localhost only │ │ │ │ CLI │ │ NOT via nginx ✗ │ │ │ │ ├─ init │ └────────────────────────────────┘ │ │ │ ├─ validate │ │ │ │ ├─ deploy ──────┼──► pre-deploy: OPA infra check │ │ │ ├─ promote ─────┼──► pre-promote: OPA canary check │ │ │ ├─ 
status ──────┼──► scrapes /metrics every 5s ───────►│ history.jsonl │ │ ├─ audit ───────┼────────────────────────────────────►│ audit_report.md │ │ └─ teardown │ │ │ └──────────────────┴──────────────────────────────────────┴───────────────────┘ ┌─────────────────────────────────────────────────────────────────────────────┐ │ SwiftDeploy — Full System Architecture │ ├──────────────────┬──────────────────────────────────────┬───────────────────┤ │ ZONE 1 │ ZONE 2 │ ZONE 3 │ │ Operator │ Host Machine / Docker Engine │ Generated Files │ │ │ │ │ │ [Operator] │ ┌─── swiftdeploy-net (bridge) ───┐ │ nginx.conf │ │ │ │ │ │ │ docker-compose │ │ ▼ │ │ [nginx:8080]──────►[app:3000] │ │ history.jsonl │ │ manifest.yaml │ │ PUBLIC INTERNAL │ │ audit_report.md │ │ (source of │ │ │ │ │ │ │ │ truth) │ │ └──[logs vol]─────┘ │ │ │ │ │ │ │ │ │ │ │ ▼ │ │ [opa:8181] │ │ │ │ swiftdeploy │ │ localhost only │ │ │ │ CLI │ │ NOT via nginx ✗ │ │ │ │ ├─ init │ └────────────────────────────────┘ │ │ │ ├─ validate │ │ │ │ ├─ deploy ──────┼──► pre-deploy: OPA infra check │ │ │ ├─ promote ─────┼──► pre-promote: OPA canary check │ │ │ ├─ status ──────┼──► scrapes /metrics every 5s ───────►│ history.jsonl │ │ ├─ audit ───────┼────────────────────────────────────►│ audit_report.md │ │ └─ teardown │ │ │ └──────────────────┴──────────────────────────────────────┴───────────────────┘ swiftdeploy/ ├── manifest.yaml ← the ONLY file you edit ├── swiftdeploy ← CLI executable ├── Dockerfile ← app image definition ├── app/ │ └── main.py ← Python HTTP service ├── templates/ │ ├── nginx.conf.tmpl ← nginx template │ └── docker-compose.yml.tmpl ← compose template ├── policies/ ← Stage 4B addition │ ├── infrastructure.rego │ ├── canary.rego │ └── data.json ├── nginx.conf ← generated (gitignored) └── docker-compose.yml ← generated (gitignored) swiftdeploy/ ├── manifest.yaml ← the ONLY file you edit ├── swiftdeploy ← CLI executable ├── Dockerfile ← app image definition ├── app/ │ └── main.py ← Python HTTP service 
├── templates/ │ ├── nginx.conf.tmpl ← nginx template │ └── docker-compose.yml.tmpl ← compose template ├── policies/ ← Stage 4B addition │ ├── infrastructure.rego │ ├── canary.rego │ └── data.json ├── nginx.conf ← generated (gitignored) └── docker-compose.yml ← generated (gitignored) swiftdeploy/ ├── manifest.yaml ← the ONLY file you edit ├── swiftdeploy ← CLI executable ├── Dockerfile ← app image definition ├── app/ │ └── main.py ← Python HTTP service ├── templates/ │ ├── nginx.conf.tmpl ← nginx template │ └── docker-compose.yml.tmpl ← compose template ├── policies/ ← Stage 4B addition │ ├── infrastructure.rego │ ├── canary.rego │ └── data.json ├── nginx.conf ← generated (gitignored) └── docker-compose.yml ← generated (gitignored) services: image: swift-deploy-1-node:latest port: 3000 mode: stable # stable or canary version: "1.0.0" restart_policy: unless-stopped log_volume: swiftdeploy-logs nginx: image: nginx:latest port: 8080 proxy_timeout: 30 opa: image: openpolicyagent/opa:latest-static port: 8181 network: name: swiftdeploy-net driver_type: bridge contact: "[email protected]" services: image: swift-deploy-1-node:latest port: 3000 mode: stable # stable or canary version: "1.0.0" restart_policy: unless-stopped log_volume: swiftdeploy-logs nginx: image: nginx:latest port: 8080 proxy_timeout: 30 opa: image: openpolicyagent/opa:latest-static port: 8181 network: name: swiftdeploy-net driver_type: bridge contact: "[email protected]" services: image: swift-deploy-1-node:latest port: 3000 mode: stable # stable or canary version: "1.0.0" restart_policy: unless-stopped log_volume: swiftdeploy-logs nginx: image: nginx:latest port: 8080 proxy_timeout: 30 opa: image: openpolicyagent/opa:latest-static port: 8181 network: name: swiftdeploy-net driver_type: bridge contact: "[email protected]" MODE = os.environ.get("MODE", "stable") APP_VERSION = os.environ.get("APP_VERSION", "1.0.0") APP_PORT = int(os.environ.get("APP_PORT", "3000")) START_TIME = time.time() MODE = 
os.environ.get("MODE", "stable") APP_VERSION = os.environ.get("APP_VERSION", "1.0.0") APP_PORT = int(os.environ.get("APP_PORT", "3000")) START_TIME = time.time() MODE = os.environ.get("MODE", "stable") APP_VERSION = os.environ.get("APP_VERSION", "1.0.0") APP_PORT = int(os.environ.get("APP_PORT", "3000")) START_TIME = time.time() chaos_lock = threading.Lock() chaos_state = {"mode": None, "duration": None, "rate": None} def get_chaos(): with chaos_lock: return dict(chaos_state) # returns a copy — callers can't mutate internal state def set_chaos(state): with chaos_lock: chaos_state.update(state) chaos_lock = threading.Lock() chaos_state = {"mode": None, "duration": None, "rate": None} def get_chaos(): with chaos_lock: return dict(chaos_state) # returns a copy — callers can't mutate internal state def set_chaos(state): with chaos_lock: chaos_state.update(state) chaos_lock = threading.Lock() chaos_state = {"mode": None, "duration": None, "rate": None} def get_chaos(): with chaos_lock: return dict(chaos_state) # returns a copy — callers can't mutate internal state def set_chaos(state): with chaos_lock: chaos_state.update(state) self.send_json(200, { "message": "Welcome to SwiftDeploy API", "mode": MODE, "version": APP_VERSION, "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), }) self.send_json(200, { "message": "Welcome to SwiftDeploy API", "mode": MODE, "version": APP_VERSION, "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), }) self.send_json(200, { "message": "Welcome to SwiftDeploy API", "mode": MODE, "version": APP_VERSION, "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), }) uptime = round(time.time() - START_TIME, 2) self.send_json(200, { "status": "ok", "mode": MODE, "version": APP_VERSION, "uptime_seconds": uptime, }) uptime = round(time.time() - START_TIME, 2) self.send_json(200, { "status": "ok", "mode": MODE, "version": APP_VERSION, "uptime_seconds": uptime, }) uptime = round(time.time() - START_TIME, 2) 
self.send_json(200, { "status": "ok", "mode": MODE, "version": APP_VERSION, "uptime_seconds": uptime, }) if MODE != "canary": self.send_json(403, {"error": "chaos endpoint only available in canary mode"}) return length = int(self.headers.get("Content-Length", 0)) data = json.loads(self.rfile.read(length)) mode = data.get("mode") if mode == "slow": set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None}) elif mode == "error": set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)}) elif mode == "recover": set_chaos({"mode": None, "duration": None, "rate": None}) if MODE != "canary": self.send_json(403, {"error": "chaos endpoint only available in canary mode"}) return length = int(self.headers.get("Content-Length", 0)) data = json.loads(self.rfile.read(length)) mode = data.get("mode") if mode == "slow": set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None}) elif mode == "error": set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)}) elif mode == "recover": set_chaos({"mode": None, "duration": None, "rate": None}) if MODE != "canary": self.send_json(403, {"error": "chaos endpoint only available in canary mode"}) return length = int(self.headers.get("Content-Length", 0)) data = json.loads(self.rfile.read(length)) mode = data.get("mode") if mode == "slow": set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None}) elif mode == "error": set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)}) elif mode == "recover": set_chaos({"mode": None, "duration": None, "rate": None}) FROM python:3.12-alpine RUN addgroup -S appgroup && adduser -S appuser -G appgroup WORKDIR /app COPY app/main.py . 
RUN chown -R appuser:appgroup /app USER appuser ENV MODE=stable ENV APP_VERSION=1.0.0 ENV APP_PORT=3000 EXPOSE 3000 HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \ CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1 CMD ["python", "main.py"] FROM python:3.12-alpine RUN addgroup -S appgroup && adduser -S appuser -G appgroup WORKDIR /app COPY app/main.py . RUN chown -R appuser:appgroup /app USER appuser ENV MODE=stable ENV APP_VERSION=1.0.0 ENV APP_PORT=3000 EXPOSE 3000 HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \ CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1 CMD ["python", "main.py"] FROM python:3.12-alpine RUN addgroup -S appgroup && adduser -S appuser -G appgroup WORKDIR /app COPY app/main.py . RUN chown -R appuser:appgroup /app USER appuser ENV MODE=stable ENV APP_VERSION=1.0.0 ENV APP_PORT=3000 EXPOSE 3000 HEALTHCHECK --interval=10s --timeout=5s --start-period=15s --retries=5 \ CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:3000/healthz', timeout=4)" || exit 1 CMD ["python", "main.py"] upstream app_backend { server app:{{ service_port }}; keepalive 32; } log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request'; server { listen {{ nginx_port }}; proxy_connect_timeout {{ proxy_timeout }}s; proxy_send_timeout {{ proxy_timeout }}s; proxy_read_timeout {{ proxy_timeout }}s; add_header X-Deployed-By swiftdeploy always; proxy_pass_header X-Mode; location @error502 { default_type application/json; return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"{{ contact }}"}'; } } upstream app_backend { server app:{{ service_port }}; keepalive 32; } log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request'; server { listen {{ nginx_port }}; proxy_connect_timeout {{ 
proxy_timeout }}s; proxy_send_timeout {{ proxy_timeout }}s; proxy_read_timeout {{ proxy_timeout }}s; add_header X-Deployed-By swiftdeploy always; proxy_pass_header X-Mode; location @error502 { default_type application/json; return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"{{ contact }}"}'; } } upstream app_backend { server app:{{ service_port }}; keepalive 32; } log_format swiftdeploy '$time_iso8601 | $status | ${request_time}s | $upstream_addr | $request'; server { listen {{ nginx_port }}; proxy_connect_timeout {{ proxy_timeout }}s; proxy_send_timeout {{ proxy_timeout }}s; proxy_read_timeout {{ proxy_timeout }}s; add_header X-Deployed-By swiftdeploy always; proxy_pass_header X-Mode; location @error502 { default_type application/json; return 502 '{"error":"Bad Gateway","code":502,"service":"app","contact":"{{ contact }}"}'; } } app: expose: - "{{ service_port }}" # container-to-container only, NEVER published to host nginx: ports: - "{{ nginx_port }}:{{ nginx_port }}" # only nginx faces the world depends_on: app: condition: service_healthy # nginx waits for app healthcheck to pass app: expose: - "{{ service_port }}" # container-to-container only, NEVER published to host nginx: ports: - "{{ nginx_port }}:{{ nginx_port }}" # only nginx faces the world depends_on: app: condition: service_healthy # nginx waits for app healthcheck to pass app: expose: - "{{ service_port }}" # container-to-container only, NEVER published to host nginx: ports: - "{{ nginx_port }}:{{ nginx_port }}" # only nginx faces the world depends_on: app: condition: service_healthy # nginx waits for app healthcheck to pass def render_template(tmpl_path, context): with open(tmpl_path) as f: content = f.read() for key, val in context.items(): content = content.replace("{{ " + key + " }}", str(val)) return content def render_template(tmpl_path, context): with open(tmpl_path) as f: content = f.read() for key, val in context.items(): content = content.replace("{{ " + key + " }}", 
str(val)) return content def render_template(tmpl_path, context): with open(tmpl_path) as f: content = f.read() for key, val in context.items(): content = content.replace("{{ " + key + " }}", str(val)) return content def cmd_init(): manifest = load_manifest() ctx = build_context(manifest) nginx_conf = render_template(NGINX_TMPL, ctx) compose_conf = render_template(COMPOSE_TMPL, ctx) with open(NGINX_OUT, "w") as f: f.write(nginx_conf) with open(COMPOSE_OUT, "w") as f: f.write(compose_conf) def cmd_init(): manifest = load_manifest() ctx = build_context(manifest) nginx_conf = render_template(NGINX_TMPL, ctx) compose_conf = render_template(COMPOSE_TMPL, ctx) with open(NGINX_OUT, "w") as f: f.write(nginx_conf) with open(COMPOSE_OUT, "w") as f: f.write(compose_conf) def cmd_init(): manifest = load_manifest() ctx = build_context(manifest) nginx_conf = render_template(NGINX_TMPL, ctx) compose_conf = render_template(COMPOSE_TMPL, ctx) with open(NGINX_OUT, "w") as f: f.write(nginx_conf) with open(COMPOSE_OUT, "w") as f: f.write(compose_conf) # Check 1: manifest.yaml exists and is valid YAML # Check 2: all required fields present and non-empty # Check 3: docker image inspect — exits 0 if exists # Check 4: ss -tlnp | grep :8080 — non-empty means port in use # Check 5: nginx -t via isolated Docker container # Check 1: manifest.yaml exists and is valid YAML # Check 2: all required fields present and non-empty # Check 3: docker image inspect — exits 0 if exists # Check 4: ss -tlnp | grep :8080 — non-empty means port in use # Check 5: nginx -t via isolated Docker container # Check 1: manifest.yaml exists and is valid YAML # Check 2: all required fields present and non-empty # Check 3: docker image inspect — exits 0 if exists # Check 4: ss -tlnp | grep :8080 — non-empty means port in use # Check 5: nginx -t via isolated Docker container test_content = data.replace("server app:", "server 127.0.0.1:") test_content = data.replace("server app:", "server 127.0.0.1:") test_content = 
data.replace("server app:", "server 127.0.0.1:") deadline = time.time() + 60 while time.time() < deadline: try: with urllib.request.urlopen(f"http://localhost:{nginx_port}/healthz", timeout=3) as resp: if json.loads(resp.read()).get("status") == "ok": healthy = True break except Exception: pass # container still starting — connection refused is normal time.sleep(2) deadline = time.time() + 60 while time.time() < deadline: try: with urllib.request.urlopen(f"http://localhost:{nginx_port}/healthz", timeout=3) as resp: if json.loads(resp.read()).get("status") == "ok": healthy = True break except Exception: pass # container still starting — connection refused is normal time.sleep(2) deadline = time.time() + 60 while time.time() < deadline: try: with urllib.request.urlopen(f"http://localhost:{nginx_port}/healthz", timeout=3) as resp: if json.loads(resp.read()).get("status") == "ok": healthy = True break except Exception: pass # container still starting — connection refused is normal time.sleep(2) # 1. Update manifest in-place content = re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1) # 2. Regenerate docker-compose.yml with new MODE env var # 3. Restart ONLY the app container — nginx stays up run(compose_cmd("up -d --no-deps app")) # 4. Confirm mode via /healthz # 1. Update manifest in-place content = re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1) # 2. Regenerate docker-compose.yml with new MODE env var # 3. Restart ONLY the app container — nginx stays up run(compose_cmd("up -d --no-deps app")) # 4. Confirm mode via /healthz # 1. Update manifest in-place content = re.sub(r"(mode:\s*)(\S+)", f"\\g<1>{target_mode}", content, count=1) # 2. Regenerate docker-compose.yml with new MODE env var # 3. Restart ONLY the app container — nginx stays up run(compose_cmd("up -d --no-deps app")) # 4. 
Confirm mode via /healthz if MODE == "canary": self.send_header("X-Mode", "canary") # on EVERY response if MODE == "canary": self.send_header("X-Mode", "canary") # on EVERY response if MODE == "canary": self.send_header("X-Mode", "canary") # on EVERY response # Counter: {(method, path, status_code): count} request_counts = {} # Histogram state per path HISTOGRAM_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0] request_durations = {} def record_request(method, path, status_code, duration_seconds): with metrics_lock: key = (method, path, str(status_code)) request_counts[key] = request_counts.get(key, 0) + 1 if path not in request_durations: request_durations[path] = { "buckets": {str(le): 0 for le in HISTOGRAM_BUCKETS}, "sum": 0.0, "count": 0, } hist = request_durations[path] hist["sum"] += duration_seconds hist["count"] += 1 for le in HISTOGRAM_BUCKETS: if duration_seconds <= le: hist["buckets"][str(le)] += 1 # Counter: {(method, path, status_code): count} request_counts = {} # Histogram state per path HISTOGRAM_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0] request_durations = {} def record_request(method, path, status_code, duration_seconds): with metrics_lock: key = (method, path, str(status_code)) request_counts[key] = request_counts.get(key, 0) + 1 if path not in request_durations: request_durations[path] = { "buckets": {str(le): 0 for le in HISTOGRAM_BUCKETS}, "sum": 0.0, "count": 0, } hist = request_durations[path] hist["sum"] += duration_seconds hist["count"] += 1 for le in HISTOGRAM_BUCKETS: if duration_seconds <= le: hist["buckets"][str(le)] += 1 # Counter: {(method, path, status_code): count} request_counts = {} # Histogram state per path HISTOGRAM_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0] request_durations = {} def record_request(method, path, status_code, duration_seconds): with metrics_lock: key = (method, path, str(status_code)) request_counts[key] = 
request_counts.get(key, 0) + 1 if path not in request_durations: request_durations[path] = { "buckets": {str(le): 0 for le in HISTOGRAM_BUCKETS}, "sum": 0.0, "count": 0, } hist = request_durations[path] hist["sum"] += duration_seconds hist["count"] += 1 for le in HISTOGRAM_BUCKETS: if duration_seconds <= le: hist["buckets"][str(le)] += 1 def do_GET(self): start = time.time() path = self.path.split("?")[0] status = self._handle_get() record_request("GET", path, status, time.time() - start) def do_GET(self): start = time.time() path = self.path.split("?")[0] status = self._handle_get() record_request("GET", path, status, time.time() - start) def do_GET(self): start = time.time() path = self.path.split("?")[0] status = self._handle_get() record_request("GET", path, status, time.time() - start) # 1. http_requests_total — counter, labels: method, path, status_code # 2. http_request_duration_seconds — histogram with standard buckets # 3. app_uptime_seconds — gauge # 4. app_mode — gauge: 0=stable, 1=canary # 5. chaos_active — gauge: 0=none, 1=slow, 2=error # 1. http_requests_total — counter, labels: method, path, status_code # 2. http_request_duration_seconds — histogram with standard buckets # 3. app_uptime_seconds — gauge # 4. app_mode — gauge: 0=stable, 1=canary # 5. chaos_active — gauge: 0=none, 1=slow, 2=error # 1. http_requests_total — counter, labels: method, path, status_code # 2. http_request_duration_seconds — histogram with standard buckets # 3. app_uptime_seconds — gauge # 4. app_mode — gauge: 0=stable, 1=canary # 5. 
### The /metrics output

A sample scrape during canary chaos:

```text
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/",status_code="200"} 42
http_requests_total{method="GET",path="/healthz",status_code="200"} 60
http_requests_total{method="GET",path="/",status_code="500"} 38

# HELP http_request_duration_seconds HTTP request latency in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{path="/",le="0.005"} 40
http_request_duration_seconds_bucket{path="/",le="+Inf"} 80
http_request_duration_seconds_sum{path="/"} 0.042381
http_request_duration_seconds_count{path="/"} 80

# HELP app_mode Current deployment mode (0=stable, 1=canary)
# TYPE app_mode gauge
app_mode 1

# HELP chaos_active Current chaos state (0=none, 1=slow, 2=error)
# TYPE chaos_active gauge
chaos_active 2
```

## OPA Policy Engine — The Brain {#opa-policy-engine}

### data.json — thresholds live here, never hardcoded in Rego

```json
{
  "thresholds": {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0,
    "max_error_rate_percent": 1.0,
    "max_p99_latency_ms": 500.0
  }
}
```

### infrastructure.rego — pre-deploy policy

```rego
package swiftdeploy.infrastructure

import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    disk_ok
    cpu_ok
    mem_ok
}

disk_ok if { input.disk_free_gb >= data.thresholds.min_disk_free_gb }
cpu_ok if { input.cpu_load_1m <= data.thresholds.max_cpu_load }
mem_ok if { input.mem_free_percent >= data.thresholds.min_mem_free_percent }

reasons contains msg if {
    not disk_ok
    msg := sprintf(
        "disk_free_gb is %.1f, minimum required is %.1f",
        [input.disk_free_gb, data.thresholds.min_disk_free_gb]
    )
}

# decision is what the CLI reads — never a bare boolean
decision := {
    "allow": allow,
    "reasons": reasons,
    "domain": "infrastructure",
    "checked": {
        "disk_free_gb": input.disk_free_gb,
        "cpu_load_1m": input.cpu_load_1m,
        "mem_free_percent": input.mem_free_percent,
    },
}
```
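For intuition, the same pre-deploy rules can be mirrored in plain Python — this is a sketch for reasoning about threshold values, not part of SwiftDeploy, and `infrastructure_decision` is a hypothetical name:

```python
THRESHOLDS = {
    "min_disk_free_gb": 10.0,
    "max_cpu_load": 2.0,
    "min_mem_free_percent": 10.0,
}

def infrastructure_decision(inp, thresholds=THRESHOLDS):
    """Mirror of the Rego rules: allow only when all three checks pass."""
    reasons = []
    if inp["disk_free_gb"] < thresholds["min_disk_free_gb"]:
        reasons.append(
            f"disk_free_gb is {inp['disk_free_gb']:.1f}, "
            f"minimum required is {thresholds['min_disk_free_gb']:.1f}"
        )
    if inp["cpu_load_1m"] > thresholds["max_cpu_load"]:
        reasons.append("cpu_load_1m too high")
    if inp["mem_free_percent"] < thresholds["min_mem_free_percent"]:
        reasons.append("mem_free_percent too low")
    return {"allow": not reasons, "reasons": reasons, "domain": "infrastructure"}

print(infrastructure_decision(
    {"disk_free_gb": 8.2, "cpu_load_1m": 0.5, "mem_free_percent": 40.0}
))
```

The payoff of keeping the real rules in Rego instead is that the thresholds in `data.json` can change without redeploying the CLI.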
### canary.rego — pre-promote policy

```rego
package swiftdeploy.canary

import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    error_rate_ok
    latency_ok
}

error_rate_ok if { input.error_rate_percent <= data.thresholds.max_error_rate_percent }
latency_ok if { input.p99_latency_ms <= data.thresholds.max_p99_latency_ms }

reasons contains msg if {
    not error_rate_ok
    msg := sprintf(
        "error_rate is %.2f%%, maximum allowed is %.2f%%",
        [input.error_rate_percent, data.thresholds.max_error_rate_percent]
    )
}

decision := {
    "allow": allow,
    "reasons": reasons,
    "domain": "canary",
    "checked": {
        "error_rate_percent": input.error_rate_percent,
        "p99_latency_ms": input.p99_latency_ms,
        "window_seconds": input.window_seconds,
    },
}
```

### OPA isolation — no leakage via nginx

OPA binds to localhost only in the generated compose file, so it is never reachable from outside the host:

```yaml
opa:
  ports:
    - "127.0.0.1:{{ opa_port }}:8181"  # localhost only — never 0.0.0.0
```
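Policy packages map onto OPA's Data API by replacing dots with slashes and appending the document name. A quick sketch of that mapping, assuming OPA listens on the localhost binding above (`decision_url` is an illustrative helper, not SwiftDeploy's):

```python
def decision_url(base_url, package):
    """Build the OPA Data API URL for a package's `decision` document."""
    return f"{base_url}/v1/data/{package.replace('.', '/')}/decision"

print(decision_url("http://127.0.0.1:8181", "swiftdeploy.infrastructure"))
# → http://127.0.0.1:8181/v1/data/swiftdeploy/infrastructure/decision
```

Querying `/decision` rather than `/allow` is what lets the CLI read the reasons and checked values alongside the boolean.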
### The policy query function

```python
def query_opa(manifest, package, input_data):
    url = f"{opa_url(manifest)}/v1/data/{package.replace('.', '/')}/decision"
    payload = json.dumps({"input": input_data}).encode()
    try:
        req = urllib.request.Request(
            url,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            body = json.loads(resp.read())
        result = body.get("result")
        if result is None:
            return None, "OPA returned empty result — check policy package name"
        return result, None
    except urllib.error.URLError as e:
        return None, f"OPA unreachable: {e.reason}"
    except Exception as e:
        return None, f"OPA query failed: {e}"
```

## Gated Lifecycle — The CLI Brain {#gated-lifecycle}

### Pre-deploy check

```python
def cmd_deploy():
    manifest = load_manifest()
    host_stats = get_host_stats()

    # Collect host stats
    disk_free_gb = shutil.disk_usage("/").free / (1024 ** 3)
    cpu_load_1m = float(open("/proc/loadavg").read().split()[0])
    mem_free_percent = (meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100

    # Send to OPA
    allowed = enforce_policy(manifest, "swiftdeploy.infrastructure",
                             host_stats, "infrastructure")
    if not allowed:
        append_history({"event": "deploy_blocked",
                        "reason": "infrastructure_policy"})
        sys.exit(1)

    # Only reach here if OPA allows
    run(compose_cmd("up -d --build"))
```

When the host fails a check, deploy exits before touching Docker:

```text
✘ Policy [infrastructure] DENIED
  ! disk_free_gb is 8.2, minimum required is 10.0
Deployment blocked by policy: infrastructure
```
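The `mem_free_percent` line above depends on a parsed `/proc/meminfo`; `get_host_stats` isn't shown in full, so this parser is an assumption about its shape, exercised here on a sample string rather than the live file:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into an int dict (kB)."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info

sample = """MemTotal:       16384000 kB
MemFree:         2048000 kB
MemAvailable:    8192000 kB"""
meminfo = parse_meminfo(sample)
print((meminfo["MemAvailable"] / meminfo["MemTotal"]) * 100)  # → 50.0
```

`MemAvailable` is the right numerator here: unlike `MemFree`, it accounts for reclaimable cache, so the policy doesn't deny deploys on a host that is merely caching aggressively.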
### Pre-promote check

```python
def cmd_promote(target_mode):
    if target_mode == "canary":
        raw = scrape_metrics(nginx_port)
        metrics = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms = calculate_p99_latency_ms(metrics)

        allowed = enforce_policy(
            manifest,
            "swiftdeploy.canary",
            {"error_rate_percent": error_rate,
             "p99_latency_ms": p99_ms,
             "window_seconds": 30},
            "canary safety",
        )
        if not allowed:
            sys.exit(1)
```

### P99 latency calculation from histogram

```python
def calculate_p99_latency_ms(metrics, path_filter=None):
    buckets = {}
    total_count = 0
    for entry in metrics.get("http_request_duration_seconds_bucket", []):
        le = entry["labels"].get("le", "")
        if le == "+Inf":
            total_count = max(total_count, entry["value"])
            continue
        buckets[float(le)] = buckets.get(float(le), 0) + entry["value"]
    if total_count == 0:
        return 0.0
    p99_threshold = total_count * 0.99
    for le in sorted(buckets.keys()):
        if buckets[le] >= p99_threshold:
            return round(le * 1000, 2)  # seconds → milliseconds
    return 10000.0
```
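A quick worked example makes the cumulative-bucket walk concrete — the function reproduced as-is, fed a synthetic scrape where 99 of 100 requests completed within 0.5s, so the first bucket containing at least 99 observations is `le="0.5"`:

```python
def calculate_p99_latency_ms(metrics, path_filter=None):
    buckets = {}
    total_count = 0
    for entry in metrics.get("http_request_duration_seconds_bucket", []):
        le = entry["labels"].get("le", "")
        if le == "+Inf":
            total_count = max(total_count, entry["value"])
            continue
        buckets[float(le)] = buckets.get(float(le), 0) + entry["value"]
    if total_count == 0:
        return 0.0
    p99_threshold = total_count * 0.99
    # Buckets are cumulative, so the first one at/over the threshold
    # is an upper bound on the p99 latency.
    for le in sorted(buckets.keys()):
        if buckets[le] >= p99_threshold:
            return round(le * 1000, 2)  # seconds → milliseconds
    return 10000.0

metrics = {"http_request_duration_seconds_bucket": [
    {"labels": {"le": "0.1"}, "value": 90},
    {"labels": {"le": "0.5"}, "value": 99},
    {"labels": {"le": "+Inf"}, "value": 100},
]}
print(calculate_p99_latency_ms(metrics))  # → 500.0
```

Note the result is the bucket's upper bound, not an interpolated value — coarse, but plenty for a pass/fail gate against a 500ms threshold.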
## The Status Dashboard — The Eyes {#status-dashboard}

```python
def cmd_status():
    while True:
        raw = scrape_metrics(nginx_port)
        metrics = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms = calculate_p99_latency_ms(metrics)

        # Query OPA for live compliance
        infra_dec, _ = query_opa(manifest, "swiftdeploy.infrastructure",
                                 get_host_stats())
        canary_dec, _ = query_opa(manifest, "swiftdeploy.canary",
                                  {"error_rate_percent": error_rate,
                                   "p99_latency_ms": p99_ms,
                                   "window_seconds": 30})

        os.system("clear")
        # ... render dashboard ...

        append_history({
            "event": "status_scrape",
            "error_rate_percent": error_rate,
            "p99_latency_ms": p99_ms,
            "mode": mode_str,
            "chaos": chaos_str,
            "policy_infra_pass": infra_dec.get("allow") if infra_dec else None,
            "policy_canary_pass": canary_dec.get("allow") if canary_dec else None,
        })
        time.sleep(5)
```

```text
SwiftDeploy Status Dashboard            2026-05-05T21:00:37Z
────────────────────────────────────────────────
── Throughput ──────────────────────────────────
  req/s       : 2.4
  error rate  : 56.45%   ← red
  P99 latency : 5.0ms
── App State ───────────────────────────────────
  mode   : canary
  chaos  : error         ← red
  uptime : 316s
── Policy Compliance ───────────────────────────
  ✔ infrastructure  PASS
  ✘ canary          FAIL
    ! error_rate is 56.45%, maximum allowed is 1.00%

Refreshing every 5s — Ctrl+C to exit
```

## The Audit Trail — The Memory {#audit-trail}

Every event lands in an append-only JSONL history:

```json
{"timestamp":"2026-05-05T20:34:51Z","event":"deploy","mode":"stable"}
{"timestamp":"2026-05-05T20:55:22Z","event":"promote","target_mode":"canary"}
{"timestamp":"2026-05-05T20:55:23Z","event":"status_scrape","error_rate_percent":62.5,"chaos":"error","policy_canary_pass":false}
{"timestamp":"2026-05-05T21:01:01Z","event":"promote","target_mode":"stable"}
```
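`append_history` itself isn't shown above, so here is a minimal sketch of an append-only JSONL writer consistent with those entries — the timestamp format matches the samples, but the function body and the file path are assumptions:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def append_history(event, history_file):
    """Append one timestamped event as a single JSON line (JSONL)."""
    entry = {"timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")}
    entry.update(event)
    with open(history_file, "a") as f:
        f.write(json.dumps(entry, separators=(",", ":")) + "\n")

# Demo against a throwaway file
demo_path = os.path.join(tempfile.mkdtemp(), "history.jsonl")
append_history({"event": "deploy", "mode": "stable"}, demo_path)
append_history({"event": "promote", "target_mode": "canary"}, demo_path)

with open(demo_path) as f:
    events = [json.loads(line) for line in f]
print(events[1]["target_mode"])  # → canary
```

Append-only one-line-per-event is the whole trick: the file is trivially greppable, crash-safe (a partial last line is the worst case), and any later report can be derived by replaying it.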

The generated audit report summarises the history into two tables.

**Mode Changes**

| Timestamp | From | To |
|-----------|------|----|
| 2026-05-05T20:34:51Z | unknown | stable |
| 2026-05-05T20:55:22Z | stable | canary |
| 2026-05-05T21:01:01Z | canary | stable |

**Policy Violations**

| Timestamp | Infrastructure | Canary | Error Rate | P99 |
|-----------|----------------|--------|------------|-----|
| 2026-05-05T20:55:23Z | ✔ PASS | ✘ FAIL | 62.5% | 5.0ms |
| 2026-05-05T20:55:28Z | ✔ PASS | ✘ FAIL | 63.6% | 5.0ms |
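The Mode Changes table falls out of replaying deploy/promote events from the history; a hedged sketch of that derivation (`mode_change_rows` is an illustrative helper, not SwiftDeploy's actual audit code):

```python
def mode_change_rows(events):
    """Derive (timestamp, from, to) rows from deploy/promote history events."""
    rows, current = [], "unknown"
    for e in events:
        if e.get("event") == "deploy":
            target = e.get("mode", "stable")
        elif e.get("event") == "promote":
            target = e["target_mode"]
        else:
            continue  # status_scrape etc. don't change the mode
        rows.append((e["timestamp"], current, target))
        current = target
    return rows

history = [
    {"timestamp": "2026-05-05T20:34:51Z", "event": "deploy", "mode": "stable"},
    {"timestamp": "2026-05-05T20:55:22Z", "event": "promote", "target_mode": "canary"},
    {"timestamp": "2026-05-05T21:01:01Z", "event": "promote", "target_mode": "stable"},
]
for ts, frm, to in mode_change_rows(history):
    print(f"| {ts} | {frm} | {to} |")
```

The first row's `from` column is `unknown` precisely because nothing in the history precedes the first deploy — the same value the sample report shows.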

## The Debugging Sagas {#debugging}

### Saga 1 — Six layers of healthcheck failure

Inside the container everything looked fine — the server was listening and `/healthz` answered via `docker exec`, yet the healthcheck kept failing:

```bash
docker exec swiftdeploy-app ss -tlnp
# tcp LISTEN 0.0.0.0:3000

docker exec swiftdeploy-app python -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:3000/healthz').read())"
# b'{"status": "ok", ...}'  ← works via exec, not via healthcheck
```

### Saga 2 — OPA Rego v1 syntax

```rego
# OLD — crashes on latest OPA
allow { disk_ok }

reasons[msg] {
    not disk_ok
    msg := "..."
}
```

```rego
# NEW — Rego v1 required syntax
allow if { disk_ok }

reasons contains msg if {
    not disk_ok
    msg := "..."
}
```

### Saga 3 — WSL2 path spaces breaking docker run

A space in the WSL2 project path made Docker misparse the arguments:

```text
docker: invalid reference format: repository name (Desktop/HNG/hng-swiftdeploy/nginx.conf) must be lowercase
```

## Full Deployment Walkthrough {#deployment}

```bash
# 1. Build the image
docker build -t swift-deploy-1-node:latest .

# 2. Validate pre-flight checks
./swiftdeploy validate

# 3. Deploy (OPA policy check runs first)
./swiftdeploy deploy

# 4. Verify metrics
curl http://localhost:8080/metrics

# 5. Verify OPA isolation
curl http://127.0.0.1:8181/health   # works — internal
curl http://localhost:8080/v1/data  # 404 — nginx blocks it

# 6. Launch status dashboard
./swiftdeploy status

# 7. Promote to canary (OPA canary policy check runs first)
./swiftdeploy promote canary

# 8. Inject chaos
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.5}'

# 9. Watch the status dashboard catch it — canary policy FAIL visible in real time

# 10. Recover
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

# 11. Promote back to stable
./swiftdeploy promote stable

# 12. Generate audit report
./swiftdeploy audit
cat audit_report.md

# 13. Teardown
./swiftdeploy teardown --clean
```

## Key Lessons Learned {#lessons}

The manual work SwiftDeploy eliminates:

- Write an Nginx config
- Write a Docker Compose file
- Run Docker commands
- Check if things are healthy
- Hope nobody deploys when the disk is full
- Hope nobody promotes a canary that's throwing 60% errors

The three chaos modes:

- slow — injects `time.sleep(N)` before responding, simulating a slow upstream
- error — uses `random.random() < rate` to return 500 on a configurable percentage of requests
- recover — clears all chaos state, returning to normal behaviour

What the generated nginx config adds:

- Custom log format — ISO timestamp, status code, response time, upstream IP, full request on one line
- JSON error bodies — APIs need machine-readable errors, not nginx HTML pages
- `proxy_pass_header X-Mode` — nginx strips custom headers by default; this forwards the canary header through to clients
- `keepalive 32` — maintains 32 persistent connections to the upstream, reducing connection overhead

What canary mode changes:

- Adds `X-Mode: canary` to every response — callers can identify which mode they're hitting
- Unlocks `/chaos` — lets you simulate slow responses, random errors, then recover
- You promote with `./swiftdeploy promote canary`, stress test, then `./swiftdeploy promote stable` to roll back

The lessons themselves:

- Declarative infrastructure — describe what you want, generate everything else
- Immutable configs — generated files are outputs, never inputs
- Policy as code — OPA enforces safety standards that can't be bypassed
- Observability — Prometheus metrics feed the dashboard and the policy engine
- Audit trail — every event recorded, every violation surfaced