# SwiftDeploy: Building an Observable, Policy-Driven Deployment Engine with OPA
## Introduction

As part of the HNG Internship DevOps Track Stage 4B, I extended my Stage 4A project, SwiftDeploy, into a fully observable, policy-aware deployment platform.

In Stage 4A, SwiftDeploy could:

- generate infrastructure files from a declarative manifest
- deploy containers using Docker Compose
- manage deployment modes (stable/canary)
- configure Nginx automatically

Stage 4B transformed it into something much closer to a real production deployment system by adding:

- Prometheus instrumentation
- Open Policy Agent (OPA) policy enforcement
- live operational dashboards
- deployment safety gates
- audit logging and reporting
- chaos engineering validation

The result is a deployment tool that not only deploys services, but also decides whether deployments are safe enough to proceed.

## The Core Philosophy: One Manifest, Everything Else Generated

SwiftDeploy is built around a single principle: `manifest.yaml` is the only file you should ever edit manually. Everything else is generated from it.

Here is the manifest structure:

```yaml
services:
  name: app
  image: swift-deploy-1-node:latest
  port: 3000
  version: "1.0.0"
  mode: stable

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

network:
  name: swiftdeploy-net
  driver_type: bridge
```

From this manifest, the CLI generates:

- generated/nginx.conf
- generated/docker-compose.yml
- OPA runtime configuration

This design provides:

- consistency
- reproducibility
- environment portability
- infrastructure-as-code discipline

The grader can delete all generated files and rerun:

```bash
./swiftdeploy init
```

and the entire stack regenerates correctly.

## Architecture Overview

The system architecture consists of four major components (the Nginx reverse proxy, the Flask API service, the SwiftDeploy CLI, and the OPA policy engine), connected as follows:

```
User
  ↓
Nginx Reverse Proxy
  ↓
Flask API Service
  ↓
Prometheus Metrics
  ↓
SwiftDeploy CLI
  ↓
OPA Policy Engine
```

The deployment stack includes:

- Flask application container
- Nginx reverse proxy
- Open Policy Agent (OPA)
- internal Docker network
- named log volumes

## The SwiftDeploy CLI

The heart of the project is the `swiftdeploy` executable. It is a Python-based CLI tool that manages the entire deployment lifecycle.

### Supported Commands

| Command | Purpose |
|---------|---------|
| `init` | Generate config files from templates |
| `validate` | Run pre-flight validation checks |
| `deploy` | Start the stack |
| `promote canary` | Switch deployment into canary mode |
| `promote stable` | Return deployment to stable mode |
| `status` | Live metrics dashboard |
| `audit` | Generate audit report |
| `teardown` | Destroy containers and networks |

## The API Service

The API service is a Flask application that supports both stable and canary deployment modes. The deployment mode is controlled through the `MODE` environment variable.

### Endpoints

#### Root Endpoint

`GET /`

Returns the welcome message, deployment mode, and version:

```json
{
  "message": "Welcome to SwiftDeploy",
  "mode": "stable",
  "version": "1.0.0"
}
```

#### Health Endpoint

`GET /healthz`

Returns:

- health status
- application uptime

#### Chaos Endpoint

`POST /chaos`

Available only in canary mode. Supports:

```json
{ "mode": "slow", "duration": 3 }
{ "mode": "error", "rate": 0.5 }
{ "mode": "recover" }
```

This endpoint was used to simulate:

- degraded latency
- random failures
- recovery workflows

## Instrumentation: The /metrics Endpoint

One of the biggest upgrades in Stage 4B was observability. I instrumented the Flask service using the `prometheus_client` library. The service now exposes:

`GET /metrics`

in Prometheus text format.

### Metrics Collected

#### Request Throughput

`http_requests_total`

Labels:

- method
- path
- status_code

Example:

```
http_requests_total{method="GET",path="/",status_code="200"} 152
```

#### Request Latency

`http_request_duration_seconds`

Histogram used for:

- latency analysis
- P99 calculation

#### Uptime

`app_uptime_seconds`

Tracks process uptime.

#### Deployment Mode

`app_mode`

Reports the current deployment mode (stable or canary).

#### Chaos State

`chaos_active`

Reports whether chaos injection is currently active.
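The instrumentation code itself isn't reproduced in this post, so here is a minimal sketch of how metrics like these are typically wired into a Flask app with `prometheus_client`. The metric names match the ones listed above; the hook-based timing and the `/metrics` route shown are illustrative assumptions, not SwiftDeploy's actual implementation.

```python
# Minimal sketch: expose Prometheus metrics from a Flask service.
import time

from flask import Flask, Response, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest

app = Flask(__name__)
START_TIME = time.time()

# Metric names mirror the ones described in this post.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "path", "status_code"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["method", "path"],
)
UPTIME = Gauge("app_uptime_seconds", "Process uptime in seconds")


@app.before_request
def start_timer():
    # Stash the start time on the request so after_request can compute latency.
    request.start_time = time.time()


@app.after_request
def record_metrics(response):
    elapsed = time.time() - request.start_time
    LATENCY.labels(request.method, request.path).observe(elapsed)
    REQUESTS.labels(request.method, request.path, str(response.status_code)).inc()
    return response


@app.route("/metrics")
def metrics():
    UPTIME.set(time.time() - START_TIME)  # refresh uptime on every scrape
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)
```

With this in place, anything that can read the Prometheus text format (including the SwiftDeploy CLI's status and promotion checks) can scrape `GET /metrics` and compute error rates and latency percentiles.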
## Why Metrics Matter

Without metrics:

- deployments are blind
- failures become invisible
- canary safety cannot be enforced

Metrics became the foundation for:

- policy decisions
- promotion safety

## Open Policy Agent (OPA): The Brain of SwiftDeploy

The most important design principle in Stage 4B was: the CLI must never make allow/deny decisions itself. All decision-making lives entirely inside OPA.

SwiftDeploy only:

- gathers data
- sends context to OPA
- acts on the response

This separation keeps the system maintainable.

## OPA Policy Domains

I separated policies into independent domains. Each policy:

- answers one question
- owns its own logic
- operates independently

### Infrastructure Policy

Runs before deployment. Blocks deployment when:

- disk free space is below 10 GB
- CPU load exceeds 2.0

Rego example:

```rego
package infra

default allow = false

allow {
    input.disk_free_gb >= data.thresholds.disk_free_gb
    input.cpu_load <= data.thresholds.cpu_load
}
```

### Canary Safety Policy

Runs before promotion. Blocks promotion when:

- error rate exceeds 1%
- P99 latency exceeds 500 ms

Rego example:

```rego
package canary

default allow = false

allow {
    input.error_rate <= data.thresholds.error_rate
    input.p99_latency_ms <= data.thresholds.p99_latency_ms
}
```

### Policy Thresholds

Thresholds are stored separately in `policies/data.json`:

```json
{
  "thresholds": {
    "disk_free_gb": 10,
    "cpu_load": 2.0,
    "error_rate": 0.01,
    "p99_latency_ms": 500
  }
}
```

This separation avoids:

- hardcoded values
- duplicated configuration
- policy coupling

### OPA Isolation

The OPA container runs on an internal Docker network. It is intentionally NOT exposed through Nginx. Only the CLI can access OPA directly via:

`http://localhost:8181`

This prevents external users from:

- querying policies
- bypassing deployment logic
- inspecting internal rules

This mirrors real production security architecture.

## Pre-Deploy Policy Enforcement

Before deployment, SwiftDeploy collects:

- available disk space
- CPU load

Example payload:

```json
{
  "disk_free_gb": 8.5,
  "cpu_load": 2.4
}
```

OPA evaluates the payload. With only 8.5 GB free and a load of 2.4, both thresholds are violated:

```
Deployment blocked: Infrastructure policy violation
```

The deployment never proceeds.

## Canary Safety Enforcement

Before promotion, SwiftDeploy:

- scrapes /metrics
- calculates error rate
- calculates P99 latency
- submits metrics to OPA

If the canary is unhealthy:

- promotion is blocked
- rollout is prevented

This introduces production-grade deployment safety.
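The CLI-to-OPA exchange is easy to picture with a short example. OPA's standard Data API accepts a POST to `/v1/data/<package>/<rule>` with an `input` document and returns a `result` field. The sketch below is illustrative rather than SwiftDeploy's actual client code: the function name is made up, and failing closed when OPA is unreachable is an assumption consistent with the graceful error handling described later in this post.

```python
# Illustrative sketch of a pre-promotion query against the canary policy.
import requests

# OPA is reachable from the CLI only, on the internal address described above.
OPA_URL = "http://localhost:8181/v1/data/canary/allow"


def canary_promotion_allowed(error_rate: float, p99_latency_ms: float) -> bool:
    """Ask OPA's canary policy whether a promotion should proceed."""
    payload = {"input": {"error_rate": error_rate, "p99_latency_ms": p99_latency_ms}}
    try:
        resp = requests.post(OPA_URL, json=payload, timeout=5)
        resp.raise_for_status()
        result = resp.json().get("result", False)
    except (requests.RequestException, ValueError):
        # Assumed fail-closed behaviour: if OPA is down or the response is
        # malformed, treat the promotion as denied instead of crashing.
        return False
    return result is True


if __name__ == "__main__":
    # The chaos-injected canary shown in the dashboard below (52% errors,
    # 430 ms P99) is denied: error_rate far exceeds the 1% threshold.
    print(canary_promotion_allowed(error_rate=0.52, p99_latency_ms=430))
```

Because the thresholds live in `policies/data.json` and the decision lives in Rego, this client code never needs to change when the safety criteria do.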
## The Status Dashboard

The `status` command provides a live operational dashboard:

```bash
./swiftdeploy status
```

The dashboard:

- refreshes continuously
- scrapes live metrics
- calculates request rate
- calculates P99 latency
- evaluates policy compliance
- appends results to history.jsonl

Example output:

```
SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error Rate: 52%
P99 Latency: 430ms

Policy Compliance:
✓ Infrastructure policy: PASSING
✗ Canary safety policy: FAILING
```

## Chaos Engineering

This was one of the most interesting parts of the project. I intentionally injected:

- high error rates
- slow responses

Example:

```bash
curl -X POST http://localhost:8080/chaos -d '{"mode":"error","rate":0.9}'
```

As a result:

- metrics reflected failures
- policies began failing
- promotions were blocked

This validated that:

- metrics were accurate
- policies were functional
- safety gates worked correctly

Every status scrape and policy violation is appended to `history.jsonl`. Example entry:

```json
{ "timestamp": "2026-05-06T12:00:00", "mode": "canary", "error_rate": 0.52 }
```

## Audit Report Generation

Running:

```bash
./swiftdeploy audit
```

generates `audit_report.md`, which captures:

- deployment timeline
- mode changes
- chaos injections
- policy violations

Example:

| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50% |

## Challenges Faced

### a. Python Virtual Environment Issues

Ubuntu's externally-managed Python environment caused repeated package installation failures. The fix involved:

- recreating the virtual environment
- installing dependencies inside the venv only

### b. Nginx Validation Problems

Generated Nginx configs initially failed validation due to unresolved upstream references. The solution was to:

- validate only inside the container context
- avoid host-side upstream resolution

### c. Metrics Parsing

Calculating P99 latency from the Prometheus text format required careful parsing and aggregation.

### d. OPA Failure Handling

The CLI had to gracefully handle:

- OPA downtime
- connection failures
- malformed responses

The system never crashes when OPA becomes unavailable.

## Lessons Learned

### Declarative Systems Scale Better

A single source of truth drastically reduces configuration drift.

### Observability Is Mandatory

Without metrics:

- policy enforcement becomes impossible
- deployments become blind

### Policy Engines Should Be Isolated

Keeping OPA internal-only mirrors real enterprise architectures.

### Chaos Engineering Builds Confidence

Breaking the system intentionally proved that:

- metrics were accurate
- policies were effective
- safety mechanisms worked

### Automation Must Be Explainable

Every policy response included human-readable reasoning. This made debugging and operational decisions much easier.

## Final Thoughts

Stage 4B transformed SwiftDeploy from a deployment generator into a lightweight deployment platform with:

- observability
- deployment safety

The project demonstrated how:

- policy engines
- infrastructure generation
- deployment orchestration

can work together to create reliable deployment systems.

Most importantly, it reinforced a key DevOps principle:

**Safe automation is more valuable than fast automation.**