Tools: Report: I Built a Production-Grade DevSecOps Platform From Scratch — Here's Every Decision I Made

The Goal

Phase 1 — DevSecOps CI Pipeline

The Dockerfile

Phase 2 — Infrastructure as Code with Terraform

Phase 3 — Automated Deployment

Phase 4 — Kubernetes + GitOps with ArgoCD

The GitOps Loop

Phase 5 — Full Observability Stack

AlertManager Rules

What I'd Do Differently

The Full Stack at a Glance

Most DevOps tutorials show you how to push a Docker image to DockerHub and call it a day. This is not that post. I spent weeks building a platform that mirrors what actually runs inside companies like Stripe, Notion, or Cloudflare: automated security gates, infrastructure as code, self-healing Kubernetes deployments, and a full observability stack that pages you on Slack at 3am. Every decision was deliberate. Every tool earns its place. Here's the whole thing, phase by phase.

The Goal

The challenge I set myself: build a platform that meets the five requirements listed at the end of this post. The app itself is intentionally boring: a Flask API with three endpoints. The infrastructure is the point.

Phase 1: DevSecOps CI Pipeline

Security as an afterthought is how you end up on HaveIBeenPwned. I baked it into the pipeline from day one. Every push to main triggers four sequential steps before a single byte gets deployed, three of them security scans. TruffleHog scans every commit diff for leaked API keys, tokens, and passwords, not just by matching regex patterns but by verifying candidates against live services. Safety audits Python dependencies against a database of known vulnerabilities. Trivy scans the locally built container image for OS-level vulnerabilities. The pipeline only continues to build-and-push if all three scans pass. Security is a gate, not a suggestion.

The Dockerfile

Multi-stage builds are non-negotiable in production. The builder stage installs dependencies; the final image copies only the installed packages and their entry-point scripts, not compilers or other build tools that expand the attack surface. Running as uid 10001 means that if the container is ever compromised, the attacker lands in an unprivileged user account, not root. Non-root containers are a common requirement in enterprise container security audits. The result: an image that's roughly 60% smaller than a naive single-stage build, with significantly fewer Trivy findings.

Phase 2: Infrastructure as Code with Terraform

The rule: if it can't be terraform apply'd, it doesn't exist. I provisioned the full AWS environment (VPC, subnets, security groups, EC2, S3, IAM roles, and EKS) in code. No manual console clicks, ever.
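As a quick sanity check on the network layout, the CIDR blocks from the Terraform snippet (10.0.0.0/16 for the VPC, 10.0.1.0/24 and 10.0.2.0/24 for the two subnets) can be validated with Python's standard ipaddress module. This is only an illustrative sketch and not part of the original repo; the function name check_layout is hypothetical.

```python
import ipaddress

def check_layout(vpc_cidr, subnet_cidrs):
    """Return a list of problems found in a VPC/subnet CIDR layout.

    check_layout is a hypothetical helper, not something from the repo.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []

    # Every subnet must sit inside the VPC range.
    for name, net in subnets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")

    # No two subnets may share addresses.
    names = sorted(subnets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if subnets[a].overlaps(subnets[b]):
                problems.append(f"{a} and {b} overlap")
    return problems

problems = check_layout(
    "10.0.0.0/16",
    {"public": "10.0.1.0/24", "private": "10.0.2.0/24"},
)
print(problems or "layout OK")
```

Both /24 blocks fall inside the /16 and do not overlap, so the list of problems comes back empty.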
A few decisions worth explaining:

Why RDS gets a private subnet. The database should never be reachable from the internet, only from within the VPC. This is enforced at the network layer, not just via security groups.

Why I generate the EC2 SSH key via Terraform. No manual key generation, no keys sitting in someone's Downloads folder. The private key is a Terraform output marked sensitive = true, so it exists in state, not in source control.

Why S3 for Terraform state. Local .tfstate files go out of sync between teammates and are catastrophic to lose. S3 with versioning means state is always current and recoverable.

The payoff: terraform apply brings up the entire environment in about 15 minutes. terraform destroy tears it down and stops the billing. Reproducible, auditable, version-controlled infrastructure.

Phase 3: Automated Deployment

The pipeline pushes two tags on every successful build: latest and the exact git SHA. Why both? latest is for convenience. The SHA tag is for precision: you can roll back to any exact commit with a single command. This matters when you're debugging a production incident at midnight and need to know exactly what's running.

Phase 4: Kubernetes + GitOps with ArgoCD

This is where it gets interesting. The EKS cluster runs the app via a Helm chart. The chart manages replicas, resource limits, health probes, and autoscaling. The HPA scaling target is 70%, not 90%. At 90% you're already overwhelmed, because new pods take time to start and warm up. 70% gives the cluster headroom to scale before traffic saturates the existing pods.

The GitOps Loop

Here's the part that makes this different from "deploy via kubectl". When GitHub Actions updates the image tag in values.yaml and pushes the commit, ArgoCD runs the four-step loop listed at the end of this post. Git is the single source of truth. The cluster is a reflection of the repo, not an independent entity that drifts over time.

Phase 5: Full Observability Stack

You cannot operate what you cannot observe. The Flask app exposes custom Prometheus metrics at /metrics. A ServiceMonitor tells Prometheus to scrape the endpoint every 15 seconds.
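To make the 15-second scrape interval concrete: a counter such as app_requests_total only ever increases, and a query like rate(...[5m]) turns the samples inside the window into a per-second rate. The sketch below is a simplified stand-in for that calculation. It assumes no counter resets and ignores the window-boundary extrapolation that Prometheus applies; simple_rate and the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplification: assumes the counter never resets inside the window and
    skips the boundary extrapolation that Prometheus's rate() performs.
    With fewer than two samples Prometheus returns no data; this sketch
    returns 0.0 to stay simple.
    """
    if len(samples) < 2:
        return 0.0
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)

# Hypothetical 5xx counter scraped every 15 s across a 5-minute window
# (21 samples), growing by 3 errors per scrape.
window = [(t, 3 * (t // 15)) for t in range(0, 301, 15)]
errors_per_second = simple_rate(window)

# 60 errors over 300 seconds is 0.2 errors/s, above a 0.1 threshold.
print(errors_per_second, errors_per_second > 0.1)
```

Note that a threshold such as the 0.1 in the HighErrorRate rule is an absolute number of errors per second for each label combination, not a percentage of total traffic.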
From there, four Grafana panels give full visibility.

AlertManager Rules

Four alerts fire to a Slack #devops-alerts channel. The for: 2m duration on the error-rate alert prevents false positives from a momentary spike. The alert only fires if the condition holds for two consecutive minutes: sustained degradation, not noise.

What I'd Do Differently

A few things I'd change building this again:

Multi-environment from the start. One Terraform workspace and one ArgoCD app works fine for learning, but the first thing you'd add in a real org is separate staging and prod environments with promotion gates between them.

Spot instances on the node group. The EKS worker nodes run on t3.small on-demand. Mixing in Spot instances with appropriate interruption handling would cut the compute cost by 60-70%.

OpenTelemetry instead of manual instrumentation. Hand-instrumenting the Flask app with Prometheus counters and histograms works, but OpenTelemetry gives you traces, metrics, and logs through a single SDK, and it's vendor-neutral.

The README walks through prerequisites, getting started locally with Docker Compose, provisioning the full cloud stack, and connecting to the ArgoCD and Grafana dashboards.

If any of this is useful or you're building something similar, drop a comment. I'm particularly interested in talking to people who've taken GitOps patterns further: multi-cluster setups, progressive delivery with Flagger, that kind of thing.
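A worked example for the 70% HPA target from Phase 4: the Horizontal Pod Autoscaler's documented core rule is desired = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below applies that rule with the minReplicas and maxReplicas from values.yaml. It leaves out the controller's tolerance band and stabilization windows, and the function name desired_replicas is hypothetical.

```python
import math

def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=2, max_replicas=5):
    """Core HPA rule: ceil(current * currentMetric / target), then clamp.

    CPU percentages are utilization relative to the pods' CPU requests.
    Tolerance and stabilization behaviour are deliberately left out.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))

# Two pods averaging 80% of their CPU request:
print(desired_replicas(2, 80, 70))  # target 70: scales out to 3
print(desired_replicas(2, 80, 90))  # target 90: stays at 2
```

With a 70% target the third pod is requested while the existing two still have headroom; with a 90% target the same load triggers nothing.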

Phase 1, CI security-scan job:

    jobs:
      security-scan:
        steps:
          - uses: trufflesecurity/trufflehog@main        # leaked secrets
            with:
              extra_args: --only-verified
          - run: pip install safety && safety check      # CVE audit on deps
          - run: docker build -t devops-app ./backend    # build locally for scanning
          - uses: aquasecurity/trivy-action@master       # OS-level vuln scan
            with:
              severity: 'CRITICAL,HIGH'

The Dockerfile:

    FROM python:3.11-slim AS builder
    WORKDIR /app
    RUN pip install --no-cache-dir flask prometheus-client

    FROM python:3.11-slim
    WORKDIR /app
    COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
    COPY --from=builder /usr/local/bin /usr/local/bin
    COPY app.py .
    RUN useradd -u 10001 appuser && chown -R appuser:appuser /app
    USER appuser
    EXPOSE 5000
    CMD ["python", "app.py"]

Phase 2, VPC and subnets:

    # The entire network fabric
    resource "aws_vpc" "main" {
      cidr_block           = "10.0.0.0/16"
      enable_dns_hostnames = true
      enable_dns_support   = true
    }

    resource "aws_subnet" "public" {
      cidr_block = "10.0.1.0/24"
      ...
    }

    resource "aws_subnet" "private" {
      cidr_block = "10.0.2.0/24"
      ...
    }

Phase 2, SSH key pair:

    resource "tls_private_key" "rsa_key" {
      algorithm = "RSA"
      rsa_bits  = 4096
    }

    resource "aws_key_pair" "app_key" {
      key_name   = "${var.project_name}-key"
      public_key = tls_private_key.rsa_key.public_key_openssh
    }

Phase 3, image build and push:

    - uses: docker/build-push-action@v5
      with:
        context: ./backend
        push: true
        tags: |
          ${{ env.IMAGE_NAME }}:latest
          ${{ env.IMAGE_NAME }}:${{ github.sha }}

Phase 4, Helm values:

    # values.yaml
    replicaCount: 2
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
    autoscaling:
      minReplicas: 2
      maxReplicas: 5
      targetCPUUtilizationPercentage: 70

Phase 4, ArgoCD sync policy:

    # argocd/application.yaml
    syncPolicy:
      automated:
        prune: true      # delete resources removed from Git
        selfHeal: true   # revert any manual cluster changes

Phase 5, Prometheus metrics in the Flask app:

    REQUEST_COUNT = Counter(
        'app_requests_total',
        'Total number of requests',
        ['method', 'endpoint', 'status']
    )
    REQUEST_LATENCY = Histogram(
        'app_request_latency_seconds',
        'Request duration',
        ['endpoint']
    )

AlertManager Rules:

    - alert: HighErrorRate
      expr: rate(app_requests_total{status=~"5.."}[5m]) > 0.1
      for: 2m
      labels:
        severity: critical
    - alert: PodCrashLooping
      expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
      for: 5m
      labels:
        severity: critical

The Goal, a platform where:

- No code reaches production without passing security checks, automatically
- Infrastructure is version-controlled: no manual clicking in AWS consoles
- Deployments are zero-touch: git push is the only operator action
- The cluster corrects itself: manual changes get reverted, failed deploys roll back
- You can see everything: metrics, dashboards, and alerts firing to Slack

The GitOps Loop, when GitHub Actions updates the image tag in values.yaml and pushes the commit:

- ArgoCD detects the change in Git within seconds
- Triggers a rolling update on the cluster with zero downtime
- If health checks fail post-deploy, ArgoCD auto-rolls back to the last healthy state
- If someone manually kubectl apply's something directly to the cluster, ArgoCD reverts it within minutes
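update on the cluster">

The loop above can be pictured as a reconciliation between desired state (what Git says) and live state (what the cluster has). The following is a toy model of one sync pass, assuming drift correction is switched on. It is not ArgoCD's actual implementation; the reconcile function and the dict-based state are hypothetical, and the prune argument is named after the syncPolicy flag.

```python
def reconcile(desired, live, prune=True):
    """Return the actions needed to make `live` match `desired`.

    Both arguments map a resource name to its spec (any comparable value).
    Toy model only: real controllers diff far richer objects.
    """
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))      # drift: go back to what Git says
    if prune:
        for name in live:
            if name not in desired:
                actions.append(("delete", name))  # resource was removed from Git
    return actions

git_state = {"deployment/app": {"image": "app:abc123", "replicas": 2}}
cluster_state = {
    "deployment/app": {"image": "app:abc123", "replicas": 4},  # manual kubectl edit
    "configmap/old": {"key": "value"},                         # no longer in Git
}
print(reconcile(git_state, cluster_state))
```

The manual replica change shows up as an update and the orphaned ConfigMap as a delete; with prune=False only the update would be returned.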