CI/CD Pipeline Best Practices: A Production-Ready Guide for 2026

Why CI/CD Best Practices Matter (And What Breaks Without Them)

Foundation: Version Control & Branching Strategy

Automated Testing as a Quality Gate

Infrastructure as Code (IaC) Integration

Security: Shift-Left in the Pipeline

Deployment Strategies That Reduce Risk

Pipeline Performance Optimization

Observability & Monitoring Integration

Rollback Strategy and Automated Recovery

Common Pitfalls and How to Avoid Them

Implementation Roadmap: Where to Start

Phase 1 — Week 1: Core Gates (Highest ROI)

Phase 2 — Weeks 2–4: Quality & Speed

Phase 3 — Month 2+: Advanced Practices

Measuring Success: DORA Metrics

Putting It Together

Every engineering team eventually reaches the same inflection point: deployments become terrifying. A change that takes 20 minutes to write takes three days to safely ship. The pipeline that was meant to accelerate you is now the thing you dread. The difference between teams that deploy confidently multiple times a day and teams that schedule deployment windows at 2 AM usually isn't tooling — it's the specific practices baked into their pipelines.

This guide covers the CI/CD pipeline best practices that actually matter in production, grounded in the failure scenarios each one prevents. We'll show implementations across GitHub Actions and GitLab CI so you can adapt them regardless of your stack, and close with a phased rollout roadmap so you know where to start.

Why CI/CD Best Practices Matter (And What Breaks Without Them)

The appeal of CI/CD is obvious: faster feedback, fewer integration headaches, reduced deployment risk. But poorly structured pipelines create their own category of failures. The DORA metrics research from Google is instructive here. Elite-performing engineering organizations deploy to production multiple times per day, with a change failure rate below 5%, and recover from incidents in under one hour. The gap between elite and low-performing teams isn't primarily one of tooling sophistication — it's practice quality.

The deployment velocity paradox: teams without solid CI/CD practices often respond to instability by adding gates — manual approvals, deployment freezes, extended QA cycles. Each gate slows the feedback loop, which causes larger, riskier batches of changes, which causes more failures, which causes more gates. The practices below break this cycle. What we're optimizing for: deployment frequency, lead time for changes, change failure rate, and mean time to recovery (the four DORA metrics).

Foundation: Version Control & Branching Strategy

Without this: a team at a SaaS company I consulted for maintained 14 long-lived feature branches simultaneously. The integration sprint before each release took two weeks of merge conflicts, introduced regressions from code written months earlier, and resulted in a 40% change failure rate.
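One way to catch this drift before it becomes an integration sprint is a scheduled CI job that flags long-lived branches. A minimal sketch, assuming input lines produced by `git for-each-ref refs/heads --format='%(refname) %(committerdate:iso-strict)'` (the two-day limit is a suggestion, not a rule baked into any tool):

```javascript
// Sketch: flag branches older than a configurable age limit.
// Input lines look like: "refs/heads/feature-x 2026-01-05T10:00:00Z"
const MS_PER_DAY = 86_400_000;

function staleBranches(lines, nowMs, maxAgeDays = 2) {
  return lines
    .map(line => {
      const [ref, iso] = line.trim().split(/\s+/);
      return {
        name: ref.replace('refs/heads/', ''),
        ageDays: (nowMs - Date.parse(iso)) / MS_PER_DAY,
      };
    })
    // main is long-lived by design; everything else has a deadline
    .filter(b => b.name !== 'main' && b.ageDays > maxAgeDays)
    .map(b => b.name);
}
```

Wire the output into a Slack notification or a failing scheduled job, whichever your team will actually look at.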
The most production-proven branching strategy for CI/CD is trunk-based development: all engineers commit frequently to a single main branch, keeping branches short-lived (under two days). Feature flags decouple deployment from feature release. If your team isn't ready for full trunk-based development, a disciplined GitFlow variant works — but enforce branch lifetime limits and require rebase-before-merge to keep the integration surface manageable.

Branch protection rules are non-negotiable. At minimum: require passing status checks, require at least one pull request review, and enforce the rules for administrators too. The enforce_admins: true setting (or its equivalent) is the detail most teams skip. Every "I'll just push directly this once" incident that causes a major outage was a one-time exception.

Automated Testing as a Quality Gate

Without test gates, the pipeline becomes a deployment conveyor belt that ships regressions as fast as engineers introduce them. A startup I worked with had a 35-minute manual QA cycle that blocked deployments — they cut it to zero by adding automated tests, but only after shipping a broken checkout flow to 100% of users during a sales event.

Structure your test suite around the testing pyramid: fast, isolated unit tests on every commit; integration tests at component boundaries on every PR; and a small set of E2E tests covering only critical paths before deploy. The key insight most teams miss: test order matters. Run fast tests first. A pipeline that runs E2E tests before unit tests will waste 20+ minutes on failures that a 30-second lint check would have caught.

Flaky test management: flaky tests are worse than no tests — they train engineers to ignore failures. Implement a zero-tolerance policy: any test that fails intermittently gets quarantined immediately to a separate flaky suite and doesn't block the pipeline until fixed. Track flakiness rates by test and by author.

Coverage thresholds prevent test debt accumulation. Don't aim for 100% — coverage theater (writing tests that hit lines but assert nothing) is real. Set thresholds that prevent regression, not ones that optimize the metric.

Infrastructure as Code (IaC) Integration

Without this: manual infrastructure changes are the silent killer of deployment reliability.
A team deploys code that works perfectly against their manually-configured staging environment — and fails in production because someone added a firewall rule six months ago and no one documented it. Treat infrastructure like application code: version it, review it, test it in the pipeline.

Drift detection catches when your actual infrastructure diverges from what's in code — usually from manual emergency changes that were never committed.

Security: Shift-Left in the Pipeline

Without this: a Node.js API at a fintech company shipped a dependency with a known critical CVE for four months after the vulnerability was published. No one noticed because security scanning was done quarterly by a separate team. By the time it was patched, it was a board-level incident.

Shift-left means finding security issues at the point where they're cheapest to fix: during development, not in production.

Secrets management: never store secrets in code or in pipeline environment variables set through the UI. Use a secrets manager (AWS Secrets Manager, HashiCorp Vault, GitHub Secrets for non-sensitive CI values) with short-lived credential patterns. Rotate secrets automatically and treat any committed secret as permanently compromised.

Deployment Strategies That Reduce Risk

Without this: big-bang deployments are binary — they work or they don't, and rollback means re-deploying the previous version (assuming you kept it). A mid-size e-commerce team lost $80K in a two-hour incident because a payment service regression wasn't caught until 100% of users hit it.

Blue-green deployment maintains two identical environments. The new version deploys to the inactive environment, gets validated, and traffic switches atomically. Rollback is a DNS or load balancer change. Canary releases shift traffic gradually and watch metrics before full rollout. Feature flags decouple deployment from feature release — ship code on Monday, enable the feature on Friday after the demo.
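As a sketch of what the database-backed option looks like, a minimal flag check might be (names are illustrative; a Map stands in for the flags table, and a real service would add caching, auditing, and a kill switch UI):

```javascript
// Minimal flag service sketch. A Map stands in for the flags table.
const flagStore = new Map([
  ['new-checkout-flow', { enabled: false, rolloutPercent: 0 }],
]);

// Deterministic per-user bucketing: a user lands in the same cohort on
// every request, so partial rollouts are stable rather than flickering.
function bucket(userId) {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100;
}

function isEnabled(flagName, userId) {
  const flag = flagStore.get(flagName);
  if (!flag) return false;                      // unknown flags default off
  if (flag.enabled) return true;                // fully released
  return bucket(userId) < flag.rolloutPercent;  // partial rollout cohort
}

// Releasing (or killing) the feature is a row update, not a redeploy:
flagStore.set('new-checkout-flow', { enabled: true, rolloutPercent: 100 });
```

The important property is the last line: enabling, throttling, or disabling the feature touches data, not the deployed artifact.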
Tools like LaunchDarkly, Unleash, or a simple database-backed flag service give you instant rollback without a redeployment.

Pipeline Performance Optimization

Without this: a 45-minute CI pipeline trains engineers to stop watching it. Context switching happens, PRs pile up, and what was meant to be rapid iteration becomes a slow ceremony. Target: a sub-15-minute full pipeline for the critical path.

- Parallelization is the highest-leverage optimization: split test suites into shards that run concurrently.
- Dependency caching eliminates redundant package downloads.
- Layer caching for Docker builds: order Dockerfile instructions from least to most frequently changed.
- Path filtering skips the full pipeline when only docs changed.

GitOps

Without this: teams end up with pipeline scripts that directly kubectl apply or ansible-playbook from CI, creating a situation where the cluster state is only reproducible if you know which pipeline job last touched it. Recovering from a cluster incident becomes an archaeology project.

GitOps makes the desired cluster state declarative and version-controlled. A GitOps controller (ArgoCD, Flux) continuously reconciles actual state with the desired state in git. The CI pipeline's job changes from "deploy the thing" to "update the manifest repo" — a smaller, safer, auditable operation. Every production change has a corresponding git commit with author, message, and timestamp.

Observability & Monitoring Integration

Without this: you get an alert that a deployment caused a spike in error rates — but your monitoring tool has no record that a deployment even happened, so you're correlating timestamps manually. Track deployments as events in your observability stack.

Build a pipeline metrics dashboard tracking: build duration over time (catches pipeline regression), test success rate (catches flaky test growth), deployment frequency (the primary DORA metric), and rollback rate (a leading indicator of change failure rate).

Rollback Strategy and Automated Recovery

Without this: the worst time to design your rollback strategy is during an incident.
Teams without a pre-baked rollback plan spend precious MTTR minutes in Slack discussing how to revert. Define rollback as a one-command operation.

For database migrations, the standard recommendation is: all migrations must be backwards-compatible with the previous version of the application. This means you never drop a column in the same release in which the application stops using it — remove the reads and writes first, then drop the column in a later release.

Common Pitfalls and How to Avoid Them

- Over-engineering the initial pipeline: the urge to implement the full list on day one leads to a complex pipeline that nobody understands and everyone wants to bypass. Start with version control gates, unit tests, and automated deployment. Add practices as pain emerges.
- Ignoring pipeline maintenance debt: pipeline configurations rot. Dependencies go stale, cached layers become huge, test environments drift. Schedule regular pipeline health reviews the same way you schedule dependency updates.
- Skipping rollback testing: most teams have a rollback procedure but have never actually run it against production. Practice rollback in staging quarterly. The first time your rollback procedure runs should not be during a P0 incident.
- Manual approvals as bottlenecks: manual approval gates feel safe but accumulate latency. If a deployment requires four manual approvals and each approver has a two-hour response time, you have an eight-hour floor on deployment lead time. Replace manual approvals with automated quality gates wherever possible.
- Treating the pipeline as a black box: engineers who don't understand the pipeline's structure can't improve it or debug it when it breaks. Document the pipeline architecture, make sure every engineer understands the stages, and conduct blameless pipeline post-mortems after significant failures.

Implementation Roadmap: Where to Start

The biggest mistake teams make is attempting a complete pipeline overhaul. Instead, layer improvements.
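One improvement worth layering in early: the backwards-compatible migration rule from the rollback section can be enforced mechanically in CI. A rough sketch of such a guard (a hypothetical helper; a real check would parse the SQL and the application source properly rather than string-match):

```javascript
// Sketch of a CI guard for the expand/contract rule: fail the build if a
// migration drops a column that the application source still references.
function violatesExpandContract(migrationSql, appSource) {
  const drop = migrationSql.match(/DROP\s+COLUMN\s+(\w+)/i);
  if (!drop) return false;            // no destructive change in this migration
  const column = drop[1];
  return appSource.includes(column);  // drop is premature if code still uses it
}
```

Even a crude check like this turns "we forgot the rollback couldn't read that table" from an incident into a failed PR.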
Practice prioritization matrix: when choosing what to implement next, score each practice on two dimensions: impact on DORA metrics and implementation complexity.

- High impact + low complexity: branch protection, secret scanning, dependency caching.
- High impact + medium complexity: canary releases, automated rollback.
- High impact + high complexity: full GitOps implementation. Worth the investment, but it shouldn't come first.

Measuring Success: DORA Metrics

DORA metrics are the industry-standard benchmark for software delivery performance. They correlate strongly with organizational performance and are what elite engineering organizations track. Track them monthly and plot trends over quarters. The goal isn't to hit "elite" immediately — it's to be consistently improving. Complement DORA with pipeline-specific metrics: mean pipeline duration, pipeline success rate, flaky test rate, and time spent waiting for review.

Putting It Together

The teams that deploy with confidence aren't running more sophisticated tools — they've internalized that the pipeline is a quality accelerator, not a box to check. Every practice in this guide exists because someone, somewhere, skipped it and paid the price.

Start with the Phase 1 practices. Ship something this week. Measure your DORA metrics baseline. Add practices where the data shows pain. A CI/CD pipeline isn't a project you complete — it's a system you continuously improve.

For teams deploying microservices, the deployment strategy section pairs closely with a microservices architecture guide that covers service-specific pipeline patterns. If you're running serverless infrastructure, the IaC section is particularly relevant to AWS Lambda and serverless pipelines.
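A DORA baseline doesn't need a product to get started. Two of the four metrics fall out of a simple deployment log; a sketch (the record shape is hypothetical, so adapt it to whatever your CI actually emits):

```javascript
// Sketch: derive two DORA metrics from a deployment log.
const deployments = [
  { at: '2026-01-05T10:00:00Z', causedIncident: false },
  { at: '2026-01-06T15:30:00Z', causedIncident: true },
  { at: '2026-01-07T09:10:00Z', causedIncident: false },
  { at: '2026-01-09T11:45:00Z', causedIncident: false },
];

function doraSnapshot(log, periodDays) {
  const failures = log.filter(d => d.causedIncident).length;
  return {
    deploymentsPerDay: log.length / periodDays,  // deployment frequency
    changeFailureRate: failures / log.length,    // share of deploys causing incidents
  };
}

doraSnapshot(deployments, 7);
// changeFailureRate: 0.25; deploymentsPerDay: 4/7 ≈ 0.57
```

Lead time and MTTR need two more timestamps per record (commit time, incident resolution time), but the shape of the computation is the same.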

Code Examples & Checklists

```yaml
# GitHub: branch protection via API or repository settings
# Require status checks before merging:
required_status_checks:
  strict: true  # require branch to be up to date
  contexts:
    - "ci/unit-tests"
    - "ci/lint"
    - "ci/security-scan"
# Require pull request reviews:
required_pull_request_reviews:
  required_approving_review_count: 1
  dismiss_stale_reviews: true
# Enforce for admins too — no emergency bypasses:
enforce_admins: true
```

```yaml
# GitLab: protected branches are configured in the UI, not .gitlab-ci.yml
# Settings > Repository > Protected Branches:
#   Push: No one (merge requests only)
#   Merge: Maintainers
#   Code owner approval: Required
```

```yaml
# GitHub Actions: staged test execution
jobs:
  fast-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: npm run lint
      - name: Type check
        run: npm run type-check
      - name: Unit tests
        run: npm test -- --coverage --ci
  integration-tests:
    needs: fast-checks  # only run if fast checks pass
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
    steps:
      - uses: actions/checkout@v4
      - name: Integration tests
        run: npm run test:integration
  e2e-tests:
    needs: integration-tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: E2E tests
        run: npx playwright test --project=chromium
```

```yaml
# GitLab CI equivalent:
stages:
  - fast-checks
  - integration
  - e2e

lint-and-unit:
  stage: fast-checks
  script:
    - npm run lint
    - npm test -- --ci --coverage

integration:
  stage: integration
  needs: ["lint-and-unit"]
  services:
    - postgres:16
  script:
    - npm run test:integration

e2e:
  stage: e2e
  needs: ["integration"]
  script:
    - npx playwright test
```

```javascript
// package.json or jest.config.js
coverageThreshold: {
  global: {
    branches: 70,
    functions: 80,
    lines: 80,
    statements: 80,
  },
},
```

```yaml
# GitHub Actions: Terraform validation pipeline
jobs:
  terraform-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.7"
      - name: Terraform format check
        run: terraform fmt -check -recursive
        working-directory: ./infrastructure
      - name: Terraform validate
        run: |
          terraform init -backend=false
          terraform validate
        working-directory: ./infrastructure
      - name: Terraform plan (PR only)
        if: github.event_name == 'pull_request'
        run: terraform plan -no-color
        working-directory: ./infrastructure
        env:
          TF_VAR_environment: staging
      - name: tfsec security scan
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working-directory: ./infrastructure
```

```shell
# Run terraform plan in "detect drift" mode (no changes allowed)
terraform plan -detailed-exitcode
# Exit code 2 means drift detected — alert the team
```

```yaml
# GitHub Actions: comprehensive security scanning stage
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Dependency vulnerability scanning
      - name: Dependency audit
        run: npm audit --audit-level=high
      # SAST: static code analysis
      - name: CodeQL analysis
        uses: github/codeql-action/analyze@v3
        with:
          languages: javascript
      # Secret scanning (prevent secrets from being committed)
      - name: Gitleaks secret scan
        uses: gitleaks/gitleaks-action@v2
      # Container image scanning
      - name: Build and scan container
        run: |
          docker build -t app:${{ github.sha }} .
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            aquasec/trivy:latest image \
            --exit-code 1 \
            --severity CRITICAL \
            app:${{ github.sha }}
```

```yaml
# GitLab CI: blue-green with AWS ALB
deploy-green:
  stage: deploy
  script:
    - aws ecs update-service --cluster prod --service app-green --task-definition app:$CI_PIPELINE_IID
    - aws ecs wait services-stable --cluster prod --services app-green
    # Run smoke tests against green target group
    - ./scripts/smoke-test.sh $GREEN_URL
    # Shift 100% traffic to green
    - aws elbv2 modify-rule --rule-arn $ALB_RULE_ARN --actions Type=forward,TargetGroupArn=$GREEN_TG_ARN
  only:
    - main
```

```yaml
# Canary: shift 5% traffic, monitor for 10 minutes, then full rollout
deploy-canary:
  stage: canary
  script:
    - ./scripts/deploy-canary.sh --weight 5
    - sleep 600  # 10 minute observation window
    - ./scripts/check-error-rate.sh --threshold 0.5  # fail if >0.5% errors
    - ./scripts/deploy-canary.sh --weight 100
```

```yaml
# GitHub Actions: parallel test shards
strategy:
  matrix:
    shard: [1, 2, 3, 4]  # 4 parallel runners
steps:
  - name: Run test shard
    run: npx jest --shard=${{ matrix.shard }}/4
```

```yaml
# GitHub Actions: intelligent npm cache
- name: Cache node modules
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-npm-
```

```yaml
# GitLab CI:
cache:
  key:
    files:
      - package-lock.json
  paths:
    - node_modules/
```

```dockerfile
# Good: dependency layer (changes rarely) before app code layer (changes often)
FROM node:22-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production   # this layer is cached unless package.json changes
COPY src/ ./src/               # this layer rebuilds on every code change
CMD ["node", "src/index.js"]
```

```yaml
# GitHub Actions: path filtering
on:
  push:
    paths-ignore:
      - '**.md'
      - 'docs/**'
```

```yaml
# ArgoCD Application manifest — the pipeline updates this repo,
# ArgoCD deploys it
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/your-org/k8s-manifests
    targetRevision: main
    path: apps/api-service/production
  destination:
    server: https://kubernetes.default.svc
    namespace: api-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true  # re-apply if someone manually changes cluster state
    syncOptions:
      - CreateNamespace=true
```

```yaml
# GitHub Actions: annotate deployment in Datadog
- name: Send deployment event to Datadog
  run: |
    curl -X POST "https://api.datadoghq.com/api/v1/events" \
      -H "Content-Type: application/json" \
      -H "DD-API-KEY: ${{ secrets.DATADOG_API_KEY }}" \
      -d '{
        "title": "Deployment: api-service '${{ github.sha }}'",
        "text": "Deployed by ${{ github.actor }}",
        "tags": ["service:api-service", "env:production", "source:ci"],
        "alert_type": "info"
      }'
```

```shell
# Deployment script: record the current version before deploying
PREVIOUS_VERSION=$(kubectl get deployment api-service \
  -o jsonpath='{.spec.template.spec.containers[0].image}')
echo "PREVIOUS_VERSION=$PREVIOUS_VERSION" >> $GITHUB_ENV

# Automated rollback triggered by error rate threshold
if ./scripts/check-health.sh --timeout 300 --error-threshold 1; then
  echo "Deploy successful"
else
  echo "Health check failed — rolling back"
  kubectl set image deployment/api-service api=$PREVIOUS_VERSION
  exit 1
fi
```

The four DORA metrics:

- Deployment frequency: How often you can reliably release
- Lead time for changes: Time from code commit to production
- Change failure rate: Percentage of deployments causing incidents
- Mean time to recovery (MTTR): How fast you resolve incidents

The testing pyramid:

- Unit tests — fast (milliseconds each), isolated, run on every commit
- Integration tests — test component boundaries, run on every PR
- E2E tests — validate critical paths only, run pre-deploy

Phase 1 — Week 1: Core Gates (Highest ROI)

- [ ] Enable branch protection: require PR reviews and status checks
- [ ] Add linting and static analysis to CI (catches the fastest category of bugs)
- [ ] Run unit tests on every commit
- [ ] Add secret scanning (cheap to implement, and the risk of not having it is severe)

Phase 2 — Weeks 2–4: Quality & Speed

- [ ] Add integration tests with test environment services
- [ ] Implement dependency caching
- [ ] Add dependency vulnerability scanning
- [ ] Implement automated deployment to staging on merge to main

Phase 3 — Month 2+: Advanced Practices

- [ ] Implement canary releases or blue-green deployment
- [ ] Add container security scanning
- [ ] Set up deployment event tracking in your observability stack
- [ ] Implement GitOps if on Kubernetes
- [ ] Build DORA metrics dashboard

The two prioritization dimensions:

- Impact on DORA metrics: Does this directly improve deployment frequency, lead time, failure rate, or MTTR?
- Implementation complexity: How long does it take to implement and maintain?

Pipeline-specific metrics to complement DORA:

- Mean pipeline duration (trend: should be flat or decreasing)
- Pipeline success rate (trend: should be increasing)
- Flaky test rate (trend: should be decreasing toward zero)
- Time spent waiting for review (identifies bottlenecks in the human parts of the pipeline)