Tools: Update: Our Terraform Drift Went Undetected for Four Months. Here Is How We Found It.

How It Started

How We Found It

What We Did to Untangle It

What We Put In Place After

Where We Are Now

In my last post I talked about state file corruption and the mess that comes with it. This one is about a quieter problem that lived alongside it for months without us noticing: infrastructure drift.

Drift is what happens when the actual state of your cloud resources stops matching what your Terraform code says they should be. It can happen for obvious reasons, like a developer tweaking a security group rule in the console during an incident. But it can also accumulate slowly and invisibly over weeks through automated processes, manual patches, and one-off fixes that never made it back into code. That second kind is what got us.

How It Started

We had been running a mix of production and staging workloads across three AWS accounts since early 2024. Our Terraform setup was the standard arrangement: S3 remote state, DynamoDB locking, and workspaces for environment separation. The team was disciplined about infrastructure changes for the most part. PRs went through review, applies were done through the pipeline, and we had a rule that console changes were for emergencies only.

The rule held about 80% of the time. The other 20% was what caused the problem. During incidents, people make console changes because it is faster than writing Terraform, opening a PR, and waiting for a pipeline. A security group rule gets widened to unblock something at 11pm. An EC2 instance gets a tag added so billing can track a cost spike. A CloudWatch alarm threshold gets bumped because it was firing too often. None of these feel significant in the moment. None of them get turned into Terraform code afterwards, because the incident is over and there are other things to do.

Over four months, that 20% added up to 47 individual resources across our three AWS accounts that had diverged from what our Terraform code described. We didn't know this until a routine audit in January 2026. We found it by accident.
How We Found It

We were running a terraform plan on our staging environment to test a new module we were writing. The plan came back with a wall of changes that had nothing to do with the module we were adding. Resources we hadn't touched in months were showing up as needing modifications. Some showed diffs on fields we didn't recognise changing. One showed a resource that Terraform thought existed but was actually gone.

We ran terraform plan across our other workspaces and found the same pattern everywhere. The cumulative drift from four months of incident-time console changes was sitting there, waiting for the next apply to either overwrite it or explode trying.

The dangerous part was not the drift itself. It was the fact that some of the drifted state was intentional. The security group rule that was widened during that 11pm incident had stayed widened because widening it had actually fixed the underlying problem. If we had blindly run terraform apply, it would have put that rule back to its original, tighter setting and broken production again.

What We Did to Untangle It

Reconciling four months of drift manually was unpleasant. We went resource by resource through the plan output, and for each drifted resource asked two questions: is the current live state correct, or is the Terraform code correct? And who changed it, and when?

For about 60% of the drifted resources, the live state was correct and the Terraform code needed updating to match it. For 30%, the Terraform code was correct and the live change was genuinely stale and could be overwritten. For the remaining 10%, we genuinely couldn't tell and had to track down the people involved to get context.

The reconciliation took three engineers two full days. At the end of it, we ran a clean plan across all environments and saw no unexpected changes for the first time in months. We came out of this with three changes that have stuck.
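As an aside, the resource-by-resource pass started with bucketing the plan output by proposed action. A rough sketch of that first step, assuming the plan text was captured to a file; the resource addresses below are invented sample data standing in for a real capture:

```shell
#!/bin/sh
# Bucket a captured `terraform plan` by proposed action, so each drifted
# resource can be triaged: keep the live state (and update the code),
# or re-apply the code (and overwrite the live change).
# Invented sample lines stand in for a real plan capture.
cat > plan-output.txt <<'EOF'
  # aws_security_group_rule.api_ingress will be updated in-place
  # aws_cloudwatch_metric_alarm.cpu_high will be updated in-place
  # aws_instance.worker will be destroyed
EOF

# Resources where live state and code disagree: decide which side wins.
grep 'will be updated' plan-output.txt | awk '{print $2}' > to-triage.txt

# Resources the plan wants to remove: confirm before any apply.
grep 'will be destroyed' plan-output.txt | awk '{print $2}' > to-confirm.txt

cat to-triage.txt
```

Where the live state won, the fix was editing the HCL until the diff disappeared; for a resource that is gone entirely, terraform state rm <address> drops it from state so the plan stops reporting it.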
What We Put In Place After

Drift detection on a schedule

We set up a GitHub Actions workflow that runs terraform plan in read-only mode across all our workspaces every night and posts a summary to a Slack channel. If the plan is clean, it posts a green check. If there are unexpected diffs, it posts a list of the affected resources. The first week it ran, we caught two new drifts within 24 hours of them happening instead of four months later. A feedback loop that tight changes how people think about console changes, because they know it will show up in Slack the next morning.

A proper incident infrastructure runbook

The root cause of most of our drift was that engineers had no fast, legitimate path for making temporary infrastructure changes during incidents. Console was faster than Terraform, so console won. We wrote a short runbook for incident-time infrastructure changes. It says: if you make a console change during an incident, you open a follow-up ticket before the incident is closed, and that ticket must include a Terraform PR to codify the change within 48 hours. The on-call engineer who closes the incident is responsible for making sure the ticket exists. It is not a technical solution; it is a process one. But it has worked better than we expected, because it does not add friction to the incident itself. It just requires a small step at close-out, when the pressure is off.

Tagging console-created resources

We turned on an AWS Config rule that flags any resource created without a specific terraform-managed tag. Resources created through our pipelines get the tag automatically; resources created through the console do not. This gives us a live view in Config of anything in our accounts that has no corresponding Terraform management. It does not prevent console changes, but it makes them visible immediately rather than discoverable only when someone runs a plan.

Where We Are Now

Four months after putting these in place, we have had zero multi-week drift go undetected.
The nightly plan check has caught eight individual drifts. Six of those were small tag changes. One was a security group modification. One was a manually created S3 bucket that a developer had spun up to test something and forgotten about. None of them had time to compound into the four-month backlog we had to deal with in January.

The honest truth is that drift is not a Terraform problem. It is a team behaviour problem that Terraform exposes. The technical tooling helps, but what actually fixed it for us was making drift visible quickly enough that it stayed a small problem instead of growing into a large one. If you are not running regular plan checks in your environments, you are almost certainly accumulating drift right now. The question is just whether you will find it in days or months.
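For reference, the tagging check described under "Tagging console-created resources" can be expressed with AWS Config's managed REQUIRED_TAGS rule. A sketch of the rule definition you would hand to the CLI; the rule name and tag key here are our own choices, and in practice the managed rule only evaluates the resource types it supports:

```shell
#!/bin/sh
# Write out an AWS Config rule definition that flags resources missing a
# terraform-managed tag, using AWS's managed REQUIRED_TAGS rule.
cat > required-tags-rule.json <<'EOF'
{
  "ConfigRuleName": "require-terraform-managed-tag",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "REQUIRED_TAGS"
  },
  "InputParameters": "{\"tag1Key\": \"terraform-managed\"}"
}
EOF

# The actual call needs AWS credentials, so it is shown commented out:
# aws configservice put-config-rule --config-rule file://required-tags-rule.json
cat required-tags-rule.json
```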

The drift detection workflow:

name: Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'   # every night at 06:00 UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Run Plan
        # -detailed-exitcode makes plan exit 2 when there is a diff, which
        # fails this step and triggers the Slack notification below.
        shell: bash   # explicit bash enables pipefail, so the exit code survives the pipe through tee
        run: |
          terraform init
          terraform plan -detailed-exitcode 2>&1 | tee plan-output.txt
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Notify Slack
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {"text": "Terraform drift detected in ${{ github.workflow }}. Check the Actions run for details."}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
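The workflow above plans a single workspace; the nightly job loops over all of them and fails if any workspace shows a diff. A minimal sketch of that loop, with a stub function standing in for Terraform so the exit-code handling is visible without cloud access; the workspace names are made up, and the real invocation would be terraform workspace select "$ws" && terraform plan -detailed-exitcode -lock=false:

```shell
#!/bin/sh
# Sketch of the nightly loop over workspaces.
# `terraform plan -detailed-exitcode` exits 0 when clean, 2 when there is
# a diff, and 1 on error. A stub stands in for the real command here.
plan_workspace() {
  # Hypothetical stub: pretend the "staging" workspace has drifted.
  case "$1" in
    staging) return 2 ;;
    *)       return 0 ;;
  esac
}

drifted=""
for ws in production staging tooling; do   # workspace names are made up
  rc=0
  plan_workspace "$ws" || rc=$?
  if [ "$rc" -eq 2 ]; then
    echo "drift detected in workspace: $ws"
    drifted="$drifted $ws"
  elif [ "$rc" -ne 0 ]; then
    echo "plan failed in workspace: $ws"
  else
    echo "clean: $ws"
  fi
done
```

The -lock=false flag matters for a read-only check like this: the nightly plan should never hold the DynamoDB state lock and block a real apply.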