Infrastructure as Code: the plan, the state file, and the drift

IaC declares desired infrastructure in version-controlled code; the tool diffs it against a recorded state file and computes a plan. That state file is both the source of truth and the thing that causes an incident when two applies race or someone makes a change in the console.

Two engineers ship a hotfix on Friday. Both run terraform apply from their laptops against the same project, ninety seconds apart. There is no remote lock — state lives in an S3 bucket but nobody wired up the lock table. The second apply reads a state file the first has not finished writing, computes a plan against a half-written reality, and recreates a load balancer that already existed. Production drops connections for four minutes, and the terraform.tfstate in the bucket now disagrees with what AWS actually has. The outage was not a bad config. It was two writers and no lock.

By the end of this lesson you will know exactly why the state file causes that outage and which three disciplines keep it from happening to you.

Declarative: you describe the destination, not the route

Why does this matter to you as an engineer? Because every manual click in the console is invisible to the rest of the team — no PR, no diff, no blame history. IaC fixes that.

Clicking through a cloud console is imperative — you perform steps, and the result lives only in the provider and in your memory. Infrastructure as Code flips this: you write the desired end state in version-controlled files (HCL for Terraform/OpenTofu, real TypeScript/Go/Python for Pulumi), and the tool figures out the steps to get there. You do not say “create a VPC, then a subnet, then a NAT gateway.” You declare that those resources exist with these properties, and the tool computes the dependency graph and the order.

This is the whole payoff. Because the desired state is text in git, your environments become reproducible (spin up an identical staging from the same module), reviewable (infra changes go through pull requests like any code), and auditable (the diff and the blame history say who changed what and when). The console gives you none of that — a change there leaves no review, no diff, no history beyond a thin audit log.

Declarative code only works if apply is idempotent: running it twice must converge to the same result, not stack up duplicates. The tool can guarantee this only because it remembers what it already built, which is exactly what the state file is for.
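
To make the link between idempotency and remembered state concrete, here is a minimal toy model in Python. It is not how Terraform is implemented; ToyCloud, apply, and the res-N IDs are hypothetical names invented for this sketch. The only point is that a second run does nothing when state is kept, and creates a duplicate when state is lost.

```python
class ToyCloud:
    """Hypothetical provider: every create call hands out a fresh real ID."""

    def __init__(self):
        self.resources = {}   # real ID -> properties
        self._next_id = 1

    def create(self, props):
        real_id = f"res-{self._next_id}"
        self._next_id += 1
        self.resources[real_id] = dict(props)
        return real_id


def apply(config, state, cloud):
    """Create only the resources that state does not already track."""
    for name, props in config.items():
        if name not in state:             # not built yet, so build it
            state[name] = cloud.create(props)
    return state


config = {"aws_lb.web": {"port": 443}}
cloud = ToyCloud()

state = apply(config, {}, cloud)          # first run creates the load balancer
state = apply(config, state, cloud)       # second run finds it in state: no-op
after_two_runs = len(cloud.resources)
print(after_two_runs)

# Same config, but the state was lost: the tool creates a duplicate.
apply(config, {}, cloud)
print(len(cloud.resources))
```

The second print shows one more resource than the first, even though the config never changed. That gap is what the state file exists to prevent.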

The plan/apply cycle is a diff engine

The core loop is two commands. terraform plan reads three things — your config (desired state), the recorded state file (what the tool last built), and reality (it refreshes by querying the provider) — and prints the diff: what it will create, change, or destroy. Nothing happens yet. terraform apply executes that diff and writes the new reality back into the state file.

Read that again: plan and apply both run a refresh first, reconciling the state file against the live provider before computing anything. That refresh step is where drift surfaces, and it is why the state file is not optional bookkeeping — it is the map from your config’s symbolic names (aws_lb.web) to real provider IDs (arn:aws:elasticloadbalancing:...). Lose that map and the tool no longer knows which real resource your code refers to.

| Input to `plan` | What it represents | If it is wrong… |
| --- | --- | --- |
| Config (`.tf` files) | Desired state: where you want to be | Plan proposes the wrong change; caught in PR review |
| State file (`terraform.tfstate`) | Last-known reality + the id↔resource map | Tool loses track of real resources → recreates or orphans them |
| Refresh (live provider query) | Actual reality, right now | Drift appears in the plan as unexpected changes |
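
The three inputs in the table can be sketched as a small diff function. This is a simplified illustration in Python, assuming state is a dict from symbolic name to a real ID plus recorded attributes; the resource names, IDs, and the refresh and plan helpers are all hypothetical, and a real tool does far more (dependency ordering, replacement vs in-place update, and so on).

```python
def refresh(state, reality):
    """Re-read live attributes, but only for IDs that state already tracks."""
    refreshed = {}
    for name, entry in state.items():
        live = reality.get(entry["id"])
        if live is not None:              # resource still exists in the provider
            refreshed[name] = {"id": entry["id"], "attrs": dict(live)}
    return refreshed


def plan(config, state, reality):
    """Return (action, name) pairs; nothing is changed by planning."""
    current = refresh(state, reality)
    actions = []
    for name, desired in config.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name]["attrs"] != desired:
            actions.append(("update", name))
    for name in current:
        if name not in config:
            actions.append(("destroy", name))
    return actions


config = {"aws_lb.web": {"port": 443}, "aws_s3_bucket.logs": {"versioning": True}}
state = {
    "aws_lb.web": {"id": "lb-123", "attrs": {"port": 443}},
    "aws_instance.old": {"id": "i-9", "attrs": {"size": "small"}},
}
# Someone changed the port by hand; "lb-999" was created outside the tool.
reality = {"lb-123": {"port": 80}, "i-9": {"size": "small"}, "lb-999": {"port": 443}}

print(plan(config, state, reality))
print(plan(config, {}, reality))          # same config with the state file lost
```

With state intact, the hand-edited port shows up as an update. With state lost, the same config plans a create for a load balancer that already exists, because nothing maps aws_lb.web to lb-123 any more.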

The state file is the heart and the hazard

Everything good about IaC routes through the state file, and so does everything dangerous. Three properties make a senior treat it with care.

First, it can hold secrets in plaintext. These tools serialize resource attributes into state, so in Terraform a database password, an API key, or a generated private key that appears as an output or an attribute sits unencrypted in the file by default; marking a value as sensitive hides it from CLI output, not from state. Anyone with read access to that bucket has your secrets. The mitigation is to never route secrets through state as outputs: keep them in a secrets manager and have apps fetch them at runtime. Be aware that a secret passed to a resource as an ordinary argument during apply is typically recorded in state as well, so encrypt the bucket and restrict who can read it. OpenTofu added built-in state encryption to harden this; Pulumi encrypts values marked as secrets per-stack as a first-class feature.
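
One way to see the exposure for yourself is to scan a state file for attribute names that look secret. The sketch below is a rough heuristic in Python, not an official tool. It assumes the JSON layout of resources, instances, and attributes used by current state files, and the sample state, the keyword list, and the function name are made up for illustration.

```python
import json

# Assumed keyword list; tune it for your own providers.
SUSPICIOUS = ("password", "secret", "private_key", "token")


def find_secret_like_attributes(state_text):
    """Return (resource address, attribute name) pairs whose names look secret."""
    state = json.loads(state_text)
    hits = []
    for resource in state.get("resources", []):
        address = f'{resource["type"]}.{resource["name"]}'
        for instance in resource.get("instances", []):
            for key, value in instance.get("attributes", {}).items():
                # Only report attributes that actually hold a value.
                if value and any(word in key.lower() for word in SUSPICIOUS):
                    hits.append((address, key))
    return hits


# Hypothetical, heavily trimmed state content for illustration only.
sample_state = json.dumps({
    "version": 4,
    "resources": [{
        "type": "aws_db_instance",
        "name": "main",
        "instances": [{"attributes": {"username": "app", "password": "hunter2"}}],
    }],
})

print(find_secret_like_attributes(sample_state))
```

A name-based scan will miss secrets stored under innocuous keys, so treat an empty result as weak evidence, not as proof that the file is clean.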

Second, it must live in a remote backend, not on a laptop. Local state means one person owns reality and the team can’t collaborate. The standard is a remote backend (S3, GCS, or a hosted service such as HCP Terraform) with versioning on, so a corrupted or truncated state can be rolled back to the last good version.

Third, it must be locked. This is the one whose absence causes incidents.

Together these three properties mean the state file demands the same care as a production database: miss any one of them and you get the outage from the opening story, a leaked credential, or a half-written state you can’t recover.

Locking: why concurrent applies corrupt state

A write to the state file is not atomic across a team. If two applies run at once, both read the old state, both compute plans against it, and both write back: the second clobbers the first, and the file no longer describes what either run actually built. The fix is a lock. Before any write operation, the backend acquires an exclusive lock, and any second apply waits or fails fast with Error acquiring the state lock instead of racing. For S3, the native lockfile enabled with use_lockfile = true is now the recommended path; it is opt-in rather than on by default, and the older DynamoDB lock table is deprecated but still works.
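
The race is easier to see when the two phases of an apply are interleaved by hand. The following Python sketch is a deterministic toy, with a hypothetical ToyBackend and made-up IDs; it models only "read and plan" followed by "execute and write", which is enough to show the lost update and how a lock stops the second writer.

```python
class ToyBackend:
    """Hypothetical remote backend: shared state plus an optional lock."""

    def __init__(self):
        self.state = {}          # symbolic name -> real ID
        self.created = []        # every real resource ever created
        self.lock_holder = None

    def acquire_lock(self, who):
        if self.lock_holder is not None:
            raise RuntimeError("Error acquiring the state lock")
        self.lock_holder = who

    def release_lock(self):
        self.lock_holder = None


def read_and_plan(backend):
    """Phase 1: snapshot state and decide whether the load balancer is missing."""
    snapshot = dict(backend.state)
    return snapshot, "aws_lb.web" not in snapshot


def execute_and_write(backend, snapshot, needs_create, real_id):
    """Phase 2: create if planned, then overwrite the shared state."""
    if needs_create:
        backend.created.append(real_id)
        snapshot["aws_lb.web"] = real_id
    backend.state = snapshot


# No lock: both engineers read before either writes.
racy = ToyBackend()
plan_a = read_and_plan(racy)
plan_b = read_and_plan(racy)
execute_and_write(racy, *plan_a, real_id="lb-A")
execute_and_write(racy, *plan_b, real_id="lb-B")
print(racy.created, racy.state)

# With a lock: the second apply fails fast instead of racing.
safe = ToyBackend()
safe.acquire_lock("engineer-a")
try:
    safe.acquire_lock("engineer-b")
    second_ran = True
except RuntimeError:
    second_ran = False
plan_a = read_and_plan(safe)
execute_and_write(safe, *plan_a, real_id="lb-A")
safe.release_lock()
print(safe.created, second_ran)
```

In the unlocked run two load balancers exist but state records only the second, so the first is orphaned. In the locked run the second engineer is turned away and retries later against a state that already contains the load balancer.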

The senior knows the sharp edges here. terraform apply -lock=false disables locking and is how you reproduce the opening outage on purpose. terraform force-unlock exists for stale locks left by a crashed run, but running it while another apply is still active removes that run’s protection: a second writer can get in and leave state half-updated and corrupt. The discipline in CI: a concurrency group so jobs never overlap, a -lock-timeout (say 10m) so legitimate in-flight runs are waited on rather than failed, and a plan after any forced unlock to verify state is consistent before the next apply.

Why this works

“Why not just diff against the live cloud every time and skip the state file?” Because a refresh only tells you the current attributes of resources the tool already knows about — it has no way to know that the load balancer named web in your account is the one your aws_lb.web block manages, versus one created by hand or by another team. The state file is the identity map. Without it, plan can’t tell “change this resource” from “create a new one,” which is precisely how a missing state file leads to duplicate infrastructure.

Drift: when reality wanders off

Drift is when the real world diverges from the state file, almost always because someone changed infra out-of-band (a console click, an emergency AWS CLI patch, a different tool). The next plan refreshes, sees the difference, and reports it. Now you face a senior judgment call: is the manual change correct (then update your config to match, so the next apply doesn’t revert it) or is it unwanted (then let apply restore the declared state)?

The trap is the silent revert. Someone hand-bumps a security group rule in an incident at 2am; nobody updates the code; a routine apply on Tuesday quietly removes the rule because it isn’t in the desired state. So when you see an unexpected change in a plan, resist the reflex to apply immediately: that plan might be erasing a colleague’s emergency fix. IaC will always fight drift back to the declaration. That is the feature, and the foot-gun. Detect drift deliberately with terraform plan -refresh-only (safer than the deprecated terraform refresh command, which updates state without asking you to review the changes first), ideally on a schedule, so drift is reviewed before an apply silently resolves it.
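
The idea behind a refresh-only check, comparing what state recorded with what the provider reports and changing nothing, can be sketched in a few lines of Python. This is an illustrative model only: the state shape, the resource names, and refresh_only_report are assumptions made for the example, not Terraform's internals or output format.

```python
def refresh_only_report(state, reality):
    """Compare recorded attributes with live ones; modify nothing.

    Returns {resource name: {attribute: (recorded, live)}} for each difference.
    """
    report = {}
    for name, entry in state.items():
        live = reality.get(entry["id"], {})
        changed = {
            key: (entry["attrs"].get(key), live.get(key))
            for key in set(entry["attrs"]) | set(live)
            if entry["attrs"].get(key) != live.get(key)
        }
        if changed:
            report[name] = changed
    return report


state = {
    "aws_security_group.api": {"id": "sg-1", "attrs": {"ingress_ports": [443]}},
    "aws_lb.web": {"id": "lb-123", "attrs": {"port": 443}},
}
# 2am hand edit: port 8443 was opened in the console during an incident.
reality = {"sg-1": {"ingress_ports": [443, 8443]}, "lb-123": {"port": 443}}

drift = refresh_only_report(state, reality)
print(drift)
```

The report names the security group and shows the recorded and live values side by side, which is the moment to ask whether the hand edit belongs in the config before anyone runs apply.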

The deeper cure is immutable infrastructure: stop hand-mutating servers and instead replace them — bake a new image, roll it out, destroy the old. When nothing is mutated in place, there is far less surface for drift, and rollback is “deploy the previous image” instead of “remember every manual tweak.”

Pick the best fit

A teammate ran a manual console fix during an incident. The next terraform plan now shows a change reverting it. What does a senior do?

Quiz

Two engineers run terraform apply against the same project at the same time, with no state lock configured. What is the core risk?

Quiz

Why must the state file usually live in a remote backend rather than on an engineer's laptop?

Order the steps

Order what happens during a single safe terraform apply:

1 Acquire the remote state lock so no other apply can write concurrently
2 Refresh: query the live provider to update the state file with current reality
3 Compute the diff between desired config and refreshed state — the plan
4 Execute the plan against the provider, creating/changing/destroying resources
5 Write the new reality back into the state file and release the lock

plan diffs your desired config against the recorded state (refreshed against the live provider); apply makes the cloud match and writes the new reality back into the locked state file.

Recall before you leave

01 Explain to a teammate why the state file is both the source of truth and the biggest hazard in a Terraform setup.
02 What is drift, how do you detect it safely, and why can a routine apply make it dangerous?

Recap

Infrastructure as Code replaces console clicking with version-controlled declarations of desired state, making environments reproducible, reviewable, and auditable. The engine is a diff: plan refreshes against the live provider, compares your config to the recorded state file, and shows what it will create, change, or destroy; apply executes that diff and writes the new reality back. The state file is the identity map from your config to real resource IDs — which makes it the source of truth and, equally, the hazard: it can hold secrets in plaintext, it can be corrupted, and concurrent writes race. So it lives in a versioned, locked remote backend, never carries secrets you could instead push to a secrets manager, and is treated like a production database. Drift — reality wandering off after a manual change — surfaces in the next plan; detect it deliberately with refresh-only and decide intent before applying, because IaC will always reconcile back to the declaration and can silently revert an emergency fix. Lean toward immutable infrastructure so there is less to drift in the first place. Now when you see an unexpected change in a plan after an incident, you know to stop and ask: who made that change, should it stay, and does the config say so?
