TLA+ Caught the Bug We Shipped

How a 200-line formal spec found a real deployment bug in 0.8 seconds that five rounds of manual testing missed.

The bug

Two FreeBSD machines. Two GPUs. One Erlang cluster. We built a deployment tool called Zed that converges ZFS datasets across hosts via :rpc.call. The five-phase runbook went like this:

Bootstrap both Macs ✓
Connect via Erlang distribution ✓
Single-host converge ✓
Two-host coordinated converge ✓
Chaos test: make one host fail, verify the other rolls back

Phase 5 failed. Not the test — the system.

Host	Dataset	Expected	Actual
mac-248	`chaos-good`	rolled back	EXISTS (NOT rolled back)
mac-247	`chaos-bad`	never created	never created

When mac-247’s converge failed (intentional — we set a quota of 1KB), mac-248’s successful converge was not undone. The dataset chaos-good survived on mac-248. Partial state. The one thing a coordinated deployment tool must never leave behind.

The implementation looked reasonable:

def converge_coordinated(ir, opts) do
  results = Enum.reduce_while(nodes, {:ok, []}, fn node, {:ok, succeeded} ->
    case converge(node, ir, opts) do
      {:ok, _} = result -> {:cont, {:ok, [{node, result} | succeeded]}}
      {:error, _, _, _} = error -> {:halt, {:error, node, error, succeeded}}
    end
  end)

  case results do
    {:ok, succeeded} -> {:ok, Map.new(succeeded)}
    {:error, failed_node, error, succeeded} ->
      # Rollback succeeded nodes
      rollback_results = Enum.map(succeeded, fn {node, _} ->
        {node, rollback(node, ir, "@latest")}
      end)
      {:error, :partial_failure, %{rolled_back: rollback_results}}
  end
end

See it? The rollback calls rollback(node, ir, "@latest") — but @latest assumes a pre-converge snapshot exists. For datasets that were created (not modified), there is no snapshot to roll back to. The rollback silently fails. The dataset survives.

This is the kind of bug that unit tests don’t catch, integration tests don’t catch, and code review doesn’t catch — because it only manifests in the interaction between two concurrent state machines on different hosts with different prior states.

The spec

We wrote a TLA+ specification in 200 lines. Not after the bug. Not as documentation. As a design tool — to figure out what the protocol should be before implementing the fix.

The key insight: each host’s dataset can be in one of two prior states before the protocol starts.

Init ==
    /\ phase = "idle"
    /\ hostState \in [Hosts -> {"absent", "clean"}]

TLC, the TLA+ model checker, explores all combinations. With two hosts, each dataset either absent or clean, that’s 4 initial states × every possible interleaving of prepare/converge/rollback actions.

The protocol has three phases.

Phase 1 — Prepare. For each host: if the dataset exists, snapshot it (rollback target). If absent, mark it for create (rollback = destroy).

PrepareExisting(h) ==
    /\ phase = "prepare"
    /\ hostState[h] = "clean"
    /\ hostState' = [hostState EXCEPT ![h] = "prepared"]
    /\ hostSnapshot' = [hostSnapshot EXCEPT ![h] = "pre-converge"]
    /\ convergeAction' = [convergeAction EXCEPT ![h] = "modify"]

PrepareAbsent(h) ==
    /\ phase = "prepare"
    /\ hostState[h] = "absent"
    /\ hostState' = [hostState EXCEPT ![h] = "prepared"]
    /\ hostSnapshot' = [hostSnapshot EXCEPT ![h] = "none"]
    /\ convergeAction' = [convergeAction EXCEPT ![h] = "create"]

Phase 2 — Converge. Each host either succeeds or fails, independently. After all report, if any failed → rollback ALL.

Phase 3 — Rollback. The action determines the method:

RollbackModified(h) ==
    /\ hostState[h] = "converged"
    /\ convergeAction[h] = "modify"
    /\ hostSnapshot[h] = "pre-converge"
    \* → zfs rollback to snapshot

RollbackCreated(h) ==
    /\ hostState[h] = "converged"
    /\ convergeAction[h] = "create"
    /\ hostSnapshot[h] = "none"
    \* → zfs destroy (undo the create)

The invariant

The property the bug violated, stated precisely:

NoPartialState ==
    phase \in {"done", "failed"} =>
        \/ \A h \in Hosts : hostState[h] = "verified"
        \/ \A h \in Hosts : hostState[h] = "rolled_back"

At termination, either all hosts are verified (success) or all are rolled back (failure). No host is left in “converged” while another is “failed”. No partial state. Ever.

We added a second invariant specific to the create/modify distinction:

RollbackMatchesAction ==
    \A h \in Hosts :
        (hostState[h] = "rolled_back" /\ convergeAction[h] = "modify")
            => hostSnapshot[h] = "pre-converge"

If a host was rolled back and its action was “modify”, it must have had a pre-converge snapshot. This rules out the original bug’s failure mode: trying to zfs rollback on a dataset that was created (no snapshot) instead of zfs destroy.

The numbers

$ java -jar tla2tools.jar -config CoordinatedConverge.cfg CoordinatedConverge.tla

172 states generated, 124 distinct states found, 0 states left on queue.
Model checking completed. No error has been found.
Finished in 00s.

172 states. 124 distinct. Every interleaving of 2 hosts × 2 prior states × success/failure × rollback paths. Checked in under a second on a 2013 Mac Pro.

Zero errors. All three invariants hold. The protocol is correct.

The implementation

The Elixir implementation mirrors the spec. Three functions, three phases:

defp do_coordinated_converge(targets, opts) do
  case prepare_all(targets) do
    {:ok, preparations} ->
      results = converge_all_targets(targets, opts)
      failed = Enum.filter(results, fn {_, r} -> not match?({:ok, _}, r) end)

      if failed == [] do
        {:ok, Map.new(results)}
      else
        rollback_results = rollback_all(targets, preparations)
        {:error, :partial_failure, %{failed: failed, rolled_back: rollback_results}}
      end

    {:error, _} = err -> err
  end
end

prepare_all checks Dataset.exists? on each host via RPC. Existing datasets get snapshotted; absent ones are tagged :create. rollback_all reads the tag and dispatches either Snapshot.rollback (modify) or Dataset.destroy (create).

The re-test

Same chaos test. Same two Macs. Same intentional failure.

Phase 1 (prepare): both datasets absent → marked as action: :create
Phase 2 (converge): mac-248 succeeds, mac-247 fails (quota 1K)
Phase 3 (rollback): mac-248's chaos-good DESTROYED (create rollback)

chaos-good exists after rollback: false
*** P0 FIX CONFIRMED ***

The full runbook, clean pass:

R1: ✓ bootstrap          — 8 slots, fingerprints on both Macs
R2: ✓ distribution       — 15ms RPC round-trip
R3: ✓ single-host        — 42ms converge
R4: ✓ two-host           — 100ms coordinated converge
R5: ✓ P0 fix holds       — 100ms chaos, rollback destroys created datasets

What TLA+ is and isn’t

TLA+ is not a test framework. It doesn’t run your code. It doesn’t mock your dependencies. It doesn’t generate test cases (though it can inform them).

TLA+ is a design tool. It lets you state what your protocol should do (the invariants) and then exhaustively checks every possible execution ordering to see if it does. The model checker finds counterexamples — specific sequences of actions that violate the invariant.

TLA+ is fast. Our 2-host protocol has 172 reachable states. The model checker verifies all of them in under a second. A 3-host protocol would have ~1000 states — still under a second. You hit minutes at 5+ hosts with complex state, and even then you’re checking things no amount of testing could cover.

TLA+ doesn’t replace testing. We still ran the runbook on real hardware. The spec proved the protocol is correct; the runbook proved the implementation matches the spec on real ZFS, real Erlang distribution, real FreeBSD.

What we’d do differently

We wrote the TLA+ spec after finding the bug. The spec took 45 minutes. The bug took four hours to surface (5 runbook phases, two machines, multiple ZFS permission issues before we even got to R5).

If we’d written the spec first — before the first line of converge_coordinated — the spec would have forced us to answer: “what happens when a dataset doesn’t exist yet?” The Init state hostState \in [Hosts -> {"absent", "clean"}] makes the question inescapable. The model checker would have found the counterexample (host A absent + created + success, host B absent + create failed → host A not rolled back) before we wrote any Elixir.

45 minutes of TLA+ would have saved 4 hours of debugging.

The takeaway

Formal methods aren’t for academics. They’re for engineers who ship multi-host stateful systems and don’t want to discover their rollback protocol is broken at 2am on a Saturday.

The spec is 200 lines. The model checker runs in a second. The bug it would have caught cost us half a day. The math is simple.

$ wc -l specs/CoordinatedConverge.tla
200

$ time java -jar tla2tools.jar -config CoordinatedConverge.cfg CoordinatedConverge.tla
172 states generated, 124 distinct states found, 0 states left on queue.
Model checking completed. No error has been found.
real    0m0.8s

200 lines. 0.8 seconds. Zero errors. One bug caught that five rounds of manual testing missed.

Write the spec first.

Built with TLA+ (Lamport, 1999), TLC model checker (v2026), Elixir 1.18, Erlang/OTP 27, ZFS, FreeBSD 15.0, and two Mac Pros from 2013.