How a 200-line formal spec found a real deployment bug in 0.8 seconds that five rounds of manual testing missed.
The bug
Two FreeBSD machines. Two GPUs. One Erlang cluster. We built a
deployment tool called Zed
that converges ZFS datasets across hosts via :rpc.call.
The five-phase runbook went like this:
- Bootstrap both Macs ✓
- Connect via Erlang distribution ✓
- Single-host converge ✓
- Two-host coordinated converge ✓
- Chaos test: make one host fail, verify the other rolls back
Phase 5 failed. Not the test — the system.
| Host | Dataset | Expected | Actual |
|---|---|---|---|
| mac-248 | chaos-good | rolled back | EXISTS (NOT rolled back) |
| mac-247 | chaos-bad | never created | never created |
When mac-247’s converge failed (intentional — we set a
quota of 1KB), mac-248’s successful converge was not undone.
The dataset chaos-good survived on mac-248. Partial
state. The one thing a coordinated deployment tool must never
leave behind.
The implementation looked reasonable:
def converge_coordinated(ir, opts) do
results = Enum.reduce_while(nodes, {:ok, []}, fn node, {:ok, succeeded} ->
case converge(node, ir, opts) do
{:ok, _} = result -> {:cont, {:ok, [{node, result} | succeeded]}}
{:error, _, _, _} = error -> {:halt, {:error, node, error, succeeded}}
end
end)
case results do
{:ok, succeeded} -> {:ok, Map.new(succeeded)}
{:error, failed_node, error, succeeded} ->
# Rollback succeeded nodes
rollback_results = Enum.map(succeeded, fn {node, _} ->
{node, rollback(node, ir, "@latest")}
end)
{:error, :partial_failure, %{rolled_back: rollback_results}}
end
end
See it? The rollback calls rollback(node, ir, "@latest")
— but @latest assumes a pre-converge snapshot
exists. For datasets that were created (not modified),
there is no snapshot to roll back to. The rollback silently fails.
The dataset survives.
This is the kind of bug that unit tests don’t catch, integration tests don’t catch, and code review doesn’t catch — because it only manifests in the interaction between two concurrent state machines on different hosts with different prior states.
The spec
We wrote a TLA+ specification in 200 lines. Not after the bug. Not as documentation. As a design tool — to figure out what the protocol should be before implementing the fix.
The key insight: each host’s dataset can be in one of two prior states before the protocol starts.
Init ==
/\ phase = "idle"
/\ hostState \in [Hosts -> {"absent", "clean"}]
TLC, the TLA+ model checker, explores all combinations. With two hosts, each dataset either absent or clean, that’s 4 initial states × every possible interleaving of prepare/converge/rollback actions.
The protocol has three phases.
Phase 1 — Prepare. For each host: if the dataset exists, snapshot it (rollback target). If absent, mark it for create (rollback = destroy).
PrepareExisting(h) ==
/\ phase = "prepare"
/\ hostState[h] = "clean"
/\ hostState' = [hostState EXCEPT ![h] = "prepared"]
/\ hostSnapshot' = [hostSnapshot EXCEPT ![h] = "pre-converge"]
/\ convergeAction' = [convergeAction EXCEPT ![h] = "modify"]
PrepareAbsent(h) ==
/\ phase = "prepare"
/\ hostState[h] = "absent"
/\ hostState' = [hostState EXCEPT ![h] = "prepared"]
/\ hostSnapshot' = [hostSnapshot EXCEPT ![h] = "none"]
/\ convergeAction' = [convergeAction EXCEPT ![h] = "create"]
Phase 2 — Converge. Each host either succeeds or fails, independently. After all report, if any failed → rollback ALL.
Phase 3 — Rollback. The action determines the method:
RollbackModified(h) ==
/\ hostState[h] = "converged"
/\ convergeAction[h] = "modify"
/\ hostSnapshot[h] = "pre-converge"
\* → zfs rollback to snapshot
RollbackCreated(h) ==
/\ hostState[h] = "converged"
/\ convergeAction[h] = "create"
/\ hostSnapshot[h] = "none"
\* → zfs destroy (undo the create)
The invariant
The property the bug violated, stated precisely:
NoPartialState ==
phase \in {"done", "failed"} =>
\/ \A h \in Hosts : hostState[h] = "verified"
\/ \A h \in Hosts : hostState[h] = "rolled_back"
At termination, either all hosts are verified (success) or all are rolled back (failure). No host is left in “converged” while another is “failed”. No partial state. Ever.
We added a second invariant specific to the create/modify distinction:
RollbackMatchesAction ==
\A h \in Hosts :
(hostState[h] = "rolled_back" /\ convergeAction[h] = "modify")
=> hostSnapshot[h] = "pre-converge"
If a host was rolled back and its action was “modify”,
it must have had a pre-converge snapshot. This rules out the
original bug’s failure mode: trying to zfs rollback
on a dataset that was created (no snapshot) instead of
zfs destroy.
The numbers
$ java -jar tla2tools.jar -config CoordinatedConverge.cfg CoordinatedConverge.tla
172 states generated, 124 distinct states found, 0 states left on queue.
Model checking completed. No error has been found.
Finished in 00s.
172 states. 124 distinct. Every interleaving of 2 hosts × 2 prior states × success/failure × rollback paths. Checked in under a second on a 2013 Mac Pro.
Zero errors. All three invariants hold. The protocol is correct.
The implementation
The Elixir implementation mirrors the spec. Three functions, three phases:
defp do_coordinated_converge(targets, opts) do
case prepare_all(targets) do
{:ok, preparations} ->
results = converge_all_targets(targets, opts)
failed = Enum.filter(results, fn {_, r} -> not match?({:ok, _}, r) end)
if failed == [] do
{:ok, Map.new(results)}
else
rollback_results = rollback_all(targets, preparations)
{:error, :partial_failure, %{failed: failed, rolled_back: rollback_results}}
end
{:error, _} = err -> err
end
end
prepare_all checks Dataset.exists? on each
host via RPC. Existing datasets get snapshotted; absent ones are
tagged :create. rollback_all reads the tag
and dispatches either Snapshot.rollback (modify) or
Dataset.destroy (create).
The re-test
Same chaos test. Same two Macs. Same intentional failure.
Phase 1 (prepare): both datasets absent → marked as action: :create
Phase 2 (converge): mac-248 succeeds, mac-247 fails (quota 1K)
Phase 3 (rollback): mac-248's chaos-good DESTROYED (create rollback)
chaos-good exists after rollback: false
*** P0 FIX CONFIRMED ***
The full runbook, clean pass:
R1: ✓ bootstrap — 8 slots, fingerprints on both Macs
R2: ✓ distribution — 15ms RPC round-trip
R3: ✓ single-host — 42ms converge
R4: ✓ two-host — 100ms coordinated converge
R5: ✓ P0 fix holds — 100ms chaos, rollback destroys created datasets
What TLA+ is and isn’t
TLA+ is not a test framework. It doesn’t run your code. It doesn’t mock your dependencies. It doesn’t generate test cases (though it can inform them).
TLA+ is a design tool. It lets you state what your protocol should do (the invariants) and then exhaustively checks every possible execution ordering to see if it does. The model checker finds counterexamples — specific sequences of actions that violate the invariant.
TLA+ is fast. Our 2-host protocol has 172 reachable states. The model checker verifies all of them in under a second. A 3-host protocol would have ~1000 states — still under a second. You hit minutes at 5+ hosts with complex state, and even then you’re checking things no amount of testing could cover.
TLA+ doesn’t replace testing. We still ran the runbook on real hardware. The spec proved the protocol is correct; the runbook proved the implementation matches the spec on real ZFS, real Erlang distribution, real FreeBSD.
What we’d do differently
We wrote the TLA+ spec after finding the bug. The spec took 45 minutes. The bug took four hours to surface (5 runbook phases, two machines, multiple ZFS permission issues before we even got to R5).
If we’d written the spec first — before the first line
of converge_coordinated — the spec would have
forced us to answer: “what happens when a dataset doesn’t
exist yet?” The Init state
hostState \in [Hosts -> {"absent", "clean"}] makes
the question inescapable. The model checker would have found the
counterexample (host A absent + created + success, host B absent
+ create failed → host A not rolled back) before we wrote any
Elixir.
45 minutes of TLA+ would have saved 4 hours of debugging.
The takeaway
Formal methods aren’t for academics. They’re for engineers who ship multi-host stateful systems and don’t want to discover their rollback protocol is broken at 2am on a Saturday.
The spec is 200 lines. The model checker runs in a second. The bug it would have caught cost us half a day. The math is simple.
$ wc -l specs/CoordinatedConverge.tla
200
$ time java -jar tla2tools.jar -config CoordinatedConverge.cfg CoordinatedConverge.tla
172 states generated, 124 distinct states found, 0 states left on queue.
Model checking completed. No error has been found.
real 0m0.8s
200 lines. 0.8 seconds. Zero errors. One bug caught that five rounds of manual testing missed.
Write the spec first.
Built with TLA+ (Lamport, 1999), TLC model checker (v2026), Elixir 1.18, Erlang/OTP 27, ZFS, FreeBSD 15.0, and two Mac Pros from 2013.