Four Months, Eight Hundred Lines

April 2026 · eXMC · NUTS · proper_statem

The previous post in this series — The Release That Never Seized — ended with a four-line counter-example. Initialize a resource, seize it, release it, release it again. PropEr generated the sequence, the engine accepted the second release, and the shrunk diagnosis was four lines: a release of an ungranted job. A protocol violation. Crisp. The shape every developer wants to see when a test fails.

This post is about two bugs in the eXMC NUTS sampler that have no four-line counter-example. They cannot be shrunk to a minimal sequence. The shrinker would shrug. Every command we could generate is a legal command. Every result it returns is a legal result. The bugs live in the distribution of results across many calls, not in any single call.

They took four months to find by hand. The proper_statem models that codify the regression checks are eight hundred lines.

The Symptom

NUTS — the No-U-Turn Sampler — is the workhorse of modern Bayesian inference. It builds a binary tree of trajectories through parameter space using Hamiltonian dynamics, doubles the tree until the trajectory turns back on itself, and selects a final proposal from all the points it visited via a multinomial draw weighted by the exponentials of their log-probabilities.
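
The multinomial draw is the piece that matters for what follows. For reference, here is one way to draw a point from a list of {point, log_weight} pairs, stably, in log space; a sketch, not eXMC's API, with every name illustrative:

def multinomial_select(weighted_points) do
  # Shift by the max log-weight so exp/1 cannot overflow.
  max_lw = weighted_points |> Enum.map(fn {_p, lw} -> lw end) |> Enum.max()
  weights = Enum.map(weighted_points, fn {p, lw} -> {p, :math.exp(lw - max_lw)} end)
  total = weights |> Enum.map(fn {_p, w} -> w end) |> Enum.sum()
  draw = :rand.uniform() * total

  # Walk the cumulative sum until the draw falls inside a point's slice.
  result =
    Enum.reduce_while(weights, draw, fn {p, w}, acc ->
      if acc <= w, do: {:halt, {:chosen, p}}, else: {:cont, acc - w}
    end)

  case result do
    {:chosen, p} -> p
    # Unreachable except for float round-off; fall back to the last point.
    _ -> weighted_points |> List.last() |> elem(0)
  end
end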

The eXMC implementation passed every correctness test we could write. Sample means matched conjugate posteriors. Variance estimates matched analytic answers. Trace plots looked Gaussian. Acceptance rates sat near the 0.65 target. Divergence counts were low. The samples were valid.

And the chains were running at roughly half the speed they should have been.

ESS — effective sample size — is the only metric that matters for an MCMC sampler. A chain that produces ten thousand samples with an autocorrelation of 0.9 has the inferential power of roughly a thousand independent draws. On the medium model — a hierarchical Gaussian in eight dimensions — PyMC was producing roughly three independent draws for every one that eXMC produced, on the same problem, in the same wall-clock time. After the two-line fix described below, the delta on a single chain, five seeds, was:

Model    Dim   ESS/s improvement   vs PyMC after fix
simple     4   +63%                0.81×
medium     8   +192%               1.90×
stress    22   +81%                1.16×

The duplicate rate — the fraction of NUTS proposals that returned the starting position unchanged — dropped from 37.7% to 6.5%. PyMC’s rate on the same medium model was 7.8%. The samples were correct before the fix and correct after. The information content per sample was wrong before the fix and right after.

Wrong how? After four months of profiling, gradient checking, mass matrix tuning, and re-derivation of the leapfrog integrator, the answer was two lines of code in the tree builder.

Smoking Barrel One: The Capped Leaf Weight

At the leaf of every NUTS subtree, the algorithm computes a weight for the new trajectory point: the probability density that the multinomial sampler will use to choose between competing proposals. The standard formula is:

delta = joint_logp_new - joint_logp_initial
log_weight = delta

The eXMC code, since the first commit, had:

delta = joint_logp_new - joint_logp_initial
log_weight = min(0.0, delta)

The cap came from confusing two formulas. The acceptance probability for the dual-averaging step-size adaptation is min(1, exp(delta)) — which becomes min(0, delta) in log space. That cap is correct for the adaptation update. It is wrong for the multinomial weight.

The effect: every time a new trajectory point landed in a region of higher probability than the start — that is, every time the leapfrog discovered a better place to be — the multinomial weight was clamped at exp(0) = 1. The trajectory point that was supposed to outweigh the initial point by a factor of three was tied with the initial point. The trajectory point that was supposed to outweigh it by a factor of twenty was tied with the initial point. The multinomial draw became a coin flip when it should have been a near-certainty.
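
The arithmetic is easy to see in isolation. With a new point three times more probable than the start, the uncapped weight makes it a 3-to-1 favourite in a two-point multinomial; the capped weight makes it a coin flip. A sketch, not eXMC code:

delta = :math.log(3.0)   # new point is 3x more probable than the start

# Uncapped: the new point wins 3 / (3 + 1) = 75% of draws.
p_uncapped = :math.exp(delta) / (:math.exp(delta) + 1.0)

# Capped: the log-weight clamps to 0.0, the weight to exp(0) = 1, a 50/50 flip.
lw_capped = min(0.0, delta)
p_capped = :math.exp(lw_capped) / (:math.exp(lw_capped) + 1.0)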

The chain still moved — sometimes the coin flip went the right way. But across many trees, the initial point won the multinomial far too often. The proposal distribution stayed close to where the chain already was, the duplicate rate inflated to nearly forty percent, and the chain crawled where it should have walked. Like Chev Chelios off his adrenaline drip, it was technically alive and technically moving, but it was not actually going anywhere. Crank is the right film here for a reason — the whole point of NUTS is that it keeps the momentum up. Cap the weights and the momentum flatlines.

Smoking Barrel Two: The Merge That Wouldn’t Let Go

NUTS builds its tree by recursive doubling. At each step it builds a new subtree on the left or right of the current trajectory and merges it in. There are two merge functions: merge_subtrees, which combines two equal-depth subtrees into a deeper one (the inner merge), and merge_trajectories, which adds a fresh subtree onto the growing trajectory at the outer doubling level.

Both functions select a single proposal from the merged result. The selection rule for the inner merge is balanced — the second subtree wins with probability w_2 / (w_1 + w_2), where each w is a subtree's total weight, the exponential of its log-sum-weight. The selection rule for the outer merge, in Stan and PyMC, is biased progressive: the new subtree replaces the trajectory with probability min(1, w_subtree / w_trajectory). The distinction matters because the trajectory has already accumulated many points and the subtree has only a few; the biased rule corrects for the asymmetry.
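
In log space the two rules are one line each. A sketch, with lsw_a and lsw_b bound to the log-sum-weights of the two sides (all names illustrative):

# Stable log_sum_exp of the two sides.
m = max(lsw_a, lsw_b)
log_sum = m + :math.log(:math.exp(lsw_a - m) + :math.exp(lsw_b - m))

# Inner merge, balanced: side b wins with probability w_b / (w_a + w_b).
p_balanced = :math.exp(lsw_b - log_sum)

# Outer merge, biased progressive: the new subtree replaces the trajectory
# with probability min(1, w_subtree / w_trajectory).
p_biased = :math.exp(min(0.0, lsw_subtree - lsw_trajectory))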

The eXMC implementation used the balanced rule for both merges.

The effect, again, was that the initial point was sticky. At each outer doubling step, the existing trajectory — which contained the start position — survived the merge with too high a probability. The biased rule would let new subtrees displace the trajectory whenever they were even slightly better; the balanced rule gave the trajectory a fifty-percent floor regardless. Across hundreds of merges in a typical sampling run, the start position accumulated selection probability that should have flowed to better proposals.

The two bugs reinforced each other. The first capped the weight of good leaves. The second over-protected the trajectory that contained mediocre leaves. Together they produced the 37.7% duplicate rate quoted above, and they dropped the effective sample size by the fractions in the table at the top of this post.

What Tests Were Already in Place

Before these bugs were found, the eXMC test suite ran roughly two hundred ExUnit cases. Each one set up a model, ran the sampler, and asserted on the result. The assertions were of three kinds:

Assertion type          Example                          Why it missed the bugs
Numerical correctness   mean(samples) ≈ posterior_mean   Both bugs preserve the stationary distribution. The mean was right.
Acceptance rate         accept_rate ∈ [0.55, 0.85]       Acceptance rate measures dual-averaging quality, not multinomial bias.
Divergence count        divergences < 5%                 The bugs do not cause divergences. Divergences come from numerical instability.
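
Concretely, the point tests looked like this: a sketch, with the model setup and the sample_chain/2 and mean/1 helpers as illustrative stand-ins for the real suite:

# Both bugs pass this: the stationary distribution is untouched.
test "posterior mean matches the conjugate answer" do
  samples = sample_chain(standard_normal_model(), draws: 2_000, seed: 42)
  assert_in_delta mean(samples), 0.0, 0.05
end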

None of these assertions can fail in the presence of these bugs. The bugs are biases in the mixing rate, not errors in the limit. A biased sampler with infinite samples reaches the same stationary distribution as an unbiased one. The information loss is in the route, not the destination. The route is what nobody was testing.

Property-based tests — the kind that generate random parameters and check Little’s Law style invariants — were equally silent. You can generate a thousand random posterior shapes and verify that the sampler returns the right mean for each one. The bugs do not violate any per-sample invariant. They violate a property of the distribution of selections across many trees, and that distribution is a population statistic, not an individual one.

The Postcondition That Catches Distributional Bugs

PropEr’s proper_statem generates sequences of operations against an abstract state machine. The abstract state can carry whatever you want. For sim_ex, the abstract state was the calendar, the entity map, and the resource holder list — logical things, with logical postconditions. busy ≤ capacity. grants ≥ releases. Pass or fail on each operation.

For NUTS, the abstract state is statistical. It accumulates:

%{
  builds: 0,
  accept_rates: [],
  duplicate_count: 0,
  divergent_count: 0
}

Each command in the sequence builds a new tree from a freshly sampled momentum. The result feeds the accumulator: another acceptance rate appended, the duplicate counter incremented if the proposal happened to equal the starting position, the divergent counter incremented if the trajectory blew up. After the model has built ten or twenty trees, a check_statistics command fires, and the postcondition checks not the most recent build but the empirical distribution across all of them:

def postcondition(_, {:call, _, :check_statistics, _}, result) do
  result.mean_accept_rate >= 0.1 and
    result.duplicate_rate < 0.50
end
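
The result that the postcondition inspects is a reduction over the accumulator. A sketch, using the state map above (the divergence_rate field is an extra illustration, not asserted on):

def check_statistics(acc) do
  n = max(acc.builds, 1)

  %{
    mean_accept_rate: Enum.sum(acc.accept_rates) / n,
    duplicate_rate: acc.duplicate_count / n,
    divergence_rate: acc.divergent_count / n
  }
end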

This is the new shape of postcondition. Not "after this command, the resource is consistent" but "after these N commands, the empirical distribution is healthy." The bug PropEr is hunting is not a single illegal transition. It is a population statistic that drifts when the algorithm is wrong.

The duplicate rate is the diagnostic. A clean NUTS sampler on a standard normal target produces a duplicate rate well under fifteen percent. A sampler with capped log-weights and balanced outer merge produces a duplicate rate near forty percent. The postcondition catches the difference. The diagnosis is not release(3) without seize(3). It is duplicate_rate = 0.378, mean_accept = 0.65, n = 20 — a fingerprint, not a counter-example. Two statistics that are individually plausible but jointly incompatible with a correctly implemented multinomial.

The Mechanic and the Transporter

The eXMC test suite now contains two proper_statem models for the NUTS tree builder. Together they are 789 lines. The division of labour is Jason Statham’s, whether he knows it or not. The first model is the mechanic — it takes subtrees apart and bolts them back together, and its postconditions verify that every screw is accounted for. The second model is the transporter — it moves real Hamiltonian momentum through real phase space and checks that the package arrives intact.

The transporter model, in particular, has to obey Frank Martin’s rules. Rule one: never change the deal — energy conservation is sacred, the Hamiltonian must be preserved step to step through the leapfrog integrator. Rule two: no names — the integrator treats every dimension identically, regardless of what the user called it. Rule three: never open the package — don’t peek at intermediate trajectory points, just push the momentum forward and hand the final package to the multinomial. When any of these rules is broken, the accept rate drifts, the duplicate rate inflates, and the transporter model catches it in the accumulator state.
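
Rule one is worth making concrete. Below is one leapfrog step for the Hamiltonian H(q, p) = U(q) + |p|²/2, a generic sketch of the standard integrator rather than eXMC's code; grad_u is the gradient of the negative log-density:

# a*x + y, elementwise, for lists of floats.
defp axpy(a, x, y), do: Enum.zip_with(x, y, fn xi, yi -> a * xi + yi end)

def leapfrog(q, p, eps, grad_u) do
  p_half = axpy(-eps / 2, grad_u.(q), p)            # half step in momentum
  q_new = axpy(eps, p_half, q)                      # full step in position
  p_new = axpy(-eps / 2, grad_u.(q_new), p_half)    # half step in momentum
  {q_new, p_new}
end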

Model            Lines   Layer                                What it locks down
Merge protocol   376     merge_subtrees, merge_trajectories   log-sum-weight additivity, rho additivity, depth tracking, n_steps accounting, divergent monotonicity
Tree builder     413     full Tree.build with leapfrog        structural invariants per build, accept rate distribution across builds, duplicate rate across builds

The merge model generates synthetic subtrees with random log-weights, random momenta, and random rho vectors, then runs random sequences of merge_subtrees and merge_trajectories calls against them. No leapfrog, no gradients, no joint log-probabilities — just the merge protocol in isolation. Three hundred adversarial sequences run in a few seconds. The postconditions verify that every merge satisfies the algebraic invariants that the merge formulas imply. log_sum_exp(left.lsw, right.lsw) == merged.lsw. rho(left) + rho(right) == rho(merged). n_steps(merged) == n_steps(left) + n_steps(right). These caught two unrelated merge bugs during their first run, both in the rho-tracking code.
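
As a sketch of what those postconditions look like in code (field names follow the prose; the approx?/2 tolerance helper is an assumption):

defp approx?(a, b), do: abs(a - b) <= 1.0e-9 * max(1.0, max(abs(a), abs(b)))

def merge_invariants?(left, right, merged) do
  # Stable log_sum_exp of the two sides' log-sum-weights.
  m = max(left.lsw, right.lsw)
  lse = m + :math.log(:math.exp(left.lsw - m) + :math.exp(right.lsw - m))

  rho_ok =
    left.rho
    |> Enum.zip_with(right.rho, &(&1 + &2))
    |> Enum.zip(merged.rho)
    |> Enum.all?(fn {a, b} -> approx?(a, b) end)

  approx?(merged.lsw, lse) and rho_ok and
    merged.n_steps == left.n_steps + right.n_steps and
    (merged.divergent? or not (left.divergent? or right.divergent?))
end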

The tree model is the statistical one. It builds full trees against a standard normal target with random dimensions, random step sizes, and random max-depths, accumulates the per-build statistics in process state, and the check_statistics command exposes the accumulator to the postcondition. Each property test runs a dozen or so builds before a check, and the property runs a hundred such sequences. About two thousand trees per property invocation, accumulated into a hundred separate statistical fingerprints.

A sampler with the capped-weight bug produces fingerprints with accept rates near 0.65 and duplicate rates near 0.4. The postcondition trips. PropEr shrinks the failing sequence — not to a four-line counter-example, because the bug is not in any single command, but to the smallest dimension and shortest depth at which the statistical deviation is reproducible. The diagnosis is the smallest model configuration in which the population statistic still shows the fingerprint.

Why the Statistical Postcondition Is Hard

This style of test is rare for a reason. The signal-to-noise ratio is much lower than for a logical postcondition. busy ≤ capacity is true or false on every operation; one violation is one bug. duplicate_rate < 0.15 is a population statistic with sampling noise; you need enough samples to distinguish the bug fingerprint from the natural variance. Set the bound too tight and random epsilon choices flag false positives. Set it too loose and real bugs sail through.
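
The scale of that noise is easy to estimate. Treat each build's duplicate flag as a Bernoulli draw; the empirical rate over n builds then has standard deviation sqrt(p(1 - p) / n). At the healthy rate and the model's sample size:

# True rate p = 0.15, n = 20 builds per check:
sd = :math.sqrt(0.15 * 0.85 / 20)   # ≈ 0.080, so two sigma is ±0.16

A two-sigma band around a healthy 0.15 already reaches past 0.30 at twenty builds, which is why tightening the bound means more builds per check, not just a smaller constant.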

The current bound, duplicate_rate < 0.50, is loose. It catches catastrophic multinomial bias — the kind where the sampler is essentially stuck. It would not have caught the 0.378 fingerprint in production as a single point estimate; at twenty builds per sequence the empirical rate fluctuates around 0.4, and it is the sequences that fluctuate past 0.50 that trip the property. It is a smoke alarm, not a thermostat. The next iteration tightens it and adds a comparative check against the theoretical acceptance distribution recorded per merge in the merge model:

# Per-merge: empirical accept (0/1) and theoretical (min(1, exp(delta_lsw)))
# Postcondition: |empirical_rate - theoretical_rate| within 2 sigma,
#   where sigma = sqrt(theoretical_rate * (1 - theoretical_rate) / n)
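
A sketch of that check, assuming the merge model records per-merge accept outcomes and using the mean theoretical rate as a binomial approximation (names are illustrative):

def rates_consistent?(empirical_rate, theoretical_rate, n) do
  sigma = :math.sqrt(theoretical_rate * (1.0 - theoretical_rate) / n)
  abs(empirical_rate - theoretical_rate) <= 2.0 * sigma
end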

That is the postcondition that catches the biased outer merge in isolation. It compares the rate at which the model actually accepts new subtrees against the rate the algorithm specifies. The current merge model records both numbers but does not yet assert on the ratio. The next phase does.

The Lesson

Three dimensions of testing again, restated for distributional bugs:

Dimension                                         What varies                               What it catches                                   Tool
Point tests                                       Nothing (seed=42)                         Regressions in numerical answers                  ExUnit
Property tests                                    Random parameters                         Per-sample invariants (mean, variance, support)   PropCheck
State machine tests (logical postcondition)       Random sequences                          Per-operation protocol invariants                 proper_statem
State machine tests (statistical postcondition)   Random sequences with accumulator state   Distributional fingerprints across many calls     proper_statem with statistical postconditions

The fourth row is what these two NUTS models add to the toolbox. The postcondition is not this transition is legal. It is after these N transitions, the empirical distribution matches the algorithm’s specification within sampling noise. The accumulator state is the entire point: PropEr generates the operation sequences, the model accumulates the per-operation fingerprints, and the postcondition compares accumulator against specification.

The sim_ex bug was a logical violation: a release with no preceding seize. PropEr’s shrinker reduced it to four operations and the fix was a guard clause. The eXMC bugs were distributional violations: multinomial selections drifting from their specified distribution. PropEr’s shrinker cannot reduce them to a single bad operation, because no single operation is bad. The diagnosis is a population fingerprint, the shrinker reduces the model dimension rather than the operation count, and the fix is a pair of one-line formula corrections in the tree builder.

Both bugs were silent for months. Both are now regression-tested by proper_statem models that run in seconds. Eight hundred lines of test code lock down four months of debugging. The next time someone touches the multinomial weights or the merge protocol, the fingerprint will fire on the next test run.

The lesson is not that proper_statem is universal. It is that the postcondition layer is more general than “assert on the result.” You can assert on aggregated state, on empirical distributions, on differences between observed and theoretical rates. Anywhere a bug expresses itself as a population statistic rather than a single bad output, the statem accumulator is the right place to catch it. MCMC samplers are full of such bugs. So are caches, schedulers, load balancers, anything where the output of any one operation is plausible and the bug shows up only in the long-run rate.

Jason Statham still always delivers. Sometimes the delivery is a four-line counter-example. Sometimes it is a histogram. Sometimes — as in this case — it is a population fingerprint accumulated across twenty tree builds, a duplicate rate of 0.378 that should have been 0.065, and a pair of one-line fixes to the multinomial weight. The franchise expands.


The two NUTS proper_statem models are at test/nuts/statham_merge_test.exs and test/nuts/statham_tree_test.exs. The fixes for the two bugs — uncapping the leaf log-weight and switching the outer merge to biased progressive sampling — are two single-line commits. Together they raised eXMC’s ESS-per-second on the medium model from 0.33× PyMC to 1.90× PyMC.