We had 77 tests. All passing. All using seed 42.
Seed 42 is an anecdote. It tells you that the simulation works for one particular sequence of random numbers on one particular Tuesday. It does not tell you that the simulation is correct.
Correctness, for a stochastic system, is not a trajectory. It is an invariant — a property that holds regardless of which trajectory the random numbers choose. The question is not “does the barbershop produce 12,449 customers with seed 42?” The question is: for any stable M/M/1 queue, does the mean number of entities in the system equal the arrival rate times the mean time in the system?
That is Little’s Law. It was proved in 1961. It holds for every stable queueing system regardless of arrival distribution, service distribution, or scheduling discipline. If your simulation violates it, your simulation is wrong. If your test suite does not check it, your test suite is incomplete.
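As a back-of-the-envelope check: with the barbershop's arrival rate of lambda = 1/18 (from the point test described below) and an illustrative service rate of mu = 1/12 (my assumption, not a value from the test suite), Little's Law L = λW pins down the mean number in the system:

```elixir
# Little's Law, L = lambda * W, for a stable M/M/1 queue.
# lambda = 1/18 is the barbershop's arrival rate; mu = 1/12 is an
# illustrative service rate chosen here, not a value from the suite.
lambda = 1.0 / 18.0
mu = 1.0 / 12.0
w = 1.0 / (mu - lambda)   # theoretical mean time in system, ≈ 36.0
l = lambda * w            # mean number in system, ≈ 2.0
IO.puts("W=#{w} L=#{l}")
```

If the simulation reports a mean population that disagrees with λW, one of the two measurements is wrong, regardless of the seed.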
Ours was incomplete.
## The Method
Property-based testing replaces fixed inputs with random generators and fixed assertions with invariants. Instead of “run the barbershop with lambda=1/18 and check that 12,449 customers complete,” you say: “generate a random M/M/1 queue with arrival rate between 0.2 and 2.0, service rate between 0.3 and 10.0, reject any configuration where utilization exceeds 0.9, run the simulation, and verify that observed sojourn time is within 20% of the theoretical value.”
Then you do that thirty times. With different random parameters each time.
The harness is a single short Elixir function:
```elixir
def check(n_trials, gen_fn, body_fn) do
  base_seed = :erlang.unique_integer([:positive])

  Enum.each(1..n_trials, fn i ->
    seed = base_seed + i * 7919
    params = gen_fn.(seed)

    try do
      body_fn.(params)
    rescue
      e ->
        reraise("Trial #{i}, seed=#{seed}\n" <>
                  inspect(params) <> "\n" <>
                  Exception.message(e), __STACKTRACE__)
    end
  end)
end
```
No external dependencies. No StreamData. No PropEr. Just a loop, a seed, and a try/rescue that tells you exactly which parameters failed. The generator rejects unstable configurations (rho ≥ 0.9) because those need millions of events to converge, and we are testing the invariant, not our patience.
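In use, the loop reads like this. `PropDemo` and the toy list property below are stand-ins for the real module and generators, kept self-contained so the shape is visible:

```elixir
defmodule PropDemo do
  # Same shape as the harness above: seed-indexed trials, one body per trial.
  def check(n_trials, gen_fn, body_fn) do
    base_seed = :erlang.unique_integer([:positive])

    Enum.each(1..n_trials, fn i ->
      seed = base_seed + i * 7919
      body_fn.(gen_fn.(seed))
    end)
  end
end

# Toy property: reversing a list twice is the identity.
PropDemo.check(
  30,
  fn seed ->
    rs = :rand.seed_s(:exsss, {seed, seed, seed})
    {list, _rs} =
      Enum.map_reduce(1..10, rs, fn _, acc -> :rand.uniform_s(100, acc) end)
    list
  end,
  fn list ->
    ^list = list |> Enum.reverse() |> Enum.reverse()
  end
)

IO.puts("30 trials passed")
```

The generator is an ordinary function of the seed, so a failing trial's parameters can always be regenerated from the printed seed.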
## Seven Properties
| Property | Invariant | Trials | Tolerance |
|---|---|---|---|
| Flow conservation | departures ≤ arrivals | 50 | Exact |
| Little’s Law | W_obs ≈ 1/(μ - λ) | 30 | 20% |
| Throughput convergence | departures/time ≈ λ | 40 | 15% |
| Determinism | same seed = same result | 30 | Exact |
| Cross-mode equivalence | Engine = ETS | 20 | Exact |
| DSL flow conservation | arrivals = completed + in_progress | 50 | Exact |
| Edge cases | decide(0.0) routes nobody, combine(1) = identity | 80 | Exact |
Three hundred and fifty trials across random parameter spaces. The stochastic properties (Little’s Law, throughput) use generous tolerances — 15-20% — because a single simulation run has Monte Carlo noise. The structural properties (flow conservation, determinism, edge cases) are exact. There is no tolerance for losing a customer between arrival and departure.
## What Little’s Law Tests
The generator produces random M/M/1 parameters:
```elixir
defp gen_mm1_loop(rs, seed) do
  {lambda_inv, rs} = uniform_range(rs, 0.5, 5.0)
  {mu_inv, rs} = uniform_range(rs, 0.1, 3.0)
  lambda = 1.0 / lambda_inv
  mu = 1.0 / mu_inv
  rho = lambda / mu

  if rho >= 0.80 or rho < 0.20 do
    gen_mm1_loop(rs, seed)  # reject, try again
  else
    %{w_theory: 1.0 / (mu - lambda), ...}
  end
end
```
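The `uniform_range/3` helper is not shown in the excerpt; a plausible implementation over `:rand`'s explicit-state API, matching the `{value, state}` shape used above, would be:

```elixir
defmodule Gen do
  # Hypothetical uniform_range/3: draw a float in [lo, hi) and thread
  # the :rand state through, so trials stay reproducible from the seed.
  def uniform_range(rs, lo, hi) do
    {u, rs} = :rand.uniform_s(rs)  # u in [0.0, 1.0)
    {lo + u * (hi - lo), rs}
  end
end

{x, _rs} = Gen.uniform_range(:rand.seed_s(:exsss, {1, 2, 3}), 0.5, 5.0)
true = x >= 0.5 and x < 5.0
```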
Each trial generates a queue with utilization between 0.2 and 0.8. The theoretical mean sojourn time is W = 1/(μ - λ). The simulation runs for 50,000-80,000 time units, measures the observed mean wait plus mean service time, and checks that the relative error is under 20%.
Twenty percent sounds generous. It is. A single simulation replication with 50,000 time units can easily be 10-15% off the theoretical value due to Monte Carlo noise. The tolerance accounts for this. A tighter tolerance would produce false negatives — tests that fail because the simulation is stochastic, not because it is wrong.
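Concretely, the assertion is a relative-error bound; the observed value below is made up for illustration:

```elixir
w_theory = 36.0  # 1/(mu - lambda) for the generated parameters
w_obs = 39.4     # hypothetical measured mean sojourn time
rel_err = abs(w_obs - w_theory) / w_theory
true = rel_err < 0.20  # within the 20% tolerance
```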
All thirty trials passed. Erlang’s 1909 formula and sim_ex’s 2026 engine agree across the full utilization range.
## What Cross-Mode Equivalence Tests
sim_ex has two Elixir execution modes. The Engine stores entity state in an Elixir Map. The ETS mode stores it in an Erlang Term Storage table. Both should produce identical results for the same seed.
They do. For twenty random configurations, the event count, arrival count, departure count, and mean wait time are bitwise identical between modes. Not approximately equal. Identical. This is a stronger claim than most simulation frameworks make: the storage backend is truly invisible to the simulation logic.
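Bitwise-identical results across backends presuppose a deterministic random stream. In Erlang/Elixir, `:rand`'s explicit-state API provides exactly that; a minimal illustration (not sim_ex code):

```elixir
# Same algorithm + same seed => the same stream of draws, so any two
# backends that only consume this stream must agree bit-for-bit.
draw5 = fn seed ->
  rs = :rand.seed_s(:exsss, {seed, seed, seed})
  {xs, _rs} = Enum.map_reduce(1..5, rs, fn _, acc -> :rand.uniform_s(acc) end)
  xs
end

true = draw5.(42) == draw5.(42)
```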
## The Bug That Point Tests Missed
While building the utilization property test, we discovered that `Sim.Resource.busy_time` was never incremented. It was initialized to 0.0 in `init/1` and never updated by any `handle_event` clause. The utilization statistic, computed as `busy_time / last_event_time`, was always zero.
This bug had been present since the initial release. Seventy-seven point tests passed around it because none of them checked utilization against the theoretical value. They checked arrival counts, departure counts, mean wait times, queue lengths. Utilization was reported but never verified.
The property test caught it immediately. “Utilization should be within 15% of rho” fails when utilization is zero and rho is 0.5. The fix: track when each job begins service, accumulate elapsed time on departure, report the ratio. Five lines of code that had been missing for four months.
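The shape of the fix, sketched with hypothetical field names (sim_ex's actual struct and callbacks may differ):

```elixir
defmodule BusyTime do
  # Record when service begins; accumulate elapsed time when it ends.
  def start_service(state, now), do: Map.put(state, :service_start, now)

  def end_service(state, now) do
    elapsed = now - state.service_start
    Map.update!(state, :busy_time, &(&1 + elapsed))
  end
end

state = %{busy_time: 0.0, service_start: nil}
state = BusyTime.start_service(state, 10.0)
state = BusyTime.end_service(state, 14.0)

# Busy for 4.0 of 20.0 simulated time units => utilization 0.2.
utilization = state.busy_time / 20.0
```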
This is why property tests exist. Not because they are more rigorous than point tests — they are. But because they ask different questions. A point test asks “does the output match this number?” A property test asks “does the output obey this law?” The law catches bugs that no specific number can.
## Edge Cases as Properties
The boundary behavior of DSL verbs is not stochastic. It is deterministic and exact:
- `decide(0.0, :label)` routes zero parts to the label. Not “approximately zero.” Zero.
- `decide(1.0, :label)` routes every part to the label.
- `combine(1)` is the identity; flow conservation is preserved.
- `split(3)` followed by `combine(3)` conserves entity count.
These are tested across twenty random seeds each. The seed affects arrival times and service durations but should not affect the routing fraction when the probability is 0 or 1. Any failure would indicate that the DSL compiler leaks randomness into deterministic decisions.
None failed. The DSL compiler is clean.
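The probability-0 and probability-1 cases are exact because of how uniform draws work, not because of sampling luck. A stand-in router (not sim_ex's actual `decide/2`) makes the argument concrete:

```elixir
# Stand-in router: route when the uniform draw falls below the threshold.
route = fn p, u -> if u < p, do: :label, else: :pass end

# :rand.uniform/0 draws from [0.0, 1.0), so p = 0.0 can never match
# and p = 1.0 always does, independent of the seed.
draws = for _ <- 1..1_000, do: :rand.uniform()
true = Enum.all?(draws, fn u -> route.(0.0, u) == :pass end)
true = Enum.all?(draws, fn u -> route.(1.0, u) == :label end)
```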
## What SimPy Does Not Have
SimPy has 15,000 GitHub stars. It has three tests in its test suite that verify statistical properties. The rest are structural — “does this generator yield the right events?” No Little’s Law. No flow conservation across random parameters. No cross-backend equivalence.
Arena has no automated tests at all. It is a commercial product that sells confidence through market share, not through verified invariants.
This is the opening that property tests create. Not a speed advantage — we have that, and it is documented honestly at 14x. A correctness advantage. A framework that proves its invariants hold across random parameter spaces is making a claim that no other open-source DES framework makes.
## The Lesson
A simulation is a Monte Carlo experiment. Its outputs are random. But its invariants are not. Little’s Law does not depend on the seed. Flow conservation does not depend on the arrival rate. Determinism does not depend on the utilization.
Point tests check outputs. Property tests check invariants. The invariants were always there — in the queueing theory textbooks, in Erlang’s 1909 paper, in Little’s 1961 proof. We just were not testing them.
Now we are. 112 tests. 350 trials. One bug found. Seed 42 is retired.
Property tests are at github.com/borodark/sim_ex/test/property_test.exs. The harness is 27 lines. The generators are 50. The properties are the interesting part.