We raced SimPy. The per-replication numbers were honest but unremarkable: 2–3x in Elixir, 8–10x in Rust. Then we ran one thousand replications in parallel and the margin became thirty to one.
This is the story of how we measured the wrong thing, measured it honestly, and then found the right thing to measure.
## The Per-Rep Race
Same models. Same parameters. Sequential execution — Python runs first, then Elixir, each getting the full machine. An 88-core Xeon. Python 3.12.3, SimPy 4.1.1. Elixir 1.18.3, OTP 27.
| Model (200K time units) | SimPy | sim_ex Elixir | sim_ex Rust |
|---|---|---|---|
| Barbershop | 127ms | 55ms (2.3x) | 15ms (8.5x) |
| Job Shop (5 stages) | 2,479ms | 1,200ms (2.1x) | — |
| Rework Loop (15%) | 601ms | 225ms (2.7x) | — |
Elixir is two to three times faster than SimPy on a single replication. The Rust NIF is eight times faster. These are real numbers, measured honestly and reported with the system load average, because run-to-run benchmark variance is 30–40%.
These are also the wrong numbers.
## The Wrong Numbers
A single simulation run tells you one trajectory through the state space. It does not tell you the distribution of outcomes. It does not give you confidence intervals. It does not propagate input uncertainty. Averill Law calls the analysis that does this — hundreds or thousands of replications with different parameter draws — “rarely done because it’s too expensive.”
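What those replications buy is a distribution instead of a point estimate. A minimal sketch of the downstream analysis, assuming we already have one output per replication — `RepStats` and the `waits` data are illustrative, not part of sim_ex:

```elixir
# Sketch: turn independent replication outputs into a 95% confidence
# interval. The module name and the data are made up for illustration.
defmodule RepStats do
  def mean(xs), do: Enum.sum(xs) / length(xs)

  # Normal-approximation 95% CI from the sample mean and sample variance.
  def ci95(xs) do
    n = length(xs)
    m = mean(xs)
    var = Enum.reduce(xs, 0.0, fn x, acc -> acc + (x - m) * (x - m) end) / (n - 1)
    half = 1.96 * :math.sqrt(var / n)
    {m - half, m + half}
  end
end

# Hypothetical mean-wait outputs from ten replications.
waits = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7, 4.1, 4.0]
{lo, hi} = RepStats.ci95(waits)
```

With hundreds or thousands of replications the interval tightens, which is exactly the analysis Law says is usually skipped for cost.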
The expense is the per-replication wall time multiplied by the number of replications. On one core. Sequentially. SimPy: 6.3 milliseconds per replication at 200,000 time units. One thousand replications: 6.3 seconds. This is the number the plant manager waits for.
On one core, Elixir is slower than SimPy on this model: 18.2 milliseconds per replication. At this entity count, `Map.fetch!` lookups dominate; Python's generators beat Elixir's hash-trie lookups for large simulations. One thousand sequential replications in Elixir: 18.2 seconds. Worse than SimPy.
This is the fact we could have hidden. We didn’t.
## The Right Numbers
Replications are independent. Replication 1 with seed 1 has no dependency on replication 2 with seed 2. Each produces an independent trajectory. The only coordination is collecting the results.
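The independence claim is easy to check: seeding each replication makes it reproducible and decouples it from its neighbors. A sketch with a stand-in workload in place of a real model (`run` is illustrative, not sim_ex code):

```elixir
# Illustrative stand-in for a replication: seed the process-local RNG,
# then do some random-number work. Same seed, same trajectory.
run = fn seed ->
  :rand.seed(:exsss, {seed, 0, 0})
  Enum.sum(for _ <- 1..1_000, do: :rand.uniform())
end

a = run.(1)
b = run.(1)   # identical seed: identical result
c = run.(2)   # different seed: an independent trajectory
```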
The BEAM virtual machine runs 88 schedulers on 88 cores. Each scheduler is a native OS thread with its own run queue. There is no Global Interpreter Lock. `Task.async_stream` distributes one thousand replications across all schedulers with one function call.
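The fan-out itself is one pipeline. A sketch with a placeholder workload standing in for a real model run (`run_one` is illustrative, not sim_ex's API):

```elixir
# Distribute 1,000 seeded replications across every scheduler.
# run_one/1 is a placeholder workload; each task is its own BEAM
# process, so seeding the process-local RNG keeps runs independent.
run_one = fn seed ->
  :rand.seed(:exsss, {seed, 0, 0})
  Enum.sum(for _ <- 1..100, do: :rand.uniform())
end

results =
  1..1_000
  |> Task.async_stream(run_one,
       max_concurrency: System.schedulers_online(),
       ordered: false)
  |> Enum.map(fn {:ok, r} -> r end)
```

`ordered: false` lets results stream back as they finish, which matters when replication times vary; the only coordination is the final collect.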
| Configuration | 1,000 reps × 200K | Per-rep | vs SimPy |
|---|---|---|---|
| SimPy (Python, sequential) | ~6,300ms | 6.3ms | 1.0x |
| Elixir sequential | 18,212ms | 18.2ms | 0.3x |
| Elixir parallel (88 cores) | 683ms | 0.7ms | 9.4x |
| Rust NIF sequential | 7,391ms | 7.4ms | 0.9x |
| Rust NIF parallel (88 cores) | 207ms | 0.2ms | 30x |
Two hundred and seven milliseconds. One thousand complete simulation replications, each with a different random seed, each running 200,000 time units. In less time than it takes to blink.
The Elixir engine — which is slower than SimPy per-replication — finishes in 683 milliseconds. Nine times faster than SimPy. Not because each replication is faster. Because 88 replications run at the same time.
Thirty to one.
## Why SimPy Cannot Do This
Python's Global Interpreter Lock prevents true thread parallelism. `multiprocessing` spawns OS processes and pays IPC and serialization overhead for every result. `concurrent.futures` thread pools serialize through the GIL, and its process pools carry the same IPC cost as `multiprocessing`. `asyncio` is cooperative: one thread, interleaved execution. None of these gives you 88 independent simulations running on 88 cores with zero coordination overhead.
The BEAM was designed for ten thousand concurrent telephone switches. One thousand concurrent simulations is a rounding error.
## The One-Line Change
```elixir
results = Sim.Experiment.replicate(fn seed ->
  {:ok, r} = MyModel.run(seed: seed, stop_time: 200_000.0)
  r.stats[:machine].mean_wait
end, 1000)
```
Parallel by default. `Task.async_stream` with `max_concurrency: System.schedulers_online()`. No configuration. No thread pool. No MPI. Every sim_ex user with a multi-core machine gets 30x over SimPy without changing a line of model code.
## What SimPy Still Wins
Fifteen thousand GitHub stars. Hundreds of tutorials. A coroutine syntax that feels natural to anyone who has written Python. An ecosystem of monitoring tools, logging hooks, and integration libraries built over fifteen years by a community that dwarfs ours by three orders of magnitude.
These are real advantages. They are the advantages of incumbency. Speed is not among them — not at the analysis scale that matters.
## What sim_ex Wins Beyond Speed
The DSL is the differentiator. Not the 2x per-rep. Not the 30x parallel. The fact that a manufacturing engineer can read `seize :machine` / `hold exponential(8.0)` / `release :machine` and know what it means without being told.
The Bayesian integration is the moat. SimPy has no equivalent of “run 1,000 replications with posterior-sampled parameters from eXMC.” It has no particle filter for streaming calibration. It has no BART for sensitivity analysis. It is a simulation framework. sim_ex is a simulation framework that connects to an inference ecosystem. The speed makes that connection practical. The connection makes the speed valuable.
## The Numbers That Matter
| What you care about | SimPy | sim_ex |
|---|---|---|
| One simulation run | Fine | Fine (2–3x faster) |
| 1,000 replications for UQ | 6.3 seconds | 207 milliseconds |
| Can a process engineer read it? | No | Yes |
| Does it learn from data? | No | Yes (eXMC + smc_ex) |
| Does it use all your cores? | No (GIL) | Yes (88 schedulers) |
The single-run speedup is nice. The parallel speedup changes what’s possible. The DSL changes who can use it. The Bayesian integration changes what it means.
Full results and reproduction scripts at github.com/borodark/sim_ex/benchmark/simpy_race. Same models. Same parameters. Sequential and parallel. Fair fight. Load average reported with every measurement.