Eighty-Seven Idle Schedulers

April 2026 sim_ex BEAM parallel

The benchmark report said it clearly: load average 1.0 on 88 cores. One scheduler doing all the work. Eighty-seven watching. Sched% at 0.6%. The tight-loop engine, by design, trades parallelism for zero-overhead dispatch. One process, one calendar, one entity map, one tail-recursive function call. Fast. Sequential. Lonely.
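That tight-loop shape can be sketched in a few lines. This is an illustrative miniature, not the sim_ex internals — the module, event shapes, and handler here are invented for the example:

```elixir
# A minimal next-event time advance loop: one calendar, one state,
# one tail-recursive function. Names and event shapes are illustrative.
defmodule TightLoop do
  # calendar: list of {time, event} tuples sorted by time.
  def run([], state, now), do: {state, now}

  def run([{time, event} | rest], state, _now) do
    # Event N+1 consumes the state produced by event N — inherently sequential.
    {new_state, new_events} = handle(event, state)
    calendar = Enum.sort_by(rest ++ new_events, &elem(&1, 0))
    run(calendar, new_state, time)
  end

  # A toy handler: increment the state, schedule nothing new.
  defp handle({:inc, n}, state), do: {state + n, []}
end

TightLoop.run([{1.0, {:inc, 2}}, {2.0, {:inc, 3}}], 0, 0.0)
# → {5, 2.0}
```

Every call consumes the previous call's output, which is exactly why this loop cannot be split across schedulers without changing what it computes.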

This is the correct architecture for one simulation run. Next-event time advance is inherently sequential — event N+1 depends on the state produced by event N. You cannot parallelize the event loop without changing the semantics. Sim-Diasca tried with tick-diasca barriers. We tried with the parallel engine. Both add overhead that exceeds benefit for cheap events.

The question is not how to parallelize one run. The question is what to do with the other 87 schedulers.

The Answer Was Always Replications

Input uncertainty analysis requires running the same simulation hundreds or thousands of times with different parameter draws. Averill Law calls this “rarely done because it’s too expensive.” The expense is the per-replication wall time multiplied by the number of replications. Sequential expense. One core, one replication at a time, one thousand times.

But replications are independent. Replication 1 with seed 1 has no dependency on replication 2 with seed 2. Each produces an independent trajectory through the state space. The only coordination needed is collecting the results afterward.
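The independence claim is concrete on the BEAM: each replication can thread its own PRNG state explicitly, so no run can observe another's randomness. A small sketch using Erlang's functional :rand API (the seed shape here is an illustrative choice):

```elixir
# Each replication carries its own private PRNG state, threaded explicitly.
# No global RNG, no shared mutable state between replications.
run = fn seed ->
  state = :rand.seed_s(:exsss, {seed, seed, seed})
  {value, _next_state} = :rand.uniform_s(state)
  value
end

# The same seed reproduces the same trajectory, run anywhere, in any order —
# which is what makes collecting results the only coordination needed.
true = run.(42) == run.(42)
```

Because the state is a plain value rather than process-global, scheduling order and core placement cannot change any replication's trajectory.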

This is what the BEAM was built for.

results = Sim.Experiment.replicate(fn seed ->
  {:ok, r} = MyModel.run(seed: seed, stop_time: 200_000.0)
  r.stats[:machine].mean_wait
end, 1000)

One function call. One thousand replications. Parallel by default. Task.async_stream with max_concurrency: System.schedulers_online(). No thread pool configuration. No MPI. No Ray. No Dask. The BEAM runtime schedules 88 concurrent simulations across 88 schedulers with zero user-visible infrastructure.
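A minimal version of that dispatch fits in a dozen lines. ReplicateSketch below is a hypothetical reimplementation for illustration, not the actual sim_ex source:

```elixir
# Sketch of a parallel replicate/2: fan seeds 1..n out across all online
# schedulers via Task.async_stream, collect results in seed order.
defmodule ReplicateSketch do
  def replicate(run_fn, n) do
    1..n
    |> Task.async_stream(run_fn,
      max_concurrency: System.schedulers_online(),
      timeout: :infinity,
      ordered: true
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end
end

# A stand-in "model": deterministic per seed, so results are reproducible.
ReplicateSketch.replicate(fn seed -> seed * seed end, 8)
# → [1, 4, 9, 16, 25, 36, 49, 64]
```

`ordered: true` keeps results aligned with their seeds even though completion order varies across schedulers; `timeout: :infinity` matters because a long replication is expected behavior, not a hang.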

The Numbers

Configuration                  1,000 reps × 200K    Per-rep    vs SimPy
SimPy (Python, sequential)     ~6,300ms             6.3ms      1.0x
Elixir sequential              18,212ms             18.2ms     0.3x
Elixir parallel (88 cores)     683ms                0.7ms      9.4x
Rust NIF sequential            7,391ms              7.4ms      0.9x
Rust NIF parallel (88 cores)   207ms                0.2ms      30.4x

Read that table carefully. Elixir sequential is slower than SimPy — 18.2ms per replication versus 6.3ms. The Map.fetch! overhead at 200,000 time units dominates. Python’s generators are faster than Elixir’s hash trie lookups for large entity maps. On a single core, SimPy wins.

On 88 cores, Elixir is 9.4x faster than SimPy. Not because each replication is faster — it is not. Because 88 replications run simultaneously. The per-replication cost is irrelevant when 87 other replications are running at the same time.

The Rust NIF parallel is 30.4x faster. Each NIF runs on a dirty scheduler, 88 in parallel, zero garbage collection contention between them. Two hundred and seven milliseconds for one thousand complete simulation replications.

Why VM Flags Don’t Help

We tested five Erlang VM flag configurations:

Flags                          Time (176 reps)
default                        135ms
+sbt tnnps (NUMA-aware)        132ms
+sbt ts (thread spread)        126ms
+sbt tnnps +sub true           138ms
+sbt tnnps +swt very_low       140ms

All five configurations land within roughly 10% of one another. NUMA pinning, scheduler utilization balancing, wakeup thresholds — none of them matter. The replications are independent: no shared state, no cross-scheduler communication, no lock contention. The BEAM’s default scheduler is already optimal for embarrassingly parallel workloads.

This is the opposite of the ETS engine finding. There, scheduler pinning changed the result by 26% because garbage collection and memory allocation compete for NUMA bandwidth. Here, each replication is a self-contained world with its own Map, its own calendar, its own PRNG state. The schedulers never talk to each other. The default is the answer.

The One-Line Change

Sim.Experiment.replicate now defaults to parallel: true. That is the entire change. The function already accepted a parallel: option. The default was false. Now it is true.

# Before: sequential by default, opt-in parallel
results = Sim.Experiment.replicate(run_fn, 1000, parallel: true)

# After: parallel by default, opt-out sequential
results = Sim.Experiment.replicate(run_fn, 1000)
# sequential for debugging:
results = Sim.Experiment.replicate(run_fn, 1000, parallel: false)

Every sim_ex user with a multi-core machine gets a 10-30x speedup over SimPy without changing a line of model code.

The Lesson

The BEAM’s killer feature for simulation is not making one run fast. It is making 88 runs simultaneously with zero infrastructure. The per-replication cost matters less than the per-replication concurrency. A framework that runs each replication in 18ms but runs 88 at once finishes before a framework that runs each in 6ms but can only run one at a time.

SimPy cannot do this. Python’s Global Interpreter Lock prevents true thread parallelism. multiprocessing sidesteps the GIL but adds process-spawn and IPC serialization overhead. concurrent.futures thread pools still serialize through the GIL. The BEAM has no GIL. Each scheduler is a native OS thread running its own simulation in complete isolation. The runtime was designed for ten thousand concurrent telephone switches. One thousand concurrent simulations is a rounding error.

Eighty-seven schedulers are no longer idle.


Sim.Experiment.replicate is at github.com/borodark/sim_ex/lib/sim/experiment.ex. Parallel by default. One function, one line, all your cores.