At 100 entities, Elixir’s persistent Map is faster than ETS. At 1,000, ETS edges ahead; by 10,000 the Map is back in front. At 100,000, with schedulers pinned the way you would pin them in production, the Map is 26% slower. The crossover is not where you would expect it, and the reason is not what you would guess.
This is a story about garbage.
The Setup
sim_ex runs a discrete-event simulation in a single tail-recursive
function. No GenServer in the hot path. The loop pops the next event
from a :gb_trees priority queue, looks up the target entity,
dispatches, stores the updated state, inserts new events, recurs. The
question is where the entity states live.
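The shape of that loop, as a minimal sketch. The module and function names here (LoopSketch, run/2, dispatch/2) are illustrative, not sim_ex’s actual API, and the Map variant is shown:

```elixir
# Minimal sketch of the hot loop's shape. `calendar` is a :gb_trees
# priority queue keyed by {time, id}; `entities` is the storage under
# test (a Map here). Names are illustrative, not sim_ex's API.
defmodule LoopSketch do
  def run(calendar, entities) do
    case :gb_trees.is_empty(calendar) do
      true ->
        entities

      false ->
        # Pop the next event from the priority queue.
        {{_time, id}, event, calendar} = :gb_trees.take_smallest(calendar)
        # Look up the target entity, dispatch, store the updated state.
        state = Map.fetch!(entities, id)
        entities = Map.put(entities, id, dispatch(event, state))
        # New events would be inserted into `calendar` here.
        run(calendar, entities)
    end
  end

  # Stand-in handler: the real engine calls a user-defined handle_event/3.
  defp dispatch({:add, n}, state), do: state + n
end
```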
Option A: an Elixir Map. Persistent, immutable, built on a
Hash Array Mapped Trie with branching factor 32. Lookup is O(log32 N)
— four pointer follows at 100,000 entries. Update is also
O(log32 N), but it creates a new root and copies 3–4 internal nodes.
The old nodes become garbage.
Option B: an ETS table. Mutable, lives outside the process heap. Lookup is O(1) — one hash. Update is O(1) — overwrites in place. No garbage. But every read copies the term from ETS heap to process heap, and every write copies the other direction. Two copies per event, no matter the table size.
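Per event, the two access patterns look like this, sketched with a toy integer as entity state (the table options and the id 42 are illustrative):

```elixir
# Per-event access under each option, with a toy integer as entity state.
entities = %{42 => 0}

# Option A: persistent Map. The update copies the root-to-leaf path;
# the replaced trie nodes become garbage on the process heap.
state = Map.fetch!(entities, 42)             # O(log32 N) pointer follows
entities = Map.put(entities, 42, state + 1)  # new root, old path is garbage

# Option B: ETS. Read and write each copy the term across the
# process-heap / ETS-heap boundary, but nothing is left for the GC.
table = :ets.new(:entities, [:set, :public])
:ets.insert(table, {42, 0})
state = :ets.lookup_element(table, 42, 2)    # copies term to process heap
:ets.insert(table, {42, state + 1})          # copies term to ETS heap
```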
At small N, copying is more expensive than pointer-following. At large N, garbage collection is more expensive than copying. Somewhere in between, they cross. We measured where.
The Numbers
PHOLD benchmark, 88-core Xeon E5-2699 v4, OTP 27. Same entities, same events, same seed. Only the storage differs.
Without scheduler pinning
| Entities | Map (events/s) | ETS (events/s) | Ratio | Map mem | ETS mem |
|---|---|---|---|---|---|
| 100 | 333,276 | 232,428 | 0.70x | 1.8 MB | — |
| 1,000 | 131,329 | 144,760 | 1.10x | 6.8 MB | 6.7 MB |
| 10,000 | 112,550 | 94,949 | 0.84x | 151 MB | 46 MB |
| 50,000 | 89,880 | 79,011 | 0.88x | 769 MB | 107 MB |
| 100,000 | 88,315 | 80,437 | 0.91x | 788 MB | 222 MB |
Map wins everywhere except 1,000. The persistent data structure is fast enough that its garbage doesn’t matter — because on an unpinned 88-core box, the garbage collector can spread its work across idle schedulers. The BEAM’s GC is per-process, but the memory allocator’s carrier management runs on whichever scheduler is available. With 87 idle schedulers, there is always one available.
With +sbt tnnps
Now pin the schedulers to NUMA nodes. This is what you do in production
— it gives +34% throughput on parallel MCMC workloads by keeping
cache lines local. One flag: --erl "+sbt tnnps".
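A hypothetical invocation, assuming the benchmark script shipped with the repo:

```shell
# +sbt tnnps binds scheduler threads so they spread over NUMA nodes,
# then over processors within each node. Adjust the script path to
# your setup.
elixir --erl "+sbt tnnps" -S mix run benchmark/full_bench.exs
```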
| Entities | Map (events/s) | ETS (events/s) | Ratio |
|---|---|---|---|
| 100,000 | 63,699 | 80,081 | 1.26x |
ETS wins by 26%. Same code. Same entities. One flag changed.
What Happened
The Map engine at 100,000 entities creates approximately six HAMT nodes
of garbage per event: three from Map.fetch! navigating the
trie (these are on the process heap already, but they pin the old root
alive until the next GC), and three from Map.put copying the
path from root to the updated leaf. At 80,000 events per second, that is
480,000 garbage nodes per second.
Without scheduler pinning, the BEAM’s erts_alloc
carrier management can migrate memory blocks across schedulers. GC pauses
on the Engine process’s scheduler are short because deallocation
work is distributed. Load average at 100K entities: 2–3 on 88 cores.
The extra load is allocator threads, not simulation work.
With +sbt tnnps, schedulers are pinned to specific cores on
specific NUMA nodes. The Engine process’s scheduler can no longer
offload allocator work to other nodes. GC pauses get longer. The process
that creates the garbage must also clean it up, and it can’t do
simulation work while it’s sweeping.
The Map engine went from 88,315 to 63,699 events/sec — a 28% regression from a flag that is supposed to help.
The ETS engine went from 80,437 to 80,081 events/sec — unchanged. Because ETS doesn’t create garbage on the process heap. The entity state lives in ETS’s own memory, managed by its own allocator, invisible to per-process GC. The Engine process’s heap stays small: just the current event, the calendar, and a few temporaries. Nothing to sweep. Nothing to pin.
The Memory Paradox
ETS uses less process memory but more total memory:
|  | Map | ETS |
|---|---|---|
| Process heap | 893 MB | ~50 MB |
| ETS heap | 0 | ~1,170 MB |
| Total | 893 MB | 1,223 MB |
ETS stores its own copy of every entity state, and every
lookup_element copies the term to the process heap temporarily.
Total memory is 37% higher. But the process heap — the thing the
garbage collector must traverse — is 18× smaller. GC time is
proportional to live heap size, not total allocated memory. The garbage
collector doesn’t see ETS. That is the entire point.
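A quick way to see this, sketched with throwaway data: fill an ETS table with a large term set and note that the objects live in the table, not on the process heap the GC traverses. The table name and sizes are arbitrary.

```elixir
# Build 50,000 entries, then copy them into ETS. The table holds a
# second copy of every entry, but that copy sits in ETS's own
# allocator, outside the process heap that per-process GC sweeps.
big = for i <- 1..50_000, into: %{}, do: {i, {i, i * 2, "payload"}}

table = :ets.new(:demo, [:set, :public])
for {k, v} <- big, do: :ets.insert(table, {k, v})

# Process.info/2 reports only the process's own heap; the ETS copy
# is invisible to it.
{:memory, process_bytes} = Process.info(self(), :memory)
```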
The Crossover
The crossover depends on whether you pin schedulers:
| Entities | Unpinned | Pinned (tnnps) |
|---|---|---|
| 100 | Map wins | Map wins |
| 1,000 | ETS wins (1.1x) | ETS wins (likely) |
| 10,000 | Map wins (0.84x) | Even (estimated) |
| 100,000 | Map wins (0.91x) | ETS wins (1.26x) |
In production, you pin schedulers. In production with 100K entities, ETS is 26% faster. The optimization that helps parallel workloads (NUMA pinning) hurts the single-threaded Map engine’s GC. ETS is immune because it bypasses per-process GC entirely.
The Lesson
This is the same architectural pattern we keep finding. The JIT boundary in the NUTS sampler: leapfrog inside EXLA, tree builder outside, copy at the boundary. The Engine vs GenServer result: processes for structure, functions for the hot path. And now: persistent data structures for small state, mutable ETS for large state, the crossover determined not by lookup cost but by garbage collector interaction with scheduler topology.
The BEAM is a runtime that gives you immutable data structures by default and mutable shared tables as an escape hatch. Most Elixir developers learn that ETS is for shared state between processes. It is. But it is also for large state within a single process, when the alternative is a persistent data structure that generates garbage faster than the runtime can collect it.
The number that matters is not the lookup cost. It is the GC cost per event, multiplied by the GC interaction with your scheduler topology, multiplied by 80,000 events per second. That product is the reason an O(1) mutable table can lose to an O(log32 N) immutable trie at 10,000 entities and win at 100,000. The algorithm didn’t change. The garbage did.
Is This the Time When Someone Has to Mention Rust?
Yes.
We tried immutable maps. We tried mutable tables. The garbage collector won both arguments. The next argument is in a language that doesn’t have one.
We have done this before. StochTree-Ex needed to evaluate every possible split point across 500 features for 200 trees, 200 iterations. Pure Elixir: 4.7 hours. Rust NIF with pre-sorted column indices: 2 minutes. A 133× speedup — not because Rust is fast, but because one NIF call processed an entire tree without crossing the boundary. The NUTS sampler taught the same lesson: NIF around the outer loop was slower (0.5x) because per-iteration boundary crossing ate the savings. NIF around the inner subtree was faster (1.5x) because it batched 4+ leapfrog steps per call.
The rule: Rust NIFs help when you can batch work inside the boundary.
sim_ex’s event loop cannot batch — each event dispatches to a
user-defined Elixir handle_event/3. A Rust NIF around the
loop would cross the boundary 80,000 times per second, paying the same
tax we just escaped from GenServer.
Unless the entities aren’t Elixir.
The DSL changes everything. seize :barber / hold exponential(16) /
release :barber has no Elixir in the hot path. It is a state
machine with known transitions, known distributions, known resource
protocols. A compiler — not a NIF wrapper, a compiler — could
translate the DSL to Rust: entity states as contiguous Vec,
calendar as BinaryHeap, service times sampled by
rand::distributions, the entire simulation in one NIF call.
No boundary crossing per event. No garbage collection at all. Results
come back as a binary blob, decoded once.
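Under those assumptions, the generated engine’s shape might look like the sketch below. This is not sim_ex code: the entity state is a toy counter, and a fixed hold time stands in for rand::distributions so the sketch stays dependency-free.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Sketch of a DSL-compiled engine: entity states in a contiguous Vec,
// the calendar as a BinaryHeap, the whole run inside one call (one NIF
// boundary crossing, no per-event garbage).

// Event = (time in integer ticks, entity id). Reverse turns the
// max-heap into a min-heap ordered by time.
type Event = Reverse<(u64, usize)>;

fn run(n_entities: usize, horizon: u64, hold: u64) -> (u64, Vec<u64>) {
    // Toy state: events seen per entity, updated in place.
    let mut states: Vec<u64> = vec![0; n_entities];
    // Seed one event per entity at time 0.
    let mut calendar: BinaryHeap<Event> =
        (0..n_entities).map(|id| Reverse((0, id))).collect();
    let mut processed = 0;

    while let Some(Reverse((time, id))) = calendar.pop() {
        if time >= horizon {
            break;
        }
        states[id] += 1; // dispatch: mutate in place, nothing to collect
        processed += 1;
        calendar.push(Reverse((time + hold, id))); // schedule follow-up
    }
    (processed, states)
}
```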
That is not an optimization of the Elixir engine. It is a different engine, generated from the same DSL, for the case where the model is large enough that the BEAM’s memory model becomes the bottleneck. The Elixir engine stays for interactive simulation, live dashboards, hot code reload, fault-tolerant distributed models — everything the BEAM was built for. The Rust engine takes over when you have 100,000 entities and a deadline.
Two runtimes, one DSL, zero compromises. StochTree-Ex proved the pattern: Elixir for orchestration, Rust for the inner computation. The question was never if Rust would enter the simulation engine. The question was what the boundary looks like. Now we know: it looks like a compiler, and the DSL is the interface.
Update: The Third Answer
We wrote this piece arguing about which data structure to use for entity state — Map or ETS — and which language to use for the inner loop — Elixir or Rust. We were optimizing the wrong dimension.
The single-threaded engine runs at load average 1.0 on 88 cores. Eighty-seven schedulers idle. But replications are independent — replication 1 has no dependency on replication 2. One thousand replications across 88 schedulers: 207 milliseconds in Rust, 683 milliseconds in Elixir. The analysis that was “rarely done” finishes before the slide deck loads.
The Map vs ETS question still matters for one run at 100K entities. But the question that matters for the plant manager is not “how fast is one run?” It is “how fast is the analysis?” And the answer is: thirty to one.
sim_ex is at
github.com/borodark/sim_ex.
Five engine modes: Map, ETS, Diasca, Parallel, Rust NIF. 120 tests
(including property-based and adversarial statem). Parallel replications
by default. The benchmark is in benchmark/full_bench.exs.