The Map That Remembers Too Much

April 2026 · sim_ex · ETS · BEAM internals

At 100 entities, Elixir’s persistent Map is faster than ETS. At 1,000, ETS edges ahead; at 10,000, the Map is back in front. At 100,000, with schedulers pinned, the Map is 26% slower. The crossover is not where you would expect it, and the reason is not what you would guess.

This is a story about garbage.

The Setup

sim_ex runs a discrete-event simulation in a single tail-recursive function. No GenServer in the hot path. The loop pops the next event from a :gb_trees priority queue, looks up the target entity, dispatches, stores the updated state, inserts new events, recurs. The question is where the entity states live.
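The loop’s shape can be sketched in a few lines. This is an illustrative sketch, not sim_ex’s actual API: `LoopSketch`, `handle_event/1`, and the entity shape are all made up for the example.

```elixir
# Minimal sketch of the hot loop: pop next event, look up the entity,
# dispatch, store the updated state, recur. The calendar is a :gb_trees
# keyed by {time, entity_id}; here the entity store is a plain Map.
defmodule LoopSketch do
  def run(calendar, entities, handled \\ 0) do
    case :gb_trees.is_empty(calendar) do
      true ->
        {entities, handled}

      false ->
        {{_time, id}, _event, calendar} = :gb_trees.take_smallest(calendar)
        state = Map.fetch!(entities, id)            # lookup
        new_state = handle_event(state)             # dispatch
        entities = Map.put(entities, id, new_state) # store
        run(calendar, entities, handled + 1)        # tail call: no stack growth
    end
  end

  # Stand-in for a user-defined event handler.
  defp handle_event(state), do: %{state | count: state.count + 1}
end

calendar =
  Enum.reduce(1..3, :gb_trees.empty(), fn t, cal ->
    :gb_trees.insert({t, rem(t, 2)}, :tick, cal)
  end)

entities = %{0 => %{count: 0}, 1 => %{count: 0}}
{entities, handled} = LoopSketch.run(calendar, entities)
IO.inspect({entities, handled})
```

Real handlers would also insert new future events into the calendar before recurring; the sketch omits that to keep the loop shape visible.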

Option A: an Elixir Map. Persistent, immutable, built on a Hash Array Mapped Trie with branching factor 32. Lookup is O(log32 N) — four pointer follows at 100,000 entries. Update is also O(log32 N), but it creates a new root and copies 3–4 internal nodes. The old nodes become garbage.
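The persistence is easy to observe directly. In the sketch below, `:erts_debug.size/1` (an undocumented but long-stable ERTS helper, used here only for illustration) counts shared subterms once, so sizing the old and new maps together exposes the structural sharing; the few unshared path nodes are exactly the per-update garbage described above.

```elixir
# Map.put never mutates the original: it returns a new root and copies the
# path from root to the updated leaf. Everything else is shared.
old = Map.new(1..100_000, fn i -> {i, i} end)
new = Map.put(old, 42, :updated)

# The original map is untouched.
1 = Map.fetch!(old, 42)
:updated = Map.fetch!(new, 42)

# Sizing both maps together (sharing counted once) is barely larger than
# sizing one of them; sizing them separately double-counts the shared trie.
both = :erts_debug.size([old, new])
each = :erts_debug.size(old) + :erts_debug.size(new)
IO.inspect({both, each})
```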

Option B: an ETS table. Mutable, lives outside the process heap. Lookup is O(1) — one hash. Update is O(1) — overwrites in place. No garbage. But every read copies the term from ETS heap to process heap, and every write copies the other direction. Two copies per event, no matter the table size.
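The ETS counterpart, again as a sketch rather than sim_ex’s actual module layout:

```elixir
# One mutable table: O(1) read and write, but every read copies the stored
# term onto the process heap, and every write copies it back into ETS.
tab = :ets.new(:entities, [:set, :public])
:ets.insert(tab, {42, %{queue: [], busy: false}})

# Read: copies the state term (element 2 of the stored tuple) out of ETS.
state = :ets.lookup_element(tab, 42, 2)

# Write: copies the updated term back in, overwriting in place.
# The old version is simply gone; no garbage lands on the process heap.
:ets.insert(tab, {42, %{state | busy: true}})

%{busy: true} = :ets.lookup_element(tab, 42, 2)
```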

At small N, copying is more expensive than pointer-following. At large N, garbage collection is more expensive than copying. Somewhere in between, they cross. We measured where.

The Numbers

PHOLD benchmark, 88-core Xeon E5-2699 v4, OTP 27. Same entities, same events, same seed. Only the storage differs.

Without scheduler pinning

| Entities | Map (events/s) | ETS (events/s) | Ratio | Map mem | ETS mem |
|---------:|---------------:|---------------:|------:|--------:|--------:|
|      100 |        333,276 |        232,428 | 0.70x |  1.8 MB |         |
|    1,000 |        131,329 |        144,760 | 1.10x |  6.8 MB |  6.7 MB |
|   10,000 |        112,550 |         94,949 | 0.84x |  151 MB |   46 MB |
|   50,000 |         89,880 |         79,011 | 0.88x |  769 MB |  107 MB |
|  100,000 |         88,315 |         80,437 | 0.91x |  788 MB |  222 MB |

Ratio is ETS throughput relative to Map: below 1.0x, the Map wins.

Map wins everywhere except 1,000. The persistent data structure is fast enough that its garbage doesn’t matter — because on an unpinned 88-core box, the garbage collector can spread its work across idle schedulers. The BEAM’s GC is per-process, but the memory allocator’s carrier management runs on whichever scheduler is available. With 87 idle schedulers, there is always one available.

With +sbt tnnps

Now pin the schedulers to NUMA nodes. This is what you do in production — it gives +34% throughput on parallel MCMC workloads by keeping cache lines local. One flag: --erl "+sbt tnnps".

| Entities | Map (events/s) | ETS (events/s) | Ratio |
|---------:|---------------:|---------------:|------:|
|  100,000 |         63,699 |         80,081 | 1.26x |

ETS wins by 26%. Same code. Same entities. One flag changed.

What Happened

The Map engine at 100,000 entities creates approximately six HAMT nodes of garbage per event: three from Map.fetch! navigating the trie (these are on the process heap already, but they pin the old root alive until the next GC), and three from Map.put copying the path from root to the updated leaf. At 80,000 events per second, that is 480,000 garbage nodes per second.

Without scheduler pinning, the BEAM’s erts_alloc carrier management can migrate memory blocks across schedulers. GC pauses on the Engine process’s scheduler are short because deallocation work is distributed. Load average at 100K entities: 2–3 on 88 cores. The extra load is allocator threads, not simulation work.

With +sbt tnnps, schedulers are pinned to specific cores on specific NUMA nodes. The Engine process’s scheduler can no longer offload allocator work to other nodes. GC pauses get longer. The process that creates the garbage must also clean it up, and it can’t do simulation work while it’s sweeping.

The Map engine went from 88,315 to 63,699 events/sec — a 28% regression from a flag that is supposed to help.

The ETS engine went from 80,437 to 80,081 events/sec — unchanged. Because ETS doesn’t create garbage on the process heap. The entity state lives in ETS’s own memory, managed by its own allocator, invisible to per-process GC. The Engine process’s heap stays small: just the current event, the calendar, and a few temporaries. Nothing to sweep. Nothing to pin.

The Memory Paradox

ETS uses less process memory but more total memory:

|              |    Map |       ETS |
|--------------|-------:|----------:|
| Process heap | 893 MB |    ~50 MB |
| ETS heap     |      0 | ~1,170 MB |
| Total        | 893 MB |  1,223 MB |

ETS stores its own copy of every entity state, and every lookup_element copies the term to the process heap temporarily. Total memory is 37% higher. But the process heap — the thing the garbage collector must traverse — is 18× smaller. GC time is proportional to live heap size, not total allocated memory. The garbage collector doesn’t see ETS. That is the entire point.
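A rough way to see the asymmetry from an IEx session (the entity shape and counts are illustrative; absolute numbers are machine-dependent):

```elixir
# Fill an ETS table with 10,000 entity states and compare what the table
# reports against what the owning process reports as its own memory.
state = %{queue: [], busy: false, stats: {0, 0.0}}

:erlang.garbage_collect()
{:memory, before_bytes} = Process.info(self(), :memory)

tab = :ets.new(:entities, [:set])
Enum.each(1..10_000, fn i -> :ets.insert(tab, {i, state}) end)

# :ets.info/2 reports words; convert to bytes for comparison.
ets_bytes = :ets.info(tab, :memory) * :erlang.system_info(:wordsize)
{:memory, after_bytes} = Process.info(self(), :memory)
heap_growth = after_bytes - before_bytes

# The table holds all 10,000 states; the process heap growth is the loop's
# own temporaries, not the entity data. GC only ever sees the latter.
IO.inspect(%{ets_bytes: ets_bytes, heap_growth: heap_growth})
```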

The Crossover

The crossover depends on whether you pin schedulers:

| Entities | Unpinned         | Pinned (tnnps)    |
|---------:|------------------|-------------------|
|      100 | Map wins         | Map wins          |
|    1,000 | ETS wins (1.1x)  | ETS wins (likely) |
|   10,000 | Map wins (0.84x) | Even (estimated)  |
|  100,000 | Map wins (0.91x) | ETS wins (1.26x)  |

In production, you pin schedulers. In production with 100K entities, ETS is 26% faster. The optimization that helps parallel workloads (NUMA pinning) hurts the single-threaded Map engine’s GC. ETS is immune because it bypasses per-process GC entirely.

The Lesson

This is the same architectural pattern we keep finding. The JIT boundary in the NUTS sampler: leapfrog inside EXLA, tree builder outside, copy at the boundary. The Engine vs GenServer result: processes for structure, functions for the hot path. And now: persistent data structures for small state, mutable ETS for large state, the crossover determined not by lookup cost but by garbage collector interaction with scheduler topology.

The BEAM is a runtime that gives you immutable data structures by default and mutable shared tables as an escape hatch. Most Elixir developers learn that ETS is for shared state between processes. It is. But it is also for large state within a single process, when the alternative is a persistent data structure that generates garbage faster than the runtime can collect it.

The number that matters is not the lookup cost. It is the GC cost per event, multiplied by the GC interaction with your scheduler topology, multiplied by 80,000 events per second. That product is the reason an O(1) mutable table can lose to an O(log32 N) immutable trie at 10,000 entities and win at 100,000. The algorithm didn’t change. The garbage did.

Is This the Time When Someone Has to Mention Rust?

Yes.

We tried immutable maps. We tried mutable tables. The garbage collector won both arguments. The next argument is in a language that doesn’t have one.

We have done this before. StochTree-Ex needed to evaluate every possible split point across 500 features for 200 trees, 200 iterations. Pure Elixir: 4.7 hours. Rust NIF with pre-sorted column indices: 2 minutes. A 133× speedup — not because Rust is fast, but because one NIF call processed an entire tree without crossing the boundary. The NUTS sampler taught the same lesson: NIF around the outer loop was slower (0.5x) because per-iteration boundary crossing ate the savings. NIF around the inner subtree was faster (1.5x) because it batched 4+ leapfrog steps per call.

The rule: Rust NIFs help when you can batch work inside the boundary. sim_ex’s event loop cannot batch, because each event dispatches to a user-defined Elixir handle_event/3. A Rust NIF around the loop would cross the boundary 80,000 times per second, paying the same per-call tax we escaped by removing GenServer from the hot path.

Unless the entities aren’t Elixir.

The DSL changes everything. seize :barber / hold exponential(16) / release :barber has no Elixir in the hot path. It is a state machine with known transitions, known distributions, known resource protocols. A compiler — not a NIF wrapper, a compiler — could translate the DSL to Rust: entity states as contiguous Vec, calendar as BinaryHeap, service times sampled by rand::distributions, the entire simulation in one NIF call. No boundary crossing per event. No garbage collection at all. Results come back as a binary blob, decoded once.

That is not an optimization of the Elixir engine. It is a different engine, generated from the same DSL, for the case where the model is large enough that the BEAM’s memory model becomes the bottleneck. The Elixir engine stays for interactive simulation, live dashboards, hot code reload, fault-tolerant distributed models — everything the BEAM was built for. The Rust engine takes over when you have 100,000 entities and a deadline.

Two runtimes, one DSL, zero compromises. StochTree-Ex proved the pattern: Elixir for orchestration, Rust for the inner computation. The question was never if Rust would enter the simulation engine. The question was what the boundary looks like. Now we know: it looks like a compiler, and the DSL is the interface.

Update: The Third Answer

We wrote this piece arguing about which data structure to use for entity state — Map or ETS — and which language to use for the inner loop — Elixir or Rust. We were optimizing the wrong dimension.

The single-threaded engine runs at load average 1.0 on 88 cores. Eighty-seven schedulers idle. But replications are independent — replication 1 has no dependency on replication 2. One thousand replications across 88 schedulers: 207 milliseconds in Rust, 683 milliseconds in Elixir. The analysis that was “rarely done” finishes before the slide deck loads.
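The fan-out is embarrassingly parallel, so it reduces to Task.async_stream over seeds. A sketch, with `run_replication/1` as a stand-in for one full simulation run (not sim_ex’s actual API):

```elixir
defmodule Replications do
  # Run n independent replications, one per seed, across all schedulers.
  # No shared state: each task owns its own calendar and entity store.
  def run_all(n) do
    1..n
    |> Task.async_stream(&run_replication/1,
      max_concurrency: System.schedulers_online(),
      ordered: false
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # Stand-in workload: a seeded stream of draws summed into one statistic.
  defp run_replication(seed) do
    :rand.seed(:exsss, {seed, seed, seed})
    Enum.sum(Enum.map(1..1_000, fn _ -> :rand.uniform(10) end))
  end
end

results = Replications.run_all(100)
IO.inspect({length(results), Enum.min(results), Enum.max(results)})
```

Seeding inside each task keeps replications reproducible per seed while letting them land on any scheduler; `ordered: false` avoids head-of-line blocking when runs have uneven durations.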

The Map vs ETS question still matters for one run at 100K entities. But the question that matters for the plant manager is not “how fast is one run?” It is “how fast is the analysis?” And the answer is: thirty to one.


sim_ex is at github.com/borodark/sim_ex. Five engine modes: Map, ETS, Diasca, Parallel, Rust NIF. 120 tests (including property-based and adversarial statem). Parallel replications by default. The benchmark is in benchmark/full_bench.exs.