This is Part 4 of "What If Probabilistic Programming Were Different?" Part 1 introduced the thesis. Part 3 covered the optimization journey from 34× slower to 1.9× faster. This part is about a single flag that recovered a third of our throughput.
The number was 0.73. Jobs per second. Forty-four concurrent MCMC sampling tasks on eighty-eight logical processors, and the system was producing 0.73 completions per second.
We had just finished a capacity planning exercise—the responsible kind, with tables and extrapolations. The table said we could handle 1,400 instruments on a 20-minute update cycle. What the table did not say was that we were measuring a system running at two-thirds capacity.
The Machine
Dual-socket Intel Xeon E5-2699 v4. Forty-four physical cores, eighty-eight threads. Two NUMA domains, each with its own memory controller. When a thread on socket 0 reads memory attached to socket 1, the latency roughly doubles. This is not a flaw. It is the cost of scaling beyond what one piece of silicon can address.
The BEAM virtual machine creates one scheduler thread per logical processor—eighty-eight schedulers. By default, these schedulers are unbound: the OS decides which core runs which scheduler, and it decides this hundreds of times per second based on criteria that have nothing to do with NUMA locality or cache warmth.
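Both facts are visible from a running node via :erlang.system_info/1. The values below are what an unbound 88-thread machine reports; your counts will differ:

```elixir
# Number of scheduler threads the BEAM started (one per logical processor).
:erlang.system_info(:schedulers_online)
#=> 88 (machine-dependent)

# How schedulers are bound to cores. Unbound is the out-of-the-box default.
:erlang.system_info(:scheduler_bind_type)
#=> :unbound
```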
The Workload
Each NUTS sampling job takes about 5 seconds of wall time. The work is almost entirely inside EXLA's JIT-compiled XLA code: gradient evaluations, leapfrog integrations, tree building. The working set per job is roughly 3 kilobytes—a parameter vector, a gradient, a momentum vector, and 200 floats of observation data. This fits in L1 cache.
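The 3-kilobyte figure is quick arithmetic, assuming 64-bit floats and a hypothetical parameter dimension of around 60 (the post does not state the model size):

```elixir
# Working-set estimate, assuming 8-byte floats. The parameter dimension (60)
# is a hypothetical stand-in; the post gives only the ~3 KB total.
dim = 60
observations = 200 * 8            # 200 observed floats: 1,600 bytes
vectors = 3 * dim * 8             # params + gradient + momentum: 1,440 bytes
IO.puts(observations + vectors)   # 3040 bytes, far under a 32 KB L1 data cache
```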
Cache was not the bottleneck. NUMA was.
One Flag
Erlang provides the +sbt flag to control scheduler binding. We benchmarked four strategies:
| Strategy | Name | 1 job (ms) | 44 concurrent (jobs/s) |
|---|---|---|---|
| u | Unbound (default) | 4,915 | 0.73 |
| db | Default bind | 4,129 | 0.87 |
| ts | Thread spread | 3,259 | 0.88 |
| tnnps | Thread no-node processor spread | 3,763 | 0.98 |
The unbound configuration—which we had been running in production for three weeks—was the slowest in every category. At forty-four concurrent jobs, tnnps delivered 34% more throughput.
What tnnps Means
The name is an Erlang-style compound that reads as three layered instructions:
- thread — Spread across hardware threads first. Low-numbered schedulers get the first HT thread of each core; higher-numbered get the second. Prevents two schedulers from fighting over one core's execution units.
- no_node — Fill one NUMA node completely before crossing to the next. Despite sounding like "ignore NUMA nodes," it means the opposite: respect node boundaries.
- processor_spread — Within each NUMA node, spread across physical cores as widely as possible.
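The three rules compose into a concrete assignment order. This is a toy illustration, not OTP's implementation: a hypothetical machine with 2 NUMA nodes, 2 cores per node, and 2 hardware threads per core, with the comprehension's generator order encoding the three rules:

```elixir
# Sketch of the tnnps binding order on a hypothetical 2x2x2 topology.
# Tuples are {numa_node, core, hw_thread}; schedulers bind in list order.
nodes = 0..1
cores_per_node = 0..1
threads_per_core = 0..1

order =
  for node <- nodes,              # no_node: finish one NUMA node before the next
      thread <- threads_per_core, # thread: first HT thread of every core first
      core <- cores_per_node do   # processor_spread: fan out across cores
    {node, core, thread}
  end

IO.inspect(order)
#=> [{0, 0, 0}, {0, 1, 0}, {0, 0, 1}, {0, 1, 1},
#    {1, 0, 0}, {1, 1, 0}, {1, 0, 1}, {1, 1, 1}]
```

Note how both cores of node 0 get a scheduler on their first hardware thread before any second thread is used, and node 1 is not touched until node 0 is full.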
A revealing detail: db (default bind) currently maps to tnnps. The Erlang/OTP team has already decided this is the best general-purpose binding for bound schedulers. They just don't bind schedulers by default, because most BEAM workloads are I/O-bound web servers where scheduler migration is harmless.
For compute-bound numerical work—gradient evaluations, leapfrog integrations, tree building—the calculus inverts. Pinning wins because the cost of a NUMA remote access (40ns) exceeds the benefit of OS load balancing across schedulers that are all equally busy anyway.
The Capacity Impact
| Instruments | Wall time per round | Fits in 20-min cycle? |
|---|---|---|
| 100 | 1.4 min | Yes |
| 500 | 7.2 min | Yes |
| 1,000 | 14.4 min | Yes |
| 1,400 | 20.1 min | Barely |
| 2,000 | 28.7 min | Needs 40-min cycle |
Without +sbt tnnps, the 1,000-instrument mark would require 19.2 minutes—leaving almost no headroom. With it, 1,400 instruments fit in a 20-minute window with room for variance.
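The table reduces to one number, assuming wall time scales linearly with instrument count: the throughput implied by its rows is roughly 1.16 jobs/s with tnnps (e.g. 1,400 jobs in 20.1 minutes). A quick check:

```elixir
# Back-of-envelope reproduction of the capacity table, assuming wall time
# scales linearly with instrument count.
implied_jps = 1_400 / (20.1 * 60)   # ~1.16 jobs/s with tnnps

minutes = fn n, jps -> Float.round(n / jps / 60, 1) end

minutes.(1_000, implied_jps)
#=> 14.4 (matches the table)

# Scale throughput by the measured 0.73/0.98 ratio for the unbound case:
minutes.(1_000, implied_jps * 0.73 / 0.98)
#=> 19.3 (the "almost no headroom" figure above, within rounding)
```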
How to Use It
```shell
elixir --erl "+sbt tnnps" -S mix run my_app.exs
```
Verify in a running system:
```elixir
:erlang.system_info(:scheduler_bind_type)
#=> :thread_no_node_processor_spread
```
Which strategy to choose:
- Single-socket (laptops, small servers): +sbt ts
- Multi-socket NUMA (Xeon, EPYC): +sbt tnnps
- Chiplet designs (Ryzen 3000+): +sbt tnnps (CCDs are effectively NUMA nodes)
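If you deploy as an OTP release rather than via mix run, the same flag can go in your VM args template. rel/vm.args.eex is the conventional location for Elixir releases; the exact path depends on your release setup:

```
# rel/vm.args.eex
+sbt tnnps
```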
The Lesson
The BEAM is designed for I/O-bound workloads—web servers, message brokers, telephony switches—where scheduler migration between cores is harmless because the bottleneck is waiting for packets, not computing gradients. When you repurpose it for CPU-bound numerical computation, the assumptions embedded in its defaults stop serving you.
Thirty-four percent is not a rounding error. It is a third of your capacity, donated to the operating system's scheduling heuristics because nobody asked the machine to do otherwise.
Full analysis with sources, benchmark scripts, and NUMA architecture details: docs/SCHEDULER_PINNING.md
Hardware: 2× Intel Xeon E5-2699 v4 @ 2.20GHz (44 cores / 88 threads), 256GB DDR4, dual-socket NUMA. Benchmark: CUDA_VISIBLE_DEVICES="" mix run benchmark/cpu_pinning_bench.exs