Four engines on the BEAM. Continuous parameters, discrete states, unknown functional forms, simulated systems — each solved by the right algorithm, all sharing one runtime where liveness, fault tolerance, and concurrency aren't features but consequences.
Three architectural consequences of building a probabilistic programming framework on a virtual machine designed for telephone switches.
sample_stream sends each posterior draw as a BEAM message to any process — a Scenic window, a Phoenix LiveView, a GenServer. The sampler doesn't know or care what the receiver does.
```elixir
# Stream each posterior draw to any PID — here, the calling process.
Sampler.sample_stream(model, self(), init, opts)

receive do
  {:exmc_sample, i, point_map, step_stat} -> ...
end
```
Each subtree build is wrapped in try/rescue. Zero overhead on the happy path — on the BEAM, entering a try is just registering an exception handler on the process stack. When a subtree crashes, it's replaced with a divergent placeholder.
```elixir
try do
  build_subtree(state, direction, depth, step_fn)
rescue
  # A crashed subtree becomes a divergence, not a crashed chain.
  _ -> divergent_placeholder(state)
end
```
Distribution is just sending messages farther. Start nodes, send the model, collect samples. Erlang's distribution protocol handles serialization. PIDs are location-transparent.
```elixir
:erpc.call(node, Sampler, :sample_stream,
  [model, coordinator_pid, init, opts])
```
Head-to-head against PyMC. 5-seed medians, 1 chain, same model, same data.
| Model | PyMC ESS/s | eXMC ESS/s | Ratio | Winner |
|---|---|---|---|---|
| simple (d=2) | 576 | 469 | 0.81× | PyMC |
| medium (d=5) | 157 | 298 | 1.90× | eXMC |
| stress (d=8) | 185 | 215 | 1.16× | eXMC |
| eight_schools (d=10) | 5 | 12 | 2.55× | eXMC |
| funnel (d=10) | 6 | 2 | 0.40× | PyMC |
| logistic (d=21) | 336 | 69 | 0.21× | PyMC |
| sv (d=102) | 1 | 1 | 1.20× | eXMC |
eXMC wins 4 models to PyMC's 3, including the canonical Eight Schools benchmark (2.55×) and 102-dimensional stochastic volatility (1.20×). With 5-node distribution: 2.88× average scaling.
Four engines. One runtime. Each solves a different shape of uncertainty.
Your parameters are continuous. You wrote the likelihood. You need a posterior.
NUTS sampler with Stan-style warmup. 21 distributions. Warm-start for streaming updates. Beats PyMC on 4 of 7 benchmarks.
```elixir
{trace, stats} = Sampler.sample(model, init,
  num_warmup: 1000,
  warm_start: prev_stats)
```
Nx + EXLA. 337 tests.
States are discrete. Data streams in. The model updates, not restarts. No gradient required.
Online SMC² with parallel rejuvenation on all cores. Beats Chopin's own Python library on 5 of 7 benchmarks. Zero dependencies.
```elixir
result = SMC.run(seir_model, prior, cases,
  n_theta: 400,
  parallel: true)
```
Pure Elixir. Zero deps. 10 tests.
200 features. Unknown functional form. Which ones matter? Linear? Threshold? Interaction?
BART via Rust NIF. ForestTracker sorted indices: 133× speedup. Better RMSE than Python on 6 of 7 tests.
```elixir
{forest, _} = StochTree.BART.fit(x, y,
  num_trees: 200)

importance = StochTree.variable_importance(forest)
```
Elixir + Rust NIF. 14 tests.
Entities are processes. Events are messages. The supervision tree is the system model. Fault tolerance is free.
Discrete-event simulation engine. 539K events/sec. PHOLD benchmark. Tight-loop engine bypasses GenServer in the hot path.
```elixir
{:ok, result} = Sim.run(
  entities: [{:server, Sim.Resource, config}],
  initial_events: [{0.0, :src, :generate}],
  stop_time: 50_000.0)
```
Pure Elixir. Zero deps. 11 tests.
| Your problem | What you hire | Why |
|---|---|---|
| Process drifting, control chart blind | eXMC + smc_ex | Bayesian SPC with live changepoint detection |
| 200 sensor columns, no model | StochTree-Ex | BART discovers which features matter |
| Epidemic spreading, cases arriving daily | smc_ex | O-SMC² updates β, σ, γ in real time |
| When will this bearing fail? | eXMC + StochTree-Ex | Weibull survival + nonlinear degradation |
| Regime shift — trending or mean-reverting? | eXMC + smc_ex | NUTS for continuous params, O-SMC² for discrete states |
| Factory model drifting from reality | sim_ex + smc_ex + eXMC | DES simulation with self-calibrating inputs and posterior uncertainty |
Four libraries. Four mix.exs entries. No shared dependencies. They compose in the same application because they share one thing: the BEAM.
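Concretely, "four mix.exs entries" could look like the sketch below. Only `:exmc`'s name and version appear on this page; the other package names and versions are assumptions:

```elixir
# Hypothetical mix.exs fragment; only {:exmc, "~> 0.1.0"} is attested above.
defp deps do
  [
    {:exmc, "~> 0.1.0"},      # NUTS and friends on Nx + EXLA
    {:smc_ex, "~> 0.1"},      # online SMC², pure Elixir
    {:stoch_tree, "~> 0.1"},  # BART via Rust NIF
    {:sim_ex, "~> 0.1"}       # discrete-event simulation
  ]
end
```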
Here's the result I'm most proud of, because it wasn't designed — it was discovered.
Streaming inference was built in January: a sampler that sends each posterior draw as a BEAM message, so a Scenic visualization window can update trace plots in real time. Distribution was built in February: four Erlang nodes, each running an independent chain, collecting samples through :erpc.
The two features were developed months apart. Neither knew the other existed. When we connected them, the change was three lines of code. The distributed coordinator already forwarded messages to a caller PID. The streaming visualization already listened on a PID for sample messages. We pointed one at the other.
Four nodes. Twenty thousand samples. Twenty-one seconds. The Scenic dashboard updated live from all four nodes simultaneously — trace plots interleaving draws from four independent chains running on four separate machines.
This composition worked on the first attempt because both features were built on the same primitive: `send(pid, {:exmc_sample, i, point_map, step_stat})`.
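A hedged sketch of the receiving half. The `{:exmc_sample, ...}` tuple is the contract described above; the `SampleCollector` module itself is hypothetical, not part of eXMC:

```elixir
defmodule SampleCollector do
  @moduledoc "Hypothetical receiver; only the message shape is eXMC's."

  # Drain n sample messages from the mailbox, keeping draws in arrival order.
  def collect(0, acc), do: Enum.reverse(acc)

  def collect(n, acc) do
    receive do
      # Same pattern whether the send came from this node or another.
      {:exmc_sample, _i, point_map, _step_stat} ->
        collect(n - 1, [point_map | acc])
    end
  end
end
```

Pointing the distributed coordinator at a streaming consumer is then just passing the consumer's PID as the destination — the "three lines of code" in spirit, if not in letter.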
There is a peculiar constraint at the heart of the dominant language in scientific computing: it cannot truly do two things at once. A global lock, woven into the interpreter's memory model decades ago, ensures that threads yield to one another rather than run alongside. Every framework that requires concurrency — and probabilistic programming requires a great deal of it — must work around this fact, layering foreign runtimes beneath the surface.
The BEAM virtual machine has no such constraint. It was built for telephone switches, systems where dropping a call is not an option and where a million conversations must proceed simultaneously without interference. What if we took that machine — designed for reliability under concurrency — and asked it to explore posterior distributions instead of routing phone calls?
Read the full essay →

Four layers, each depending only on the one below.
| Layer | Modules | Responsibility |
|---|---|---|
| IR | Builder, DSL, Dist.* | Model as data |
| Compiler | Compiler, PointMap, Transform | IR → differentiable closure |
| Sampler | NUTS, ADVI, SMC, Pathfinder | Inference algorithms |
| Runtime | Streaming, Distribution, Viz | BEAM integration |
Three ways to define and sample a model.
```elixir
alias Exmc.{Builder, Dist.Normal, Dist.HalfNormal}

ir =
  Builder.new()
  |> Builder.rv("mu", Normal.new(0.0, 10.0))
  |> Builder.rv("sigma", HalfNormal.new(1.0))
  |> Builder.obs("y", Normal.new("mu", "sigma"), data)

{trace, stats} = Exmc.Sampler.sample(ir, %{},
  num_samples: 1000, num_warmup: 1000)
```
```elixir
import Exmc.DSL

model =
  model do
    mu ~ Normal.new(0.0, 10.0)
    sigma ~ HalfNormal.new(1.0)
    y ~ Normal.new(mu, sigma), observed: data
  end
```
```elixir
Exmc.Distributed.sample_chains(model, init,
  nodes: [:"n1@10.0.0.2", :"n2@10.0.0.3"],
  num_samples: 1000, num_warmup: 1000)
```
Add to your mix.exs dependencies:

```elixir
{:exmc, "~> 0.1.0"}
```
The thesis behind the framework.
Probabilistic Programming on BEAM Process Runtimes
Notes on building four engines — three for inference, one for simulation — where none were expected.
A state-of-the-project post for May 2026. The pièce de résistance is a single DSL verb — host — that turns hand-rolled :rpc.call sequences into declarative multi-host deploys, backed by a TLA+-verified two-phase commit protocol. Twelve lines of DSL produce a 100-millisecond two-host coordinated deploy across two Erlang-distributed FreeBSD Mac Pros. Plus the satellites: probnik_qr (planned), nx_vulkan (Phase 2 shipped), exmc (NUTS on GPU), spirit (vendored), and the writing surface this post sits on. Six repositories, one BEAM cluster, three operating systems, two GPUs from 2013, and a deliberate refusal to call any of it a platform.
Two FreeBSD Mac Pros, one Erlang cluster, a chaos test that surfaced exactly the kind of bug a coordinated deployment tool must never ship: a dataset survived on mac-248 when mac-247 failed mid-converge. The fix was a 200-line TLA+ specification of the protocol — and the model checker found a second bug five rounds of manual chaos testing had missed. The spec runs in 0.8 seconds. The bug had cost half a day. Zed’s coordinated converge is now a TLA+-verified 2-phase protocol with three invariants. Write the spec first.
A follow-up measurement to The GPU That Doesn’t Need CUDA. Two 2013 Mac Pros with GT 750M and GT 650M GPUs running FreeBSD beat a 2021 Linux workstation with an RTX 3060 Ti at Bayesian inference. Per-fence instrumentation explains why: FreeBSD’s NVIDIA driver completes vkWaitForFences in 406 µs; Linux’s takes 1,130 µs on the same driver family. Fused leapfrog chain shaders amortize the dispatch overhead 86× over per-op dispatch. One second per 2000 NUTS iterations on a decade-old GPU. The walkable path has now been walked, and the measurements are attached.
Three workstreams just converged. nx_vulkan reached 152/0 on Linux RTX 3060 Ti. The Bastille adapter shipped to zed. The eXMC NUTS sampler runs against a Vulkan backend with no code changes. Together that means a sentence is now true that wasn’t in February: write a probabilistic model in Elixir, ship it to a FreeBSD jail managed by zed, and have it sample on the host’s GPU through Vulkan compute pipelines — no CUDA anywhere in the chain. The mountain of CUDA sophistication is still there. The point was never to climb it. The point was to demonstrate we don’t have to.
From `mix new` to a working Nx tensor backend on Vulkan in seven commits. The bootstrap was the easy part. The hard part was the moment 100 BEAM processes simultaneously called `vkQueueSubmit` and the NVIDIA driver responded by losing the device. What the Vulkan spec said, what the Erlang scheduler did, and what one `Mutex<()>` in Rust bought back. Plus the speedup number nobody who runs FreeBSD will quite believe — three thousand times over Nx.BinaryBackend at four million elements, zero leaked megabytes after thirty seconds of churn, and what falls out the other side when both bugs are interesting.
A Vulkan compute backend for the Spirit spin simulator, built on a 2013 Mac Pro running FreeBSD with a GT 750M, then ported to a Linux box with an RTX 3060 Ti. The 750M proved the FreeBSD path works (1.66x at 1M); the 3060 Ti proved it scales (8.8x at 1M, 13.3x at 4M). Same backend code, same shader, same 540 lines of C++. What it opens up: Nx.Vulkan for Elixir GPU on BSD, GPU-inference jails on FreeBSD, an exit door from the CUDA platform-lock that’s been there the whole time.
A reintroduction to Zed after A0 through A5a + A1.rotate landed. Three thousand lines of Elixir, eighty lines of C, one filesystem, no etcd. ZFS user properties as the state store, encrypted secrets with archived rotation, a Bastille adapter that catches the upstream CLI’s lies, and a process-boundary privilege model where zedweb cannot reach root. What it gives you, what’s not yet in the box, and how it’s tested.
A 540-line Elixir adapter to FreeBSD’s Bastille jail manager. 175 mocked unit tests passed cleanly on a Linux laptop. The first live run on real FreeBSD hardware found seven distinct failures in sequence — none of which any mock could have predicted. What the mock did not know, what the post-condition check finally caught, and why the punchline of an adapter is the line where it stops trusting the tool it adapts.
A FreeBSD NAS I will probably never build. The design arc from rebuild fantasy to a 14-person-month MVP, and what the detour taught about secret hygiene on the BEAM: ZFS user properties as metadata backbone, QR-paired admin login, a phone-resident vault that implements its own Shamir, and the seven ways to lose your keys for good.
The official BDA3 Python demos, ported to Elixir Livebooks. 22 demos across 8 chapters. 13 Stan files translated. 8 datasets vendored. Every numerical claim verified. Same pedagogy, different runtime virtues.
Load average 1.0 on 88 cores. One scheduler working, eighty-seven watching. Until you run 1,000 replications — then they all work, and the analysis takes 207 milliseconds.
Fourteen functions met a pattern-matching rule. Twenty-one call sites met a pipe rule. The two I didn’t refactor and the one I didn’t pipe are the real deliverable — the taxonomy of when NOT to.
A factory floor has three kinds of inspectors. The first two are familiar. The third is the one that finds the bug. Testing simulation engines, explained for industrial engineers.
Two NUTS sampler bugs took four months to find by hand. Every test passed. The chain converged at half the speed it should have. The proper_statem models that catch them are eight hundred lines — The Mechanic and the Transporter — and the postcondition is a histogram, not an invariant.
700 adversarial command sequences. Three state machine models. One bug that 114 tests missed: you can release a resource you never held, and the engine says thank you.
Seed 42 is an anecdote. 350 trials across random parameter spaces proved Little’s Law, flow conservation, and determinism. They also found a bug that 77 point tests missed for four months.
Elixir sequential is slower than SimPy. Elixir parallel is 9x faster. Rust parallel is 30x. The BEAM’s advantage is not per-event speed — it’s per-replication concurrency. 1,000 reps in 207 milliseconds.
At 100,000 entities, Elixir’s persistent Map creates six garbage nodes per event. ETS doesn’t. But the crossover isn’t where you’d expect — it depends on whether you pin your BEAM schedulers to NUMA nodes. A story about garbage.
Building a discrete-event simulation engine on the BEAM. 539,000 events per second. The actor model is the right abstraction for the system, the wrong abstraction for the inner loop. On why GenServers cost 7.8× in the hot path.
5.8× NUTS warmup speedup by reusing the previous posterior's mass matrix. When the data barely changes, so does the posterior geometry. One keyword argument.
How one Erlang flag turned 0.73 jobs/sec into 0.98 on 88 cores doing Bayesian MCMC. On NUMA-aware scheduler binding, the naming of tnnps, and why the BEAM's defaults are wrong for compute-bound work.
On the improbable and slightly reckless decision to build a probabilistic programming framework on a virtual machine designed for telephone switches. A literary technical memoir in twelve chapters.
Seven optimization phases, each revealing something about mixed-runtime probabilistic-programming design. From 3.3 ESS/s to 298 ESS/s without changing the algorithm.
16 distributions, 4 inference methods, GPU acceleration, and a head-to-head race against PyMC. The numbers that prove the thesis.
The architectural thesis: streaming, fault tolerance, and distribution as consequences of the runtime, not features bolted on. Part 1 of the series.