Vulkan on FreeBSD: the Proof

May 2026 · Vulkan · FreeBSD · GPU · nx_vulkan · NUTS

The previous post in this series — The GPU That Doesn’t Need CUDA — ended with a single elementwise add over a million floats, on a 2013 Mac Pro running FreeBSD, beating the BEAM’s binary backend by a factor of 1.66. A proof of concept. A claim that GPU compute on FreeBSD via Vulkan was viable — a claim, not yet a demonstration.

This post is the follow-up measurement. The same hardware. A real workload. Numbers that the first post did not have because the first post had nothing to run on the GPU yet.

What we set out to prove, in April 2026, was a sentence: the same Elixir code that runs CPU-only on a host without a GPU runs GPU-accelerated on a host with one, on FreeBSD via Vulkan. No CUDA. No driver wrappers. No “works on our machine.” A measurement, not a promise. This post is that measurement.

The hardware

| Machine | GPU | Year | OS | Role |
| --- | --- | --- | --- | --- |
| 2013 Mac Pro | NVIDIA GT 750M (Kepler, 2 GB) | 2013 | FreeBSD 15.0 | GPU compute node |
| 2013 Mac Pro | NVIDIA GT 650M (Kepler, 1 GB) | 2013 | FreeBSD 15.0 | Second GPU node |
| Custom workstation | NVIDIA RTX 3060 Ti (Ampere, 8 GB) | 2021 | Linux 6.8 | Reference / dev |

The FreeBSD machines are surplus. The GPUs are a decade old. They cost nothing. They run Vulkan 1.2 via FreeBSD’s nvidia-driver-470 package — the legacy series, the one NVIDIA has not bothered to deprecate because nobody complains.

What we built

nx_vulkan — an Nx tensor backend that dispatches compute to the GPU via Vulkan compute shaders. Written in Elixir + Rust (Rustler NIF) + C++ (spirit’s Vulkan backend) + GLSL (the shaders themselves).

The key innovation, in the sense that this is the part that took real engineering rather than glue, is the fused leapfrog chain shader. Instead of dispatching twelve separate GPU operations per NUTS leapfrog step — the naive approach, what every machine-learning library does because it’s what the framework exposes — we wrote GLSL compute shaders that perform K=32 consecutive leapfrog steps in a single GPU dispatch. One fence wait instead of 12×32 = 384.
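To make the amortization concrete, here is a host-side Python sketch of what one chain dispatch computes, assuming a Normal(μ, σ) target — the function names, step size, and initial conditions are illustrative, not the shader’s actual GLSL:

```python
import math

def grad_logp_normal(q, mu=0.0, sigma=1.0):
    """Closed-form gradient of the Normal log-density: -(q - mu) / sigma^2."""
    return -(q - mu) / sigma ** 2

def leapfrog_chain(q, p, eps=0.1, K=32, mu=0.0, sigma=1.0):
    """K consecutive leapfrog steps, the way the fused shader runs them
    inside one dispatch: no host round-trip between steps."""
    g = grad_logp_normal(q, mu, sigma)
    trajectory = []
    for _ in range(K):
        p_half = p + 0.5 * eps * g            # half-step momentum
        q = q + eps * p_half                  # full-step position
        g = grad_logp_normal(q, mu, sigma)    # gradient at the new position
        p = p_half + 0.5 * eps * g            # second half-step momentum
        logp = -0.5 * ((q - mu) / sigma) ** 2 - 0.5 * math.log(2 * math.pi * sigma ** 2)
        trajectory.append((q, p, logp))
    return trajectory

# One "dispatch": 32 leapfrog steps, one readback of the whole trajectory.
chain = leapfrog_chain(q=1.0, p=0.0)
```

Leapfrog is symplectic, so the Hamiltonian drifts only slightly over the 32 steps; the shader returns the whole trajectory, and the host pays one fence wait for all of it.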

Six distribution families, each with a closed-form gradient baked into the shader:

| Family | Shader | Gradient |
| --- | --- | --- |
| Normal(μ, σ) | leapfrog_chain_normal.spv | -(q-μ)/σ² |
| Exponential(λ) | leapfrog_chain_exponential.spv | 1 - λ·exp(q) |
| Student-t(ν, μ, σ) | leapfrog_chain_studentt.spv | -((ν+1)/(νσ²))·(q-μ)/(1+z²/ν) |
| Cauchy(loc, scale) | leapfrog_chain_cauchy.spv | -2(q-loc)/(scale²+(q-loc)²) |
| HalfNormal(σ) | leapfrog_chain_halfnormal.spv | 1 - exp(2q)/σ² |
| Weibull(k, λ) | leapfrog_chain_weibull.spv | k·(1-(exp(q)/λ)^k) |

Each shader is roughly eighty lines of GLSL. Single workgroup. Shared-memory reduction for per-step log-probability. Push constants carry the distribution parameters. Output: K×n position, momentum, gradient chains plus K log-probabilities. One dispatch.
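The gradients in the table read as taken with respect to an unconstrained coordinate (q = log x for the positive-support families, with the Jacobian folded into the log-density). A quick finite-difference check of the Exponential row, under that assumed parameterization:

```python
import math

def logp_exponential_unconstrained(q, lam=2.0):
    # Exponential(lam) in x, with x = exp(q) and the log|dx/dq| = q Jacobian:
    # log p(q) = log(lam) - lam*exp(q) + q
    return math.log(lam) - lam * math.exp(q) + q

def grad_closed_form(q, lam=2.0):
    # The closed form from the table: 1 - lam*exp(q)
    return 1.0 - lam * math.exp(q)

def grad_finite_diff(f, q, h=1e-6):
    # Central difference approximation of df/dq.
    return (f(q + h) - f(q - h)) / (2 * h)

# The closed form and the numeric derivative agree to ~1e-5 everywhere.
for q in (-1.0, 0.0, 0.7):
    assert abs(grad_finite_diff(logp_exponential_unconstrained, q) - grad_closed_form(q)) < 1e-5
```

The same three-line check applies to each row; baking the closed form into the shader is what lets a leapfrog step avoid any autodiff machinery on the GPU.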

The race

We ran eXMC’s NUTS sampler — one thousand warmup iterations followed by one thousand sampling iterations, five seeds per cell, the median reported — across five distribution families on all three machines.

FreeBSD GT 750M, post-fix

| Model | Wall (ms) | ESS/s |
| --- | --- | --- |
| Normal(0, 1) d=1 | 1,023 | 418.6 |
| Exponential(2) d=1 | 1,032 | 572.0 |
| Student-t(df=3) d=1 | 1,043 | 232.0 |
| HalfNormal(1) d=1 | 1,144 | 229.2 |
| Weibull(k=2, λ=1) d=1 | 1,129 | 350.7 |

One second per model. Two thousand NUTS iterations. On a GPU from 2013 that originally shipped in a laptop.

Linux RTX 3060 Ti, post-fix

| Model | Vulkan wall (ms) | EXLA wall ÷ Vulkan wall |
| --- | --- | --- |
| Normal d=1 | 1,311 | 0.66 |
| Normal d=8 | 1,399 | 1.22 |
| Normal d=50 | 1,698 | 3.17 |
| Exponential | 1,893 | 0.84 |
| Student-t df=3 | 1,342 | 1.04 |
| HalfNormal | 2,031 | 0.49 |
| Weibull k=2 | 1,807 | 0.91 |

Vulkan beats EXLA on three of the seven cells, and the trend with dimensionality is the story: at d=50 it is 3.17 times faster. At d=1, EXLA’s CUDA path has lower per-call overhead — CUDA’s driver is optimized for throughput, not latency. As dimensionality grows, the chain shader spreads the extra coordinates across GPU threads, so its wall time barely moves (1,311 ms at d=1, 1,698 ms at d=50), while EXLA keeps paying its per-call overhead on every step. The crossover sits somewhere between d=1 and d=8 on the Normal family. At d=50 the question is settled.
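The crossover estimate can be read straight off the Normal rows. A sketch that treats the measured ratios as exact and interpolates linearly between d=1 and d=8 — an illustration of the break-even arithmetic, not a profile:

```python
# Measured Normal-family numbers from the table above.
vulkan_wall = {1: 1311, 8: 1399, 50: 1698}   # Vulkan wall time, ms
ratio       = {1: 0.66, 8: 1.22, 50: 3.17}   # EXLA wall / Vulkan wall
exla_wall   = {d: vulkan_wall[d] * ratio[d] for d in vulkan_wall}

# Linear interpolation of the ratio between d=1 and d=8 puts the
# break-even point (ratio = 1.0) in the mid-single digits.
d_cross = 1 + (8 - 1) * (1.0 - ratio[1]) / (ratio[8] - ratio[1])
```

Under this toy interpolation the break-even lands around d ≈ 5; the exact value depends on how the ratio actually curves between the two measured points, which this sweep did not resolve.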

The surprise

The GT 750M on FreeBSD consistently outperforms the RTX 3060 Ti on Linux in wall-clock time. An older, weaker GPU on a niche operating system beats a GPU eight years its junior on the mainstream operating system that GPU was designed for. That is not what anyone expected and it is not how anyone advertises it. So we instrumented the Vulkan dispatch path with per-fence timing — atomic counters wrapped around vkQueueSubmit and vkWaitForFences — and measured.

| Phase | FreeBSD GT 750M | Linux RTX 3060 Ti | Ratio |
| --- | --- | --- | --- |
| vkQueueSubmit | 11.6 μs | 138 μs | 12× |
| vkWaitForFences | 406 μs | 1,130 μs | 2.8× |
| Command record | 4.3 μs | 19 μs | 4.4× |
| Per-dispatch total | 422 μs | 1,287 μs | 3.1× |
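The per-dispatch totals are just the sum of the three phases, and the wall-time story follows directly from their ratio:

```python
# Measured per-phase costs from the table above, in microseconds.
freebsd = {"submit": 11.6, "fence": 406.0, "record": 4.3}
linux   = {"submit": 138.0, "fence": 1130.0, "record": 19.0}

total_freebsd = sum(freebsd.values())   # 421.9 us -> the table's 422
total_linux   = sum(linux.values())     # 1287.0 us
ratio = total_linux / total_freebsd     # ~3.05 -> the table's 3.1x
```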

The GPU compute is the same speed. The driver overhead is not.

FreeBSD’s NVIDIA Vulkan driver completes fence waits in 406 microseconds. Linux’s NVIDIA driver, on the same version family, takes 1,130 microseconds. The submit call itself is twelve times faster on FreeBSD — not measurably faster, twelve times faster — and the command-record path is more than four times faster. None of this is GPU hardware: these costs are paid on the host, before and after the shader runs, against the same Vulkan API and the same SPIR-V binary. The driver path on FreeBSD goes through the FreeBSD kernel’s fence/sleep mechanism, which has lower latency than Linux’s for short GPU dispatches. For a workload that does thousands of short dispatches per second — which is exactly what MCMC sampling is — this compounds into a 3× wall-time advantage.

There is a temptation, when the unfashionable platform measurably outperforms the fashionable one on a workload everyone assumes the fashionable one owns, to read the result as somehow politically charged. It is not. The proprietary NVIDIA Linux driver is among the finest pieces of software in the industry. It is also tuned for the common case, which is large batches and long-running training loops, not short Bayesian dispatches. The FreeBSD driver, having received less tuning, happens to do the short-dispatch case better. The benchmark workload here is neither common nor what either driver was tuned for, and the surprise is mainly that one of the two drivers handles it better than the other.

The fused-versus-unfused speedup

On FreeBSD GT 750M, same workload, same GPU:

| Path | ms / iter | Speedup |
| --- | --- | --- |
| Unfused (per-op dispatch) | 283 | 1× |
| Fused chain (K=32) | 3.3 | 86.7× |

The chain shader reduces approximately 384 fence waits to roughly 4. At 406 microseconds per fence, that is roughly 155 milliseconds saved per iteration from fence waits alone, and the per-dispatch submit and record costs disappear with them. The remaining 3.3 milliseconds is the actual GPU compute plus four fence waits. The dispatch overhead is everything; amortizing it across K steps is the entire optimization.
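The arithmetic, using the measured 406 μs fence wait:

```python
# Fence-wait amortization on the FreeBSD GT 750M numbers.
fence_us       = 406           # measured vkWaitForFences cost, microseconds
fences_unfused = 12 * 32       # 12 dispatches/step x 32 steps = 384 fence waits
fences_fused   = 4             # the chain path's handful of dispatches

saved_ms = (fences_unfused - fences_fused) * fence_us / 1000   # ~154 ms
```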

K-sweep on the same GPU

Normal(0, 1), d=8:

| K | μs / dispatch | μs / step |
| --- | --- | --- |
| 1 | 553 | 553 |
| 2 | 411 | 206 |
| 4 | 423 | 106 |
| 8 | 421 | 53 |
| 32 | 440 | 13.8 |
| 128 | 619 | 4.8 |

At K=32, per-step cost is 13.8 microseconds. At K=128 it’s 4.8. The dispatch overhead is amortized across K steps; the GPU compute per leapfrog step is sub-microsecond. The fence wait, not the math, is what we are paying for, and the chain shader is the mechanism for paying it once instead of K times.
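A two-parameter amortization model reproduces the shape of the sweep. The overhead value below is the measured per-dispatch total from the timing table; the per-step compute cost is an assumed sub-microsecond figure, not a measurement:

```python
# Illustrative model: each dispatch pays a fixed overhead once, plus a
# small per-step compute cost. Both constants are stated assumptions.
overhead_us = 420   # roughly the measured per-dispatch total on the GT 750M
step_us     = 0.6   # assumed sub-microsecond per-step GPU compute

def per_step_cost(K):
    # Amortized cost per leapfrog step for a chain of K steps.
    return overhead_us / K + step_us

costs = {K: per_step_cost(K) for K in (1, 2, 4, 8, 32, 128)}
```

At K=32 the model gives about 13.7 μs/step, close to the measured 13.8; the 1/K term dominates everything else, which is the whole argument for the chain shader.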

The stack

Elixir (eXMC NUTS sampler)
  ↓ Nx.Defn.Compiler
Nx.Vulkan.Backend (Elixir)
  ↓ Rustler NIF
nx_vulkan_native (Rust)
  ↓ extern "C" FFI
nx_vulkan_shim.cpp (C++)
  ↓ spirit Backend_par_vulkan
Vulkan API (vkQueueSubmit)
  ↓ NVIDIA driver
GPU hardware (SPIR-V compute shader)

Seven layers. The SPIR-V binary is identical on both platforms. The Elixir code is identical. The Rust NIF is identical. The C++ shim is identical. The GLSL source is identical. The only difference between the FreeBSD run and the Linux run is the kernel and the driver underneath the Vulkan API. Every layer above that is the same artifact.

What we proved

Five things, in order of how surprising each one is.

One. FreeBSD plus Vulkan is a viable GPU compute substrate for production Bayesian inference. Not just “it compiles” — it runs two thousand NUTS iterations across five distribution families in roughly one second on a decade-old GPU. The workload is real, the result is reproducible, and the hardware is older than the iPhone with Touch ID.

Two. Fused chain shaders are the right architecture for MCMC on Vulkan. 86.7× speedup over per-op dispatch. The insight is not exotic: GPU compute is cheap, fence waits are expensive, amortize the fence across K steps. The work is in the GLSL.

Three. FreeBSD’s NVIDIA Vulkan driver has 3× lower per-dispatch latency than Linux’s. Not a GPU difference; a driver synchronization difference. Measured, not theorized.

Four. Vulkan beats EXLA at high dimensionality. At d=50, Vulkan is 3.17× faster than EXLA on the same Linux RTX 3060 Ti. The chain shader’s per-thread parallelism scales better than EXLA’s per-call CUDA overhead, and at some point the linear scaling crosses the constant overhead. On the Normal family that point sits somewhere between d=1 and d=8; at d=50 the question is no longer interesting.

Five. The entire stack — from GLSL shader to Elixir `mix test` — works on FreeBSD out of the box. 178 tests, 0 failures. `pkg install vulkan-loader erlang rust && mix compile && mix test`. That is the walkable path. We had been calling it the walkable path before we had the numbers; now we have them.

What this opens up

The point of the original post was that the exit door from the CUDA platform-lock has been there the whole time, and it is called Vulkan, and you can walk through it on whatever operating system you prefer. The point of this post is to say that the door is not just unlocked; it leads somewhere.

A 2013 Mac Pro that you could buy on eBay for two hundred dollars runs Bayesian inference, with eXMC’s NUTS sampler, across five distribution families, faster than a 2021 Linux workstation with a current-generation NVIDIA GPU. The bottleneck on Linux is the NVIDIA Vulkan driver’s per-dispatch latency. The bottleneck on FreeBSD is the speed of light through silicon. They are not the same kind of problem.

What is next, in roughly the order it gets done: a dual-GPU demo across the two FreeBSD Mac Pros, dispatching sampling jobs over distributed Erlang and merging results without an orchestrator; JIT codegen from Nx.Defn expression trees to GLSL, which makes any Nx-using model GPU-accelerated on FreeBSD without writing a single SPIR-V byte by hand; multi-workgroup chain shaders to lift the n ≤ 256 constraint and let high-dimensional models join the party.

None of this requires CUDA. None of this requires Linux. None of this requires that anybody at NVIDIA decide it’s worth their time. The decisions have already been made. The hardware is already there. The shaders are already written. The walkable path under the mountain has now been walked, and the measurements are attached.