The previous post on this site (The GPU That Doesn’t Need CUDA) ended with a sentence that, in retrospect, made a promise. It said the Vulkan compute backend Spirit had grown was the foundation, and that everything else was mechanical assembly. The rest of the iteration would be the Rustler NIF wrapper, the Nx.Backend interface, the integration tests. Mechanical.
That post came out on a Wednesday. By Saturday the wrapper was real, ran on real hardware, passed real tests, and had — in the course of having those things — surfaced two production bugs that had nothing to do with assembly and everything to do with what happens when a vendor specification quietly assumes you know about something the documentation file you actually read forgot to mention. This is the story of those bugs, and of the line in Rust that closed the second one.
The bootstrap was, in fact, mechanical
The repo is called nx_vulkan. The structure is the
predictable shape: an Elixir lib/ with a top-level
module and an Nx.Backend implementation; a Rust
native/ crate that uses Rustler; a c_src/
directory with a flat C ABI shim that translates between Rust’s
extern "C" and Spirit’s C++. The decision to wrap Spirit’s
backend through a C shim rather than try to bind C++ from Rust
directly came in about ninety seconds — bindgen does not handle
C++ namespaces and STL types cleanly; a thirty-line C wrapper does.
The first commit on the new repo was the plan. The second was the project skeleton. The third was the moment the Elixir test runner called into Rust which called into C++ which called into Vulkan which actually returned the device name of the RTX 3060 Ti. By the fifth commit, you could write
iex> t = Nx.tensor([1.0, 2.0, 3.0, 4.0], backend: Nx.Vulkan.Backend)
#Nx.Tensor<
  f32[4]
  Nx.Vulkan.Backend
  [1.0, 2.0, 3.0, 4.0]
>
iex> Nx.add(t, t) |> Nx.to_flat_list()
[2.0, 4.0, 6.0, 8.0]
iex> Nx.exp(t) |> Nx.to_flat_list()
[2.7182, 7.3890, 20.0855, 54.5981]
iex> Nx.sum(t) |> Nx.to_number()
10.0
Thirty-seven tests passed. The seven Vulkan compute shaders Spirit
had grown — elementwise binary, elementwise unary, reductions,
naive matmul, tiled matmul, broadcasting elementwise binary, and
Philox+Box-Muller random — were now reachable from Elixir as
boring named functions. Nx.Defn graphs that used only
those operators dispatched correctly. Anything else — slices,
sorts, conv, autograd — fell back to Nx.BinaryBackend
through Nx’s built-in compatibility shim.
This was, as promised, mechanical. So the next thing to do, on a Saturday afternoon when the build was green and the suite was green, was to break it.
The race
The breakage hunt was a single Elixir script. It did six things:
race a hundred concurrent processes through Nx.Vulkan.add/2;
allocate fifty thousand tiny tensors in a tight loop and watch what
happened to VRAM; scale up tensor size until the upload returned
an error or the box died; benchmark Vulkan against the
BinaryBackend at every meaningful tensor size; produce
a divide-by-zero and inspect what came out of the GPU; and run a
matmul ladder from 128² to 2048² to find where the GFLOPS curve
flattened. The script took about two minutes to write. The first
section took less than thirty seconds to fail.
spirit-vulkan: wait (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
... (about 800 more lines of this)
VkResult=-4 is VK_ERROR_DEVICE_LOST. The
NVIDIA driver had killed the device. Everything submitted after
the first failure also failed, because the device was lost. Once
that happens, the only path back is process exit and a fresh
vk_init next run.
This is, as failures go, a clean one. The hardware did not catch fire. The kernel did not panic. The driver simply marked the context dead and refused further work. But the question of why the driver had marked the context dead — that took a few minutes of reading.
The Vulkan specification has a section on threading semantics that
nobody loves. It is the section that says, of every Vulkan handle,
whether multiple host threads can touch it at the same time. Most
handles are fine — descriptor pools, command pools, buffers, all
permit concurrent access. The exception is VkQueue.
The spec describes VkQueue as externally
synchronized, which is the standards-document way of saying
“if more than one thread calls vkQueueSubmit on
the same queue without your own lock around it, you are off the
map and we make no promises about what happens.”
What had happened, in practice, was that one hundred Elixir
processes had each run their own add through the
NIF; the BEAM scheduler had spread them across all eighty-eight
hardware threads on the host; eighty-eight threads had
simultaneously called vkQueueSubmit on the single
global queue Spirit’s compute backend uses; and the NVIDIA driver,
finding its internal queue state corrupted by the unsynchronized
access, had concluded — quite reasonably — that the safest
response was to mark the device dead and return
DEVICE_LOST until someone restarted.
The Vulkan spec was right. The driver was right. The error code was right. The thing that was wrong was a missing line of Rust:
static SUBMIT_LOCK: Mutex<()> = Mutex::new(());
And one matching let _g = SUBMIT_LOCK.lock()?; at the
top of every NIF that submits. Six functions, twelve added lines,
no other code change. The mutex serializes every host-side
vkQueueSubmit. Vulkan’s threading rule is satisfied.
The same hundred-process race finishes in the time the scheduler
takes to run them sequentially, with all hundred processes
returning :ok and every result correct.
This is not the kind of fix that loses real performance. Spirit’s
submit_and_wait is itself blocking — it submits the
command buffer and then waits on a fence before returning.
Concurrent submits to the same queue do not actually overlap on
the GPU; they are processed in submission order regardless of
whether the host serialized the calls. The mutex costs us only
the wall-clock contention on the host side, and the host side was
already going to wait. What the mutex saves is the device.
Which is the kind of thing you write when you discover you have been violating the specification of the system you are calling, and the system has a sense of humor.
The empty list
The next breakage took longer to find because it looked like
nothing at all. The script did
Nx.Vulkan.divide([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
and downloaded the result and printed it. The expected output was
[Inf, Inf, Inf]. The actual output was [].
The empty list. Not a NaN. Not an infinity. Not an error. The download returned a binary of the correct size; the next line of Elixir tried to decode that binary into a list of floats; and the list it produced was empty.
The shape of the bug was visible to anyone who has spent time parsing IEEE 754 in Erlang. The Elixir-side decoder for downloaded tensors was a straightforward binary comprehension:
for <<x::float-32-native <- bin>> do
  x
end
This is the canonical Elixir way to peel f32 values out of a packed binary. It works for finite floats. It works, beautifully, for forty-seven different Erlang releases going back ten years. It does not, as the Erlang documentation phrases it in language designed to be exactly as helpful as possible, “match values that are not floating-point numbers.” What this means in practice is that the bit patterns IEEE 754 reserves for NaN, +Inf, and -Inf — exponent field all-ones, with mantissa non-zero for NaN and zero for the infinities — fail to bind in the comprehension. The matcher silently drops them. A binary of three NaNs becomes a list of zero floats.
The GPU had done its job. The C++ shim had downloaded the right bytes. The wire was clean. The decoder, written in the most idiomatic Elixir possible, ate non-finite results without comment.
The fix is a function that decodes the raw 32-bit pattern, checks
the IEEE 754 exponent field for the all-ones case, and returns
:nan, :infinity, or
:neg_infinity as atoms. (This matches the convention
the rest of the BEAM ecosystem uses for non-finite floats; Nx
itself returns these atoms from to_number/1 on
infinite tensors.) Twenty lines of code; one passing test; the
divide-by-zero now reads
[:infinity, :neg_infinity, :nan] and the breakage hunt
moves on.
The number
With the mutex in and the decoder fixed, the speed section of the
breakage hunt finally produced its number. The number is what
Nx.Vulkan.Backend buys you over
Nx.BinaryBackend on a single Nx.add:
N=1024 CPU= 1.5 ms GPU= 0.3 ms 4.58×
N=16K CPU= 34.6 ms GPU= 0.5 ms 72.93×
N=256K CPU=569.5 ms GPU= 2.4 ms 235.94×
N=1M CPU= 2.84 s GPU= 1.2 ms 2310.88×
N=4M CPU=11.24 s GPU= 3.8 ms 2994.00×
Three thousand times. At four million elements, the BEAM-pure Elixir loop takes eleven seconds. The same operation on the GPU takes three milliseconds. The number is so absurd it deserves careful framing.
The CPU baseline is Nx.BinaryBackend, which is a
single-threaded list iteration through the BEAM. It is not SIMD.
It is not OpenMP. It is not the optimised C compute kernel that
EXLA’s host backend would use. It is the backend Nx ships by
default because it is the only one that runs on every platform
without a compiler. On any reasonable benchmark, it is the slowest
of the available options.
The fair comparison would be against EXLA on its CPU backend,
which would narrow the gap to something like ten or twenty times.
The fair comparison cannot be performed because EXLA does not
build on FreeBSD, and the entire point of Nx.Vulkan
is that EXLA does not build on FreeBSD. So the comparison
Nx.Vulkan-versus-BinaryBackend is the
honest comparison: what an Elixir program on the platform we
care about actually gets when it switches its default
backend. The answer is, depending on tensor size, between five
and three thousand times faster.
The three-thousand-times datapoint is also where the
Nx.Defn story lights up. A defn function
that does even a few hundred-thousand-element ops will move
from several seconds to a few milliseconds purely by adding
Nx.default_backend(Nx.Vulkan.Backend)
to its setup. No code change in the function. No change to the graph the compiler emits. The same backend the rest of the BEAM ecosystem expects to plug into Nx, plugged in.
The adversarial round
The mutex closed the obvious bug. The decoder closed the obviously-empty bug. The breakage hunt’s second pass — done once both fixes had landed — went after the bugs that were not obvious. It found none.
The pass that mattered was the one that monitored
nvidia-smi across thirty seconds of one-megabyte
allocate-and-free cycles. Six thousand and ninety-nine cycles
later, VRAM utilisation was unchanged from baseline. Not within
some tolerance — exactly the same number of megabytes used. The
ResourceArc<VulkanTensor> in Rust, with its
Drop impl that calls into the C shim’s
nxv_buf_free which calls into Spirit’s
buf_free which calls into the Vulkan loader’s
vkFreeMemory, was tight enough that six thousand
allocations did not leak so much as a megabyte.
The pass that mattered next was the one that spawned two hundred
BEAM processes, each holding a ten-megabyte GPU tensor, and then
killed each process mid-tensor with
Process.exit(pid, :kill). This is the kind of
operation that, in a system written in C, would leak two gigabytes
of GPU memory in fifteen seconds. In nx_vulkan, two
gigabytes of allocations across two hundred dying processes
produced a VRAM delta of exactly zero megabytes. Erlang’s
process-exit GC found the resource references; Rust’s
Drop impl ran; the GPU buffer was released; the
device kept count.
This is what BEAM-side GPU bindings should look like, and it is
not what they typically do. The combination of Rustler’s
ResourceArc and Rust’s Drop trait
is not novel. What is novel — for those who have spent any time
in the OpenCL or CUDA-via-FFI literature — is how clean the
interaction is when both sides do their job. The Erlang VM
guarantees that resources outlive their references. The Rust type
system guarantees that destructors run when references go out of
scope. The combination guarantees that GPU buffers cannot leak
without simultaneously breaking either Erlang’s memory model
or Rust’s ownership rules. The combination, in practice,
does not break.
The other adversarial passes were anti-climactic. Pushing VRAM
toward the eight-gigabyte ceiling on the 3060 Ti returned
{:error, :alloc_failed} at three and a half gigabytes
of held buffers, with the GPU staying alive across the failure.
Off-by-one workgroup boundaries (N=255, N=256, N=257, N=511,
N=512, N=513, ...) all round-tripped correctly. Wrong-shape
matmul calls — claiming a 1000×1000 product from four-element
inputs — returned {:error, :size_mismatch} from the
Rust-side check before reaching the GPU. The empty tensor returned
a clean error. The ten-thousand-element matmul returned the right
answer.
Forty-two tests pass.
What this is
nx_vulkan at v0.0.7 is a working tensor backend for
Nx, on Vulkan, on FreeBSD, with f32 elementwise math and
transcendentals and reductions and naive-and-tiled matmul and
Philox-with-Box-Muller random. Its surface is small. The full
Nx.Backend behaviour has roughly seventy callbacks;
this implementation covers the twenty or so that map to Spirit’s
seven shaders. The remaining fifty fall back to
Nx.BinaryBackend automatically and produce a
deprecation warning at compile time, which is the right shape for
a partial backend.
What it is not is feature-complete. It does not do autograd. It does not do mixed precision. It does not do per-axis reductions, only all-axis. It does not do convolution. It does not do sorts. The roadmap to v0.1 closes some of these gaps; the roadmap to v1.0 closes the rest. None of them are interesting relative to the question this post is actually about.
The question this post is about is whether GPU compute on FreeBSD, via Vulkan, via Spirit, via Rustler, into Nx, ends up being a real tool or a demonstration. The bench numbers say something. The lifetime tests say more. A backend that does not leak under thirty seconds of churn, that does not crash when processes die mid-tensor, that does not destroy the device under contention — that is closer to a tool than a demonstration. Three thousand times faster than the reference baseline at the size where speedups matter.
Which means the next thing to write is not more shaders. The next
thing to write is the Hex package release notes, the documentation,
the migration guide for the trader who will switch
Application.put_env(:nx, :default_backend, ...) from
{Nx.BinaryBackend, []} to
{Nx.Vulkan.Backend, []} and want to know what changed.
The Spirit Vulkan substrate has a tensor language on top of it.
The tensor language has a wrapper. The wrapper survives adversarial
testing. The thing now is what the people who will use it can do
with it.
Coda
The first post in this series said that Vulkan was the door. The
second said the door was unlocked. This one says: we have walked
through the door, looked at the room, and the room is the size we
needed it to be. Three thousand times the speed at four million
elements. Zero leaked megabytes after thirty seconds of churn.
Zero leaked megabytes after two hundred process deaths. Five days
from mix new to a green Hex-shaped repository on a
NAS at 192.168.0.33.
The two bugs we found were both interesting. The race was interesting because the Vulkan specification was right and its threading rule was clear and our code violated it without any of us noticing for five days. The decoder was interesting because the most idiomatic Elixir construction in the language silently dropped the most important class of edge values.
What both bugs share is the shape of error CUDA programmers do
not have to think about. CUDA hides queue management; the
specification does not say externally synchronized
because CUDA does not expose a queue at all. CUDA tensors do not
need an Elixir-side decoder because their CPython bindings
returned numpy arrays decades ago. The mistakes that
nx_vulkan made are mistakes about Vulkan’s
contract and Erlang’s pattern matcher specifically — the
two things between us and the GPU. They were not mistakes about
GPU programming. They were mistakes about the seam.
The seam is also the part nobody else has built. EXLA does not
build on FreeBSD. EMLX does not build on anything except Apple
Silicon. There has been, until this week, no Nx tensor backend
on FreeBSD, on AMD GPUs, on Intel iGPUs, on every machine that
ships a Vulkan loader and a driver and a couple of GB of VRAM.
There is now. It is called nx_vulkan. It is at
github.com/borodark/nx_vulkan. It survives 1500 lines of test code
and a thirty-second VRAM-watch and two hundred BEAM process
deaths. The mutex saved the GPU. What the mutex bought is a tool.