The previous post on this site (The GPU That Doesn’t Need CUDA) ended with a sentence that, in retrospect, made a promise. It said the Vulkan compute backend Spirit had grown was the foundation, and that everything else was mechanical assembly. The rest of the iteration would be the Rustler NIF wrapper, the Nx.Backend interface, the integration tests. Mechanical.
That post came out on a Wednesday. By Saturday the wrapper was real, ran on real hardware, passed real tests, and had — in the course of having those things — surfaced two production bugs that had nothing to do with assembly and everything to do with what happens when a vendor specification quietly assumes you know about something the documentation file you actually read forgot to mention. This is the story of those bugs, and of the line in Rust that closed the second one.
The bootstrap was, in fact, mechanical
The repo is called nx_vulkan. The structure is the
predictable shape: an Elixir lib/ with a top-level
module and an Nx.Backend implementation; a Rust
native/ crate that uses Rustler; a c_src/
directory with a flat C ABI shim that translates between Rust’s
extern "C" and Spirit’s C++. The decision to wrap Spirit’s
backend through a C shim rather than try to bind C++ from Rust
directly came in about ninety seconds — bindgen does not handle
C++ namespaces and STL types cleanly; a thirty-line C wrapper does.
The first commit on the new repo was the plan. The second was the project skeleton. The third was the moment the Elixir test runner called into Rust which called into C++ which called into Vulkan which actually returned the device name of the RTX 3060 Ti. By the fifth commit, you could write
iex> t = Nx.tensor([1.0, 2.0, 3.0, 4.0], backend: Nx.Vulkan.Backend)
#Nx.Tensor<
  f32[4]
  Nx.Vulkan.Backend
  [1.0, 2.0, 3.0, 4.0]
>
iex> Nx.add(t, t) |> Nx.to_flat_list()
[2.0, 4.0, 6.0, 8.0]
iex> Nx.exp(t) |> Nx.to_flat_list()
[2.7182, 7.3890, 20.0855, 54.5981]
iex> Nx.sum(t) |> Nx.to_number()
10.0
Thirty-seven tests passed. The seven Vulkan compute shaders Spirit
had grown — elementwise binary, elementwise unary, reductions,
naive matmul, tiled matmul, broadcasting elementwise binary, and
Philox+Box-Muller random — were now reachable from Elixir as
boring named functions. Nx.Defn graphs that used only
those operators dispatched correctly. Anything else — slices,
sorts, conv, autograd — fell back to Nx.BinaryBackend
through Nx’s built-in compatibility shim.
This was, as promised, mechanical. So the next thing to do, on a Saturday afternoon when the build was green and the suite was green, was to break it.
The race
The breakage hunt was a single Elixir script. It did six things:
race a hundred concurrent processes through Nx.Vulkan.add/2;
allocate fifty thousand tiny tensors in a tight loop and watch what
happened to VRAM; scale up tensor size until the upload returned
an error or the box died; benchmark Vulkan against the
BinaryBackend at every meaningful tensor size; produce
a divide-by-zero and inspect what came out of the GPU; and run a
matmul ladder from 128² to 2048² to find where the GFLOPS curve
flattened. The script took about two minutes to write. The first
section took less than thirty seconds to fail.
spirit-vulkan: wait (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
spirit-vulkan: submit (VkResult=-4)
... (about 800 more lines of this)
VkResult=-4 is VK_ERROR_DEVICE_LOST. The
NVIDIA driver had killed the device. Everything submitted after
the first failure also failed, because the device was lost. Once
that happens, the only path back is process exit and a fresh
vk_init next run.
This is, as failures go, a clean one. The hardware did not catch fire. The kernel did not panic. The driver simply marked the context dead and refused further work. But the question of why the driver had marked the context dead — that took a few minutes of reading.
The Vulkan specification has a section on threading semantics that
nobody loves. It is the section that says, of every Vulkan handle,
whether multiple host threads can touch it at the same time. Most
handles are fine — descriptor pools, command pools, buffers, all
permit concurrent access. The exception is VkQueue.
The spec describes VkQueue as externally
synchronized, which is the standards-document way of saying
“if more than one thread calls vkQueueSubmit on
the same queue without your own lock around it, you are off the
map and we make no promises about what happens.”
What had happened, in practice, was that one hundred Elixir
processes had each run their own add through the
NIF; the BEAM scheduler had spread them across all eighty-eight
hardware threads on the host; eighty-eight threads had
simultaneously called vkQueueSubmit on the single
global queue Spirit’s compute backend uses; and the NVIDIA driver,
finding its internal queue state corrupted by the unsynchronized
access, had concluded — quite reasonably — that the safest
response was to mark the device dead and return
DEVICE_LOST until someone restarted.
The Vulkan spec was right. The driver was right. The error code was right. The thing that was wrong was a missing line of Rust:
static SUBMIT_LOCK: Mutex<()> = Mutex::new(());
And one matching let _g = SUBMIT_LOCK.lock()?; at the
top of every NIF that submits. Six functions, twelve added lines,
no other code change. The mutex serializes every host-side
vkQueueSubmit. Vulkan’s threading rule is satisfied.
The same hundred-process race finishes in the time the scheduler
takes to run them sequentially, with all hundred processes
returning :ok and every result correct.
This is not the kind of fix that loses real performance. Spirit’s
submit_and_wait is itself blocking — it submits the
command buffer and then waits on a fence before returning.
Concurrent submits to the same queue do not actually overlap on
the GPU; they are processed in submission order regardless of
whether the host serialized the calls. The mutex costs us only
the wall-clock contention on the host side, and the host side was
already going to wait. What the mutex saves is the device.
Which is the kind of thing you write when you discover you have been violating the specification of the system you are calling, and the system has a sense of humor.
The empty list
The next breakage took longer to find because it looked like
nothing at all. The script did
Nx.Vulkan.divide([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
and downloaded the result and printed it. The expected output was
[Inf, Inf, Inf]. The actual output was [].
The empty list. Not a NaN. Not an infinity. Not an error. The download returned a binary of the correct size; the next line of Elixir tried to decode that binary into a list of floats; and the list it produced was empty.
The shape of the bug was visible to anyone who has spent time parsing IEEE 754 in Erlang. The Elixir-side decoder for downloaded tensors was a straightforward binary comprehension:
for <<x::float-32-native <- bin>> do
  x
end
This is the canonical Elixir way to peel f32 values out of a packed binary. It works for finite floats. It works, beautifully, for forty-seven different Erlang releases going back ten years. It does not, as the Erlang documentation phrases it in language designed to be exactly as helpful as possible, “match values that are not floating-point numbers.” What this means in practice is that the bit patterns IEEE 754 reserves for NaN, +Inf, and -Inf — exponent field all-ones, with mantissa non-zero for NaN and zero for the infinities — fail to bind in the comprehension. The matcher silently drops them. A binary of three NaNs becomes a list of zero floats.
The GPU had done its job. The C++ shim had downloaded the right bytes. The wire was clean. The decoder, written in the most idiomatic Elixir possible, ate non-finite results without comment.
The fix is a function that decodes the raw 32-bit pattern, checks
the IEEE 754 exponent field for the all-ones case, and returns
:nan, :infinity, or
:neg_infinity as atoms. (This matches the convention
the rest of the BEAM ecosystem uses for non-finite floats; Nx
itself returns these atoms from to_number/1 on
infinite tensors.) Twenty lines of code; one passing test; the
divide-by-zero now reads
[:infinity, :neg_infinity, :nan] and the breakage hunt
moves on.
The number
With the mutex in and the decoder fixed, the speed section of the
breakage hunt finally produced its number. The number is what
Nx.Vulkan.Backend buys you over
Nx.BinaryBackend on a single Nx.add:
N=1024 CPU= 1.5 ms GPU= 0.3 ms 4.58×
N=16K CPU= 34.6 ms GPU= 0.5 ms 72.93×
N=256K CPU=569.5 ms GPU= 2.4 ms 235.94×
N=1M CPU= 2.84 s GPU= 1.2 ms 2310.88×
N=4M CPU=11.24 s GPU= 3.8 ms 2994.00×
Three thousand times. At four million elements, the BEAM-pure Elixir loop takes eleven seconds. The same operation on the GPU takes three milliseconds. The number is so absurd it deserves careful framing.
The CPU baseline is Nx.BinaryBackend, which is a
single-threaded list iteration through the BEAM. It is not SIMD.
It is not OpenMP. It is not the optimised C compute kernel that
EXLA’s host backend would use. It is the backend Nx ships by
default because it is the only one that runs on every platform
without a compiler. On any reasonable benchmark, it is the slowest
of the available options.
The fair comparison would be against EXLA on its CPU backend,
which would narrow the gap to something like ten or twenty times.
The fair comparison cannot be performed because EXLA does not
build on FreeBSD, and the entire point of Nx.Vulkan
is that EXLA does not build on FreeBSD. So the comparison
Nx.Vulkan-versus-BinaryBackend is the
honest comparison: what an Elixir program on the platform we
care about actually gets when it switches its default
backend. The answer is, depending on tensor size, between five
and three thousand times faster.
The three-thousand-times datapoint is also where the
Nx.Defn story lights up. A defn function
that does even a few hundred-thousand-element ops will move
from several seconds to a few milliseconds purely by adding
Nx.default_backend(Nx.Vulkan.Backend)
to its setup. No code change in the function. No change to the graph the compiler emits. The same backend the rest of the BEAM ecosystem expects to plug into Nx, plugged in.
The adversarial round
The mutex closed the obvious bug. The decoder closed the obviously-empty bug. The breakage hunt’s second pass — done once both fixes had landed — went after the bugs that were not obvious. It found none.
The pass that mattered was the one that monitored
nvidia-smi across thirty seconds of one-megabyte
allocate-and-free cycles. Six thousand and ninety-nine cycles
later, VRAM utilisation was unchanged from baseline. Not within
some tolerance — exactly the same number of megabytes used. The
ResourceArc<VulkanTensor> in Rust, with its
Drop impl that calls into the C shim’s
nxv_buf_free which calls into Spirit’s
buf_free which calls into the Vulkan loader’s
vkFreeMemory, was tight enough that six thousand
allocations did not leak so much as a megabyte.
The pass that mattered next was the one that spawned two hundred
BEAM processes, each holding a ten-megabyte GPU tensor, and then
killed each process mid-tensor with
Process.exit(pid, :kill). This is the kind of
operation that, in a system written in C, would leak two gigabytes
of GPU memory in fifteen seconds. In nx_vulkan, two
gigabytes of allocations across two hundred dying processes
produced a VRAM delta of exactly zero megabytes. Erlang’s
process-exit GC found the resource references; Rust’s
Drop impl ran; the GPU buffer was released; the
device kept count.
This is what BEAM-side GPU bindings should look like, and it is
not what they typically do. The combination of Rustler’s
ResourceArc and Rust’s Drop trait
is not novel. What is novel — for those who have spent any time
in the OpenCL or CUDA-via-FFI literature — is how clean the
interaction is when both sides do their job. The Erlang VM
guarantees that resources outlive their references. The Rust type
system guarantees that destructors run when references go out of
scope. The combination guarantees that GPU buffers cannot leak
without simultaneously breaking either Erlang’s memory model
or Rust’s ownership rules. The combination, in practice,
does not break.
The other adversarial passes were anti-climactic. Pushing VRAM
toward the eight-gigabyte ceiling on the 3060 Ti returned
{:error, :alloc_failed} at three and a half gigabytes
of held buffers, with the GPU staying alive across the failure.
Off-by-one workgroup boundaries (N=255, N=256, N=257, N=511,
N=512, N=513, ...) all round-tripped correctly. Wrong-shape
matmul calls — claiming a 1000×1000 product from four-element
inputs — returned {:error, :size_mismatch} from the
Rust-side check before reaching the GPU. The empty tensor returned
a clean error. The ten-thousand-element matmul returned the right
answer.
Forty-two tests pass.
What this is
nx_vulkan at v0.0.7 is a working tensor backend for
Nx, on Vulkan, on FreeBSD, with f32 elementwise math and
transcendentals and reductions and naive-and-tiled matmul and
Philox-with-Box-Muller random. Its surface is small. The full
Nx.Backend behaviour has roughly seventy callbacks;
this implementation covers the twenty or so that map to Spirit’s
seven shaders. The remaining fifty fall back to
Nx.BinaryBackend automatically and produce a
deprecation warning at compile time, which is the right shape for
a partial backend.
What it is not is feature-complete. It does not do autograd. It does not do mixed precision. It does not do per-axis reductions, only all-axis. It does not do convolution. It does not do sorts. The roadmap to v0.1 closes some of these gaps; the roadmap to v1.0 closes the rest. None of them are interesting relative to the question this post is actually about.
The question this post is about is whether GPU compute on FreeBSD, via Vulkan, via Spirit, via Rustler, into Nx, ends up being a real tool or a demonstration. The bench numbers say something. The lifetime tests say more. A backend that does not leak under thirty seconds of churn, that does not crash when processes die mid-tensor, that does not destroy the device under contention — that is closer to a tool than a demonstration. Three thousand times faster than the reference baseline at the size where speedups matter.
Which means the next thing to write is not more shaders. The next
thing to write is the Hex package release notes, the documentation,
the migration guide for the trader who will switch
Application.put_env(:nx, :default_backend, ...) from
{Nx.BinaryBackend, []} to
{Nx.Vulkan.Backend, []} and want to know what changed.
The Spirit Vulkan substrate has a tensor language on top of it.
The tensor language has a wrapper. The wrapper survives adversarial
testing. The thing now is what the people who will use it can do
with it.
Coda
The first post in this series said that Vulkan was the door. The
second said the door was unlocked. This one says: we have walked
through the door, looked at the room, and the room is the size we
needed it to be. Three thousand times the speed at four million
elements. Zero leaked megabytes after thirty seconds of churn.
Zero leaked megabytes after two hundred process deaths. Five days
from mix new to a green Hex-shaped repository on a
NAS at 192.168.0.33.
The two bugs we found were both interesting. The race was interesting because the Vulkan specification was right and its threading rule was clear and our code violated it without any of us noticing for five days. The decoder was interesting because the most idiomatic Elixir construction in the language silently dropped the most important class of edge values.
What both bugs share is the shape of error CUDA programmers do
not have to think about. CUDA hides queue management; the
specification does not say externally synchronized
because CUDA does not expose a queue at all. CUDA tensors do not
need an Elixir-side decoder because their CPython bindings
returned numpy arrays decades ago. The mistakes that
nx_vulkan made are mistakes about Vulkan’s
contract and Erlang’s pattern matcher specifically — the
two things between us and the GPU. They were not mistakes about
GPU programming. They were mistakes about the seam.
The seam is also the part nobody else has built. EXLA does not
build on FreeBSD. EMLX does not build on anything except Apple
Silicon. There has been, until this week, no Nx tensor backend
on FreeBSD, on AMD GPUs, on Intel iGPUs, on every machine that
ships a Vulkan loader and a driver and a couple of GB of VRAM.
There is now. It is called nx_vulkan. It is at
github.com/borodark/nx_vulkan. It survives 1500 lines of test code
and a thirty-second VRAM-watch and two hundred BEAM process
deaths. The mutex saved the GPU. What the mutex bought is a tool.