On a batch-1 decode of a two-billion-parameter language model, an RTX 3060 turns in 3.67 joules per token. A six-year-old desktop CPU does 4.62 — almost as well. For a 400-dollar GPU against a commodity processor, that is a rout that isn’t: the GPU is barely winning, and it is actually slower (23.5 tokens/second versus the CPU’s 28.4).

The reason is the punchline of this whole post. The model’s weights are ternary — every number is $-1$, $0$, or $+1$ — and the GPU has no idea what to do with them. It dequantizes those 1.58-bit weights back into 16-bit floats, fills 4.87 GB of memory it didn’t need to, and runs its tensor cores at a rounding-error of their potential. The 1.58-bit representation, the whole point of the model, evaporates the moment it hits the hardware.

A 130-dollar FPGA does not throw that away. This post is the story of building one that doesn’t — ternfpga, a multiply-free ternary LLM-inference engine on a Xilinx Arty A7-35T, benchmarked head-to-head against that same RTX 3060 in the same machine. By the end, the FPGA system does the same work at an estimated 1.6 joules per token — roughly 2.3× less energy than the GPU — by refusing to do the one thing everyone assumes a neural network must do: multiply.

I am going to be exact about the word estimated. This is a hardware project, and hardware invites overclaiming the way optimization benchmarks do. So every number below carries a tag: silicon-measured (I read it off the board), derived (composed from silicon-measured primitives), or projected (a forward-looking estimate). The headline energy number is derived; three of its four ingredients are silicon-measured. I’ll show you which is which, and there’s a full ledger near the end.

What This Post Covers

This post assumes you know roughly what a neural network and a matrix multiply are. It does not assume you know what an FPGA, a DSP slice, or a roofline is — we’ll build those up. If you’re a hardware person, skim Part I.

A preview of the result, for the impatient:

[Figure: Energy per token, BitNet-2B-4T, batch-1 decode. FPGA system energy as transformer "glue" moves onto the fabric, one stage at a time; y-axis in joules/token (0 to 5), lower is better. Bars: host-split (naive) 4.32, 1.2× worse; + on-fabric attention 1.99, 1.8× under; + on-fabric FFN glue 1.62, 2.3× under; engine bound 1.47, 2.5× under (projected). Reference lines: CPU 5950X 4.62; RTX 3060 3.67 J/token, the line to beat.]

The arc of the whole project in one figure. The naive design (left, red) loses to the GPU by 1.2×. Each subsequent bar moves a piece of non-ternary "glue" computation from the slow host CPU onto the FPGA fabric; the system drops below the GPU line and keeps falling toward the engine's own 1.47 J/token floor. The first three FPGA bars are derived from silicon-measured cycle counts; the rightmost is a projection.

The shape of that figure (a bad start, a decisive correction, and a steady descent toward a hard floor) is the project. But to see why any of the bars are where they are, we have to start with a fact about how language models actually run.


Part I — Why a 130-Dollar Board Can Win

Decode is a memory problem, not a math problem

When a language model generates text, it does so one token at a time, and each token requires a full pass over the model’s weights. In the batch-1 case — one user, one stream, the dominant case for local and edge inference — there is no batching to amortize anything. To produce a single token, the hardware reads every weight in the model from memory exactly once and does a couple of arithmetic operations with it.

That ratio — arithmetic operations per byte read from memory — is called arithmetic intensity, and for batch-1 decode it is brutally low. A weight is fetched, multiplied, accumulated, and discarded. Two floating-point operations for every two bytes of FP16 weight: an intensity around 1 FLOP/byte. Modern accelerators are built for the opposite regime — training and large-batch serving, where each weight gets reused across hundreds of examples and intensity is in the hundreds.

This is what a roofline model makes visible:

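[Figure: roofline diagram. x-axis: arithmetic intensity (operations per byte); y-axis: attainable throughput. A rising memory-bandwidth roof meets a flat compute roof (tensor cores), splitting the plot into memory-bound and compute-bound regions. Batch-1 decode (one token = stream every weight once, ≈ 1 FLOP/byte) sits far left, with all the compute above it idle; training and large-batch work reuse weights and sit in the compute-bound region.]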

A roofline diagram. Performance is capped either by memory bandwidth (the rising diagonal) or by raw compute (the flat ceiling), whichever is lower at a given arithmetic intensity. Batch-1 decode lives far to the left, pinned to the bandwidth roof — the GPU's enormous compute ceiling is irrelevant, because the work is gated entirely by how fast weights arrive from memory.

This is the crack in the GPU’s armor. When you are decode-bound, the expensive part of the GPU — the tensor cores — is idle, waiting on memory. You are paying for, and powering, silicon you cannot use. What matters is bytes-per-second of weight traffic and the energy spent moving them. And that is a contest a small, low-power device can enter — provided it can cut the bytes.
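The roofline cap is a one-line formula, and a small sketch makes the gap concrete. The accelerator below is hypothetical: the 100 TFLOP/s peak and 400 GB/s bandwidth are illustrative placeholders, not measurements of the RTX 3060 or of any board in this post.

```python
def attainable_flops(intensity_flop_per_byte, peak_flops, bandwidth_bytes_per_s):
    """Roofline: throughput is capped by compute or by bandwidth, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flop_per_byte)

# Hypothetical accelerator, chosen only for illustration.
PEAK_FLOPS = 100e12   # compute roof (assumed)
BANDWIDTH = 400e9     # memory bandwidth in bytes/second (assumed)

# Batch-1 decode: about 1 FLOP per byte of weight traffic.
decode = attainable_flops(1.0, PEAK_FLOPS, BANDWIDTH)
# Large-batch work: heavy weight reuse, intensity in the hundreds.
large_batch = attainable_flops(500.0, PEAK_FLOPS, BANDWIDTH)

print(f"decode:      {decode / PEAK_FLOPS:.1%} of peak compute")
print(f"large batch: {large_batch / PEAK_FLOPS:.1%} of peak compute")
```

Under these assumed numbers, decode reaches well under one percent of the compute roof, which is the sense in which the tensor cores sit idle.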

There are two ways to cut the bytes that a GPU structurally cannot follow. The whole project is built on them.

Escape 1: stop multiplying

The first escape is the model itself. BitNet b1.58 is a family of large language models whose weights are constrained, during training, to just three values: $-1$, $0$, and $+1$. “b1.58” is the information content of one such weight: $\log_2 3 \approx 1.58$ bits, against the 16 bits of an FP16 weight or the 8 of INT8. The activations stay in normal integer precision; only the weights are ternary. Remarkably, this barely dents quality — the 2-billion-parameter BitNet-2B-4T is competitive with same-size FP16 models — because the network is trained in this regime rather than quantized after the fact.

Here is the part that matters for hardware. A matrix-multiply is a sea of multiply-accumulates, $\text{acc} \mathrel{+}= w \cdot a$. When $w \in \lbrace -1, 0, +1 \rbrace$, that multiply is not a multiply at all:

\[w \cdot a \;=\; \begin{cases} +a & w = +1 \\ \phantom{+}0 & w = 0 \\ -a & w = -1 \end{cases}\]

There is nothing to multiply. You either pass the activation through, zero it, or negate it. In digital logic that is a three-way select — a few gates, a single small lookup table — and it is the entire arithmetic core of the engine:

[Figure: the sign/zero select. An int8 activation a and a 2-bit ternary weight w feed a select (w = +1 → +a, w = 0 → 0, w = −1 → −a) whose output accumulates into an int32 sum. One 6-LUT, zero DSP; no hardware multiplier touched.]

The entire "multiply" in a ternary network: a sign-and-zero select, implementable as a single six-input lookup table. The chip's dedicated hardware multipliers are never touched. A GPU, by contrast, has no ternary datapath — it must expand each weight back into an INT8 or FP16 value and run a real multiply through its arithmetic units.
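As a software analogue of the sign-and-zero select, here is a minimal Python sketch. It is not the project's golden model or RTL; the function names are made up for illustration. It checks a select-and-add dot product against an ordinary multiply-accumulate on random int8 activations.

```python
import random

def ternary_dot(weights, activations):
    """Dot product with weights in {-1, 0, +1}, using only select, add, subtract."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a      # pass the activation through
        elif w == -1:
            acc -= a      # negate it
        # w == 0: the lane contributes nothing
    return acc

def multiply_dot(weights, activations):
    """Ordinary multiply-accumulate, used here only as a reference."""
    return sum(w * a for w, a in zip(weights, activations))

rng = random.Random(0)
mismatches = 0
for _ in range(2000):
    w = [rng.choice((-1, 0, 1)) for _ in range(8)]
    a = [rng.randint(-128, 127) for _ in range(8)]
    if ternary_dot(w, a) != multiply_dot(w, a):
        mismatches += 1

print("mismatches:", mismatches)
```

The branch on `w` is the whole "multiplier"; in hardware the same three-way choice is a small lookup table per lane followed by an adder tree.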

The bytes-saved story compounds with a packing trick. Three ternary values have $3^3 = 27$ combinations; five have $3^5 = 243$, which still fits in a single byte ($243 < 256$). So you can store five ternary weights per byte — 1.6 bits each, within a whisker of the 1.585-bit theoretical optimum, and 5× denser than INT8. Since decode is bandwidth-bound, that packing is a direct, multiplicative cut to the traffic that bottlenecks everything. A small combinational decoder turns one byte back into five weight codes on the fly, feeding the sign-select lanes straight from a memory burst.
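The five-per-byte packing can be sketched as base-3 digits. The digit order and the mapping of −1, 0, +1 to digits 0, 1, 2 are assumptions made for this example; the post does not specify the on-disk layout.

```python
from itertools import product

def pack5(trits):
    """Pack five values from {-1, 0, +1} into one byte (0..242) as base-3 digits."""
    assert len(trits) == 5
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)   # assumed mapping: -1, 0, +1 -> digits 0, 1, 2
    return byte

def unpack5(byte):
    """Inverse of pack5: recover the five ternary values, first trit first."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits

# Exhaustive check over all 3**5 = 243 combinations.
all_trits = list(product((-1, 0, 1), repeat=5))
codes = sorted(pack5(list(t)) for t in all_trits)
round_trips = all(unpack5(pack5(list(t))) == list(t) for t in all_trits)

print(len(codes), "codes, max", max(codes), "round-trip ok:", round_trips)
```

Every combination maps to a distinct value below 256, so one byte is enough, and 13 of the 256 byte values are simply unused.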

The GPU cannot follow here. Its tensor cores have no $\lbrace -1,0,+1 \rbrace$ mode; the most efficient thing it can do is dequantize the ternary weights up to INT8 or FP16 and run ordinary multiplies. It pays the full memory traffic of the wide format and the full energy of real multipliers — for weights that carry 1.58 bits of information. That is exactly why, in the benchmark that opens this post, the 3060 barely beats a CPU: it is doing the expensive version of an inexpensive problem.

Escape 2: stop fetching zeros

The second escape is activation sparsity. BitNet’s feed-forward blocks use a squared-ReLU nonlinearity, which forces a large fraction of intermediate activations to exactly zero on every token. When an activation is zero, the entire column of weights it would have multiplied never needs to be fetched — those bytes contribute nothing to the result.

I measured this on BitNet-2B-4T directly, hooking all thirty down_proj layers over diverse text: 59.8% of activations are zero per token (ranging 42–79% by depth). That is below the 85–95% that relu-fied models like ProSparse reach — I’ll return to that honesty point later — but it is real, and it is a ~2.5× reduction in down_proj weight traffic that comes for free with the model.

[Figure: activation-sparse gather. An activation vector h (about 60% exactly zero per token), for example 3 0 0 −2 0 5 0 0, is compacted to its nonzeros 3, −2, 5, and only the matching weight columns are fetched from DRAM; the rest are skipped entirely.]

Activation-sparse gather. The zeros in the activation vector mean the corresponding weight columns are never read from memory. At the measured 60% sparsity, down_proj fetches ~44% of the dense bytes — bit-exact, because skipping a column that gets multiplied by zero changes nothing.
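A minimal sketch of why the gather is bit-exact, using the eight-element activation vector from the figure. The weight matrix is random ternary data invented for the example, stored column by column so that "fetching a column" is one list lookup.

```python
import random

def dense_gemv(columns, h):
    """y = W @ h, reading every column of W."""
    rows = len(columns[0])
    y = [0] * rows
    for col, a in zip(columns, h):
        for r in range(rows):
            y[r] += col[r] * a
    return y

def sparse_gemv(columns, h):
    """Same result, but columns with a zero activation are never read."""
    rows = len(columns[0])
    y = [0] * rows
    fetched = 0
    for j, a in enumerate(h):
        if a == 0:
            continue              # skip: this column would be multiplied by zero
        col = columns[j]          # the only "memory fetch"
        fetched += 1
        for r in range(rows):
            y[r] += col[r] * a
    return y, fetched

rng = random.Random(0)
h = [3, 0, 0, -2, 0, 5, 0, 0]
columns = [[rng.choice((-1, 0, 1)) for _ in range(4)] for _ in h]

y_dense = dense_gemv(columns, h)
y_sparse, fetched = sparse_gemv(columns, h)
print("identical:", y_dense == y_sparse, "| columns fetched:", fetched, "of", len(h))
```

The saving is in the skipped reads, not the skipped arithmetic, which is why it matters for a bandwidth-bound workload.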

Once again, the GPU cannot follow. Its hardware accelerates only 2:4 structured sparsity — a rigid pattern where exactly two of every four weights are zero, fixed at model-compression time. What BitNet produces is unstructured and per-token: a different ~60% of activations are zero on every single token, in no fixed pattern. A GPU faced with this either ignores it (and fetches everything) or pays so much overhead chasing irregular indices that it loses. An FPGA, whose datapath you design yourself, can build a gather that issues a memory read only for the nonzero columns — and skips the rest at full speed.

The FPGA’s unfair advantage: zero DSPs, sub-watt

So far this is an argument about bytes. The reason it converts into an energy win is the FPGA itself.

A field-programmable gate array is a sheet of reconfigurable logic: hundreds of thousands of lookup tables (LUTs) and flip-flops you wire into whatever digital circuit you want, plus a few hundred dedicated hardware multipliers called DSP slices. On the Arty A7-35T — the cheapest useful Artix-7 board, around 130 dollars — there are about 20,800 LUTs, 41,600 flip-flops, 90 DSP slices, and 50 small blocks of on-chip memory (BRAM).

A conventional matrix engine spends those 90 DSP slices doing multiplies, and 90 multiplies is not very many — it is the heart of why a small FPGA loses on raw throughput. But a ternary engine does not multiply. The sign-select core synthesizes entirely into LUTs and the chip’s carry chains; I confirmed in synthesis that it uses zero DSP slices all the way up to a 2048-wide datapath. All 90 multipliers sit unused. The engine is not competing for the scarce resource; it sidesteps it.

And it does so at a power level a GPU cannot approach. The whole system-on-chip — the ternary engine, a RISC-V CPU, and the DDR3 memory controller — draws an estimated 0.489 W. The ternary engine alone is about 0.06 W. The RTX 3060, doing the same decode, measured 86.4 W. That is a ~175× power gap at the system level, and it is the entire denominator of “energy per token.” Even running far slower, a device that sips power while cutting memory traffic can spend less energy getting to the same token.

Let me be honest about the other side of the ledger, because it matters: the FPGA loses on raw throughput, by design and by a lot. With ~280× less memory bandwidth than the 3060, it will generate tokens more slowly. This project never claims a “40× faster” headline — that would be dishonest. The claim is narrower and, I think, more interesting: on energy per token, batch-1 latency, and a capability the GPU lacks (native ternary, per-token unstructured sparsity), a 130-dollar board can beat a 400-dollar GPU. Those are the axes that matter at the edge, where there is no datacenter to batch your requests and the power budget is a wall, not a line item.

[Figure: The honest tradeoff, energy vs throughput, batch-1 decode, BitNet-2B-4T. x-axis: throughput (tokens/second, log scale, 0.3 to 30); y-axis: energy (J/token, 0 to 5, lower is better). Points: FPGA (Arty A7) 1.62 J/tok, derived, sub-watt; RTX 3060 3.67 J/tok, 86 W, measured; CPU 5950X 4.62 J/tok, 121 W. Annotation: widening K moves the FPGA to ~3 tok/s at the same energy.]

The whole thesis on two axes. The CPU and GPU sit top-right — high throughput, high energy. The FPGA sits low (less than half the GPU's joules per token) and far left (slow). Crucially, energy per token is roughly invariant to the engine width, so widening the datapath slides the FPGA point rightward — more tokens per second at the same energy — rather than down. Throughput is a dial you can turn; the energy floor is the structural win. CPU/GPU points measured; the FPGA point derived from silicon-measured primitives.

That is the case on paper. The rest of this post is what happened when I tried to build it.


Part II — Building It, One Honest Measurement at a Time

I built ternfpga as a strict test-driven loop, because hardware bugs are expensive and silicon bugs are very expensive. Every module got a NumPy “golden” reference and a cocotb testbench that checked the RTL bit-exact against it before anything was synthesized, let alone flashed. The development cycle ran across two machines: I authored on a laptop, an rsync pushed the tree to a Linux box with the FPGA physically attached, and that box ran Verilator + cocotb for simulation, Vivado for synthesis, and openFPGALoader to flash the board. Results — including the board’s own UART output — streamed back. And I kept an append-only build log, dated, including the dead-ends. Most of the good anecdotes below are lifted straight from it.

What follows is nine phases. They are not all glamorous. One of them is the FPGA losing.

Phase 0 — the multiply-free core, from a unit test to silicon

The first thing to exist was the NumPy golden for a ternary dot product, then its cocotb test, then the SystemVerilog. The RTL is the sign-select from Part I, eight lanes wide, summed in an adder tree. It passed 6 directed edge cases plus 2,000 randomized dot products bit-exact, with zero mismatches, before a single gate was synthesized.

Then synthesis, on the real part, to check the central claim:

| module | LUTs | FFs | DSP48 | Fmax (synth est.) |
|---|---|---|---|---|
| ternary_dot | 233 (1.1%) | 0 | 0 | combinational |
| ternary_gemv | 384 (1.9%) | 582 | 0 | ~104 MHz |
| ternary_gemv_sparse | 521 (2.5%) | 664 | 0 | ~116 MHz |

Zero DSP slices on every module. Vivado confirmed the ternary “multiply” is pure lookup-table sign-select plus carry-chain adders — all 90 hardware multipliers free. A later three-stage pipeline lifted the dot from ~104 MHz to ~280 MHz while shrinking to 149 LUTs (registers break the long adder tree into cheaper pieces), still 0 DSP. The full memory-to-compute datapath looks like this:

[Figure: the engine datapath. DDR3 weights (base-3, 5 trits/byte) → unpack (1 byte → 5 weight codes) → sign-select dot (K lanes + adder tree, 0 DSP), fed by activation x in on-chip BRAM → int32 accumulate over NT tiles → y = W·x in BRAM. Measured on silicon: 1.00 cycle/tile; 8 ternary MACs/cycle = 800 M MAC/s at 100 MHz, sustained, bit-exact.]

The engine datapath. Weights stream from DRAM in the dense base-3 format, a combinational decoder expands each byte into five weight codes, the sign-select lanes consume them against the BRAM-resident activation, and partial sums accumulate across tiles into the output. Nowhere in this path is there a hardware multiplier.

The milestone of Phase 0 was getting that datapath onto the physical board. I wrote a tiny top-level design — the ternary dot of a running counter against a fixed weight vector, streamed out over the USB serial port — synthesized it to a bitstream, and flashed the Arty. The board was supposed to print y = 2·counter on every line. It printed y = 0.

The counter was incrementing correctly, so the serial path was fine; the bug was in the logic. I wrote a full integration simulation that reproduced the y = 0, then probed the internals: the dot’s inputs were arriving correctly, but its output was zero. The root cause was embarrassing and instructive. My “weight constant” 0xAA55 packed to four $+1$s and four $-1$s — which sum to zero, not the $+2$ I intended. The correct constant was 0xA955. Three separate unit tests had passed; all three used the real test vectors and never exercised that specific hard-coded demo constant. The integration test earned its keep. One character fixed, rebuilt, reflashed:

16/16 UART lines: y == 2·counter

The multiply-free engine, computing correctly in fabric: 105 LUTs, 0 DSP, 100 MHz met, ~63 mW on-chip by Vivado’s power estimate. That 63 mW — more than a thousandfold below the GPU’s draw — is the number the whole energy argument rests on.
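The weight-constant bug is easy to reproduce in a few lines. The 2-bit encoding used here (01 = +1, 10 = −1, 00 = 0, lane 0 in the low bits) is an assumption inferred from the two constants quoted above, and feeding the same counter value to all eight lanes is likewise a guess at how the demo was wired.

```python
# Assumed 2-bit weight codes; inferred from the constants, not taken from the RTL.
CODE_TO_WEIGHT = {0b00: 0, 0b01: +1, 0b10: -1}

def decode_weights(packed, lanes=8):
    """Split a packed constant into `lanes` 2-bit codes, lane 0 in the low bits."""
    return [CODE_TO_WEIGHT[(packed >> (2 * i)) & 0b11] for i in range(lanes)]

def demo_output(packed, counter):
    """Dot product of the decoded weights against the counter on every lane."""
    return sum(w * counter for w in decode_weights(packed))

buggy = decode_weights(0xAA55)
fixed = decode_weights(0xA955)

print("0xAA55 ->", buggy, "sum", sum(buggy))
print("0xA955 ->", fixed, "sum", sum(fixed))
print("y at counter=7:", demo_output(0xAA55, 7), "vs", demo_output(0xA955, 7))
```

Under that encoding the first constant decodes to four +1s and four −1s, so every output is zero regardless of the counter, while the second decodes to five +1s and three −1s and gives twice the counter.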

The baseline triad — and a very awkward number for the GPU

You cannot claim an energy win without measuring what you’re beating. So I stood up two baselines on the same machine that hosts the FPGA. The CPU baseline used bitnet.cpp, the official ternary inference runtime, with energy read from the processor’s own RAPL counters. The GPU baseline took a maintenance window — I did a live driver swap from nouveau to NVIDIA without a reboot (a second display GPU kept the console alive, so the only network path to the box never dropped), then measured decode throughput and nvidia-smi power.

| platform | path | tok/s | power | J / token |
|---|---|---|---|---|
| CPU 5950X | native ternary (i2_s) | 28.4 | ~121 W | 4.62 |
| RTX 3060 | bf16 (dequantized) | 23.5 | 86.4 W | 3.67 |
| FPGA Arty | ternary, 0 DSP | (building) | ~0.06–0.5 W | (the rest of this post) |

There is the awkward number. The RTX 3060, faced with BitNet, has no ternary datapath, so it dequantizes the weights to bf16, inflating a model that should occupy a few hundred megabytes into 4.87 GB, and runs ordinary tensor-core multiplies. The result: 3.67 J/token, barely better than the CPU’s native-ternary 4.62, and actually slower. A 400-dollar GPU, extracting almost no value from the 1.58-bit weights it was handed. That gap is the entire opportunity, quantified. The FPGA’s job is to not throw the ternary structure away.

Phase 1 — DDR3 and a RISC-V CPU, on the board

To run anything model-sized, the engine needs to stream weights from the board’s DRAM, and it needs a host to sequence the layers. I brought up a LiteX system-on-chip on the Arty: a VexRiscv RISC-V CPU, the LiteDRAM controller driving the board’s 256 MB DDR3, and the ternary engine wired in as a memory-mapped peripheral. The hardest part of any FPGA project like this is DRAM calibration, and it came up green — Memtest OK, read leveling calibrated. Then the firmware drove the engine and checked it:

=== ternfpga on-board streaming GEMV (K=8, M=16) ===
GEMV_ONBOARD_PASS  (16 rows bit-exact vs golden)

The full chain — CPU writes the activation, streams packed weight bytes, the engine unpacks and does the multiply-free dot, the CPU reads the result — bit-exact, on silicon, in a real SoC. The two scariest integration risks (DRAM calibration and the CPU↔engine interface) were now retired.

Phase 2 — the pivot, where honesty changed the plan

This is the phase where the project nearly got more ambitious and instead got more honest. Before committing many sessions to “scale the engine to a full model,” I ran a structured literature review. It returned a verdict that was partly humbling and entirely useful:

So I re-scoped, explicitly: from “a full model on the board” down to one real-width transformer block, streamed from DRAM, with the non-ternary glue (normalization, attention softmax, the LM head) running on the host CPU, and the headline being energy per token versus the RTX 3060 on identical numerics. Nobody had built an LLM datapath on an Artix-7-class board; that white space was the point.

Two de-risking measurements followed, both before writing the block. The first was a place-and-route fit sweep — how wide can the datapath get on a 35T before it stops fitting?

| datapath width | LUTs | LUT % | FFs | FF % | DSP |
|---|---|---|---|---|---|
| 32 | 565 | 2.7% | 865 | 2.1% | 0 |
| 1024 | 10,234 | 49% | 24,675 | 59% | 0 |
| 2048 | 11,013 | 53% | 32,961 | 79% | 0 |

Zero DSP holds all the way to width 2048 — the multiply-free property, proven at real scale. But notice the flip-flops: 79% at width 2048. The wall isn’t compute; it’s keeping operands in registers. The lesson, which shaped the entire microarchitecture, was: the scalable engine must be BRAM-centric — operands live in block RAM and stream through sequentially — not a giant flat array of registers. (Moving operands from flip-flops to BRAM later collapsed the flip-flop usage ~90× and met timing where the register-resident version had failed by 5.9 ns. I’ll spare you the third recurrence of the underlying bug until it bites again below.)

The second measurement was sparsity. The “skip the zeros” direction needs zeros to skip, and I could find no published figure for BitNet b1.58’s FFN sparsity, so I measured it: 59.8% of activations zero per token, averaged over diverse text across all thirty layers. That is real and GPU-unmatchable, but it is below the 85–95% that relu-fied models reach. I corrected the project’s own README to match the measurement rather than the hope. Honest beats optimistic; this theme recurs.

Phase 2 also turned up a small piece of mathematics I find genuinely lovely. The feed-forward block computes, per channel, $\text{relu}(\text{gate})^2 \cdot \text{up}$, normalizes it, and feeds the result — requantized to int8 — into the final projection. That requantization looks like it needs floating-point: dequantize the integer matmul outputs by their per-token scales, apply the RMSNorm divide, then re-quantize. But the int8 value that actually reaches the next matmul is

\[h_{q,i} \;=\; \text{round}\!\left(\frac{127 \, N_i}{\max_j \lvert N_j \rvert}\right), \qquad N_i = \text{relu}(g_i)^2 \cdot u_i \cdot w_i\]

where $g_i$ and $u_i$ are the integer gate and up outputs and $w_i$ is a fixed-point norm weight. The requantization is a ratio — $N_i$ over the maximum $\lvert N \rvert$ — and every per-token dequant scale and the entire RMSNorm normalizer are common positive factors that appear in both the numerator and the denominator. They cancel, exactly. The “hard” floating-point glue between the matmuls is, on-chip, pure integer arithmetic plus a single reciprocal. I verified this against the validated reference: with floating-point norm weights it is a 100.00% exact match; with 16-bit fixed-point weights, 99.99% (off by at most 1). That identity is what later makes an on-fabric glue unit clean instead of nightmarish.
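The cancellation can be checked numerically. The sketch below uses exact rational arithmetic so that rounding noise cannot hide a mismatch; the gate, up, and norm-weight values are invented for the example, and `scale` stands in for the product of every common positive factor (dequant scales and the RMSNorm normalizer).

```python
from fractions import Fraction

def numerators(gate, up, norm_w):
    """N_i = relu(g_i)^2 * u_i * w_i, all integers."""
    return [max(g, 0) ** 2 * u * w for g, u, w in zip(gate, up, norm_w)]

def requant_int8(values):
    """h_q,i = round(127 * N_i / max_j |N_j|), computed exactly."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values)
    return [round(Fraction(127 * v) / peak) for v in values]

# Invented integer matmul outputs and fixed-point norm weights.
gate = [5, -3, 12, 7, 0, 9]
up = [-4, 8, 3, -6, 2, 1]
norm_w = [300, 280, 310, 295, 305, 290]

n = numerators(gate, up, norm_w)
scale = Fraction(355, 113) / (1 << 20)      # any common positive factor
h_integer = requant_int8(n)
h_scaled = requant_int8([scale * v for v in n])

print(h_integer)
print("scale cancels:", h_integer == h_scaled)
```

Because `scale` multiplies both the numerator and the maximum, it drops out of the ratio, so the integer-only path yields the same int8 codes as the "dequantize, normalize, requantize" path would.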

Phase 3 — the projection, and two risks named out loud

With the block datapath validated against PyTorch (the full decoder layer reproduced the real model at cosine similarity 1.000000), I could compose a full-model energy estimate from silicon-measured primitives — the engine’s 1.00 cycle/tile, the measured DRAM bandwidth, the measured power. It came out to roughly 1.47 J/token of engine compute, about 2.5× under the GPU. But a projection is only as honest as the risks it confronts, and the review had named two.

Risk 1: maybe single-channel DDR3 is too slow to matter. I measured the actual read roofline with a hardware DMA engine and a cycle counter: 1,423 MB/s sustained, 89% of the memory port’s theoretical peak. That caps a 0.7B model at about 8 tokens/second (slow, as promised), but the energy floor is ~60 mJ/token, still tens of times under the GPU. The engine itself only demands 200 MB/s, so it is compute-bound, not bandwidth-starved; the path to using the full channel is a wider datapath, not faster memory. The energy claim survives Risk 1.

Risk 2: maybe the sparsity is fake. If those 60% zeros fell in a fixed or regular pattern, a GPU’s structured-sparsity hardware could capture them and the differentiator would evaporate. So I measured the structure: 93.9% of channels are data-dependent (only ~4% are statically zero), the active set changes so much token-to-token that two tokens share less than half their nonzeros, and a static structured mask captures only 69% of the zeros. The sparsity is genuinely unstructured: exactly the kind a GPU cannot exploit and an FPGA gather can. The differentiator survives Risk 2.

Phase 4 — the fully-measured verdict: the FPGA loses

Here is the phase I most want to keep in the post, because it is the one that almost every writeup would quietly delete.

Up to now, the engine was silicon-measured but the system energy was a projection. To make it real, I rewrote all the transformer “glue” — the normalization, the rotary position embedding, the attention scores and softmax — as pure integer code (lookup tables for the transcendentals, the cancellation identity for the rest) and measured it running on the board:

norm = 0.54M   rope = 0.08M   attention = 16.2M   ffn-glue = 2.58M
GLUE_INT_PER_LAYER = 19.42 M cycles

And then the arithmetic stopped being kind. The engine is 8.68M cycles per layer; the glue is 19.42M — more than twice the engine. A full token is $30 \times 28.1\text{M} + \text{LM head} = 884\text{M}$ cycles, which at 100 MHz and 0.489 W is 4.32 J/token. The engine alone would be 2.5× under the GPU — but the naive host-split system was 1.2× worse than the GPU it was supposed to beat.
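The arithmetic behind that verdict fits in a few lines. The clock, power, layer count, and per-layer cycle counts come from the text above; the 41M-cycle LM head is not stated directly and is inferred here as the difference between the 884M total and thirty 28.1M-cycle layers.

```python
CLOCK_HZ = 100e6          # 100 MHz fabric clock
POWER_W = 0.489           # estimated whole-SoC power
LAYERS = 30
LM_HEAD_CYCLES = 41e6     # inferred: 884M total minus 30 * 28.1M
GPU_J_PER_TOKEN = 3.67    # measured RTX 3060 baseline

def joules_per_token(cycles_per_layer):
    """Energy = power * time, with time = total cycles / clock."""
    total_cycles = LAYERS * cycles_per_layer + LM_HEAD_CYCLES
    seconds = total_cycles / CLOCK_HZ
    return POWER_W * seconds

engine = 8.68e6           # cycles per layer, ternary engine
glue = 19.42e6            # cycles per layer, integer glue on the host CPU

host_split = joules_per_token(engine + glue)
print(f"host-split: {host_split:.2f} J/token, "
      f"{host_split / GPU_J_PER_TOKEN:.1f}x the GPU")
```

Because power is treated as constant, energy per token is simply proportional to cycles per token, which is why shrinking the glue term is the only lever that matters from here on.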

That is a real result, measured on real silicon, and it said the design was wrong. The diagnosis was unambiguous: 83% of the glue was attention — the scores, the softmax, the weighted sum over the value cache — running on a cacheless soft RISC-V core, bottlenecked on DRAM latency. The CPU was a terrible place to do attention.

(A debugging confession from this phase, because it’s the kind of thing that eats a day: an earlier version of the glue firmware halted after printing about six characters over serial. I blamed the soft-float library, the timer, the heavy math. It was none of those. LiteX’s serial output is interrupt-driven, and I had omitted the one line that enables interrupts. The transmit buffer filled and stalled. The right move would have been to check the I/O path before the soft-float rabbit hole. Lesson re-learned: suspect the boring thing first.)

The losing number was the most valuable measurement in the project. It converted an architectural opinion (“attention should be on the fabric”) into a quantified necessity, and it set the agenda for everything that followed. The next four phases are the climb back up the first figure in this post.

Here is the layer the climb produces — every operation tinted by where it ends up running. The terracotta and green are silicon; only the thin grey blocks stay on the soft CPU:

[Figure: one BitNet decoder layer, hidden state in to hidden state out, with each operation marked as ternary engine (0 DSP), on-fabric unit, or host CPU (VexRiscv). Attention half: input RMSNorm, Q·K·V projection, RoPE (rotary position), scores · softmax · a·V, attn sub-norm, O projection, residual add. FFN half: post-attention RMSNorm, gate · up projection, squared-ReLU · glue · int8 requant, down projection, residual add. 7 ternary GEMVs per layer, all 0 DSP.]

One BitNet decoder layer, with each operation colored by where it runs in the final design. The seven matrix multiplies (Q/K/V/O and gate/up/down) are ternary GEMVs on the 0-DSP engine; attention and the FFN glue become dedicated on-fabric units over Phases 5–8; only the RMSNorms and RoPE — the thin grey blocks, ~0.6M cycles a layer — stay on the host CPU. The arrows down the left spine are the residual stream; the ⊕ are the residual adds. This is the picture the next four phases assemble.

Phases 5 & 6 — attention onto the fabric, and the verdict flips

If the CPU is a bad place to do attention, the fix is to build attention in hardware. The attention_unit keeps the key and value cache in on-chip BRAM and, for one query, computes the scores (an int16 multiply-accumulate against each cached key), turns them into a softmax, and produces the weighted sum over the values. The softmax is the interesting part, because a naive softmax wants floating-point exponentials and a division. I avoided both:

Bit-exact against its integer oracle, the unit runs at about one multiply-accumulate per cycle, which makes attention roughly 49× faster than the host version. (Two bugs en route, both worth the warning: the exp table’s first entry was 32768, one past the signed-16-bit maximum, and it silently wrapped negative; the fix was to clip it to 32767. And the key/value memories, when their read port lived in a block with an asynchronous reset, synthesized as a vast array of flip-flops with an address decoder instead of as BRAM, blowing LUT usage to 90%. Moving the read into a clock-only block with an explicit ram_style="block" hint fixed it. Hold that thought: it happens again.)

What does that do to the first figure in this post? Replace the 16.2M-cycle host attention with the on-fabric ~0.33M, and the glue per layer collapses from 19.4M to 3.5M cycles. The layer drops from 28.1M to 12.2M, the token from 884M to ~407M, and the energy from 4.32 J to 1.99 J/token. The system goes from 1.2× worse than the GPU to roughly 1.8× better. The engine is now 71% of the layer’s work — the 0-DSP ternary advantage finally shows up at the system level, not just the kernel level. That is the single most important transition in the project, and it is the second bar in the opening figure.

Phase 6 put it on silicon. Wrapped as a peripheral, built, flashed, and run on the physical board:

ATTN_ONBOARD_PASS  (128 num + sum_e bit-exact)
MEASURED attention cycles/query = 16456  (T=64, D=128)

Bit-exact on real hardware, at one MAC per cycle confirmed — a ~49× collapse versus host attention at this cache depth. Attention now had the full PyTorch → simulation → silicon chain, so the flipped verdict rested on a measured term, not a synthesized estimate. (Timing honesty, since this matters: the static analyzer reported the design missing 100 MHz by 1.27 ns at the worst-case corner. It ran bit-exact at 100 MHz anyway, because the worst-case model is pessimistic and the room is air-conditioned — but the honest fix, a pipeline register on the critical path, is noted as owed. I am not going to pretend a negative slack number is a positive one.)
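The "one MAC per cycle" reading of that log line can be sanity-checked with a short calculation. Counting T·D multiply-accumulates for the scores and another T·D for the weighted sum is the standard tally for one query against one head; the figure of 20 heads per layer is an assumption taken from BitNet-2B-4T's public configuration, not from the board log.

```python
T = 64                    # cached positions (from the log line)
D = 128                   # head dimension (from the log line)
MEASURED_CYCLES = 16456   # cycles per query (from the log line)
HEADS = 20                # assumed attention heads per layer
HOST_ATTENTION = 16.2e6   # host attention cycles per layer, from Phase 4

macs = 2 * T * D                       # scores (T*D) + weighted sum over values (T*D)
overhead = MEASURED_CYCLES - macs      # cycles not spent on a MAC
per_layer = MEASURED_CYCLES * HEADS    # all heads, one after another
speedup = HOST_ATTENTION / per_layer

print(f"MACs per query: {macs}, overhead cycles: {overhead}")
print(f"per layer: {per_layer / 1e6:.2f}M cycles, {speedup:.0f}x vs host")
```

Under those assumptions the measured count is only a few dozen cycles above the MAC count, and the per-layer total lands on the ~0.33M cycles and ~49× figures used in the energy calculation.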

Phases 7 & 8 — the last big glue term, and the bug that came back a third time

With attention handled, the largest remaining host term was the FFN inter-projection glue — the $\text{relu}(\text{gate})^2 \cdot \text{up} \cdot w$ and int8 requantization from Phase 2, 2.58M cycles a layer on the host. The ffn_glue_unit does it on-fabric in two passes: compute every $N_i$ and track the running maximum $\lvert N \rvert$, then requantize. The requantization needs a divide by $\max \lvert N \rvert$ per channel, which I did not want to instantiate 6,912 times. The trick: compute one reciprocal per call — $\text{recip} = (127 \ll R) / \max \lvert N \rvert$ with $R$ chosen so the reciprocal lands in a fixed 32-bit window — using a single restoring divider, then each channel’s requant is a multiply and a shift, $h_{q,i} = (\lvert N_i \rvert \cdot \text{recip} \gg R) \cdot \text{sign}$. One divide, reused across all channels.
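The one-divide trick can be sketched in plain integer Python. The rule for choosing the shift below is an assumption made for this example (it keeps the reciprocal under 2^31 for any maximum); the post says only that R places the reciprocal in a fixed 32-bit window. The sketch also truncates instead of rounding, so treat it as an illustration of the structure, not a bit-exact model of ffn_glue_unit.

```python
import random

def requant_one_divide(values):
    """Requantize to the int8 range with one integer divide shared by all channels."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values)
    shift = 24 + peak.bit_length() - 1      # assumed rule: keeps recip < 2**31
    recip = (127 << shift) // peak          # the single divide
    out = []
    for v in values:
        mag = (abs(v) * recip) >> shift     # per channel: one multiply, one shift
        out.append(mag if v >= 0 else -mag)
    return out

rng = random.Random(1)
n = [rng.randint(-10**12, 10**12) for _ in range(64)]
h = requant_one_divide(n)

# Compare with a divide per channel (truncating), the thing the trick avoids.
peak = max(abs(v) for v in n)
per_channel = [(127 * abs(v)) // peak for v in n]
worst = max(ref - abs(got) for ref, got in zip(per_channel, h))
print("largest shortfall vs per-channel divide:", worst)
```

Because the reciprocal is rounded down, each result is either equal to the per-channel quotient or one below it, never above; a real implementation would need a rounding term to hit the exact rounded value.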

Bit-exact against the oracle, ~165× faster than the host glue. And then synthesis came back at 115% of the LUTs and 134% of the flip-flops — it didn’t fit at all. The cause was the bug from Phase 2 and Phase 5, for the third time: the output memory hqmem had its write in one always-block and its read in another, so the tool couldn’t infer a block RAM and instead built a 6,912-deep array of flip-flops plus an address decoder. Putting the write and read in the same clock-only block turned it into a proper simple-dual-port BRAM, and usage dropped to 7% LUT / 1% FF / 40% BRAM / 19 DSP. Three times this exact mistake cost me a synthesis run; it is now the first thing I check when utilization looks insane. (The lesson, stated generally: a memory whose read and write live in different procedural blocks will not infer as BRAM. Tattoo it somewhere.)

The unit’s single-cycle compute path was far too long for 100 MHz: a negative slack of 42.7 ns. Pipelining it was mechanical once I committed to one operation per stage. Eight stages, with a valid bit, the channel index, and the operands all travelling together, recovered nearly all of that slack in three principled steps (−42.7 → −8.9 once pipelined → −3.2 once I registered the requant scale so a priority encoder left the per-cycle path → −1.9 ns once I split a 122-bit add from a barrel shift). Same cycle count, same bit-exact result. Then, on silicon:

FFNGLUE_ONBOARD_PASS  (6912 h_q + max|N| bit-exact)
MEASURED ffn-glue cycles/layer = 13974  (F=6912)  →  184× vs host

Bit-exact on the board at the real FFN width, a 184× collapse. (The measured 13,974 is more trustworthy than the simulation’s earlier 15.6K extrapolation, which had double-counted a per-pass drain — measuring the real thing beat extrapolating from a small one.) The system energy fell to 1.62 J/token, ~2.3× under the GPU — the third bar in the opening figure. With both big terms on the fabric, the engine is now 90% of the layer, and the largest thing left is the pair of RMSNorm operations at 0.54M cycles — the gap between the 1.62 bar and the 1.47 engine floor.

The bookkeeping milestone of Phase 8: three of the system’s four cycle terms — the engine, attention, and the FFN glue — are now silicon-measured. Only the RMSNorm and the fully-integrated loop remain projections.

Here is the whole arc in cycles rather than joules — the mechanism underneath the energy ladder from the top of the post:

[Figure: Where the cycles go, one BitNet-2B layer. Engine work is fixed (ternary engine, 8.68M cycles); moving glue onto the fabric collapses the dominant term. Cycles per layer: host-split 28.1M (4.32 J, engine share 31%); + on-fabric attention 12.2M (1.99 J, engine share 71%); + on-fabric FFN glue 9.65M (1.62 J, engine share 90%). Stacked segments: ternary engine, attention glue, FFN glue, norms + RoPE.]

The same three states as the headline figure, counted in cycles per layer instead of joules. The terracotta base — the ternary engine — never changes. What collapses is the glue on top: the red attention block (16.2M cycles, host-bound) all but vanishes when it moves to the fabric, then the FFN glue follows. By the third bar the engine is 90% of the layer, and only the thin grey sliver of norms and RoPE is left on the host.
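The percentages in that figure follow directly from the cycle counts quoted in this post. A small Python check, using only numbers that appear above (the dictionary keys and function name are just labels chosen for this sketch):

```python
# Cycle counts per layer, as quoted in the post.
ENGINE_CYCLES = 8.68e6  # ternary engine, fixed across all three states

layer_cycles = {
    "host-split": 28.1e6,
    "+ on-fabric attention": 12.2e6,
    "+ on-fabric FFN glue": 9.65e6,
}


def engine_share(total_cycles):
    """Fraction of a layer's cycles spent in the ternary engine."""
    return ENGINE_CYCLES / total_cycles


for name, total in layer_cycles.items():
    print(f"{name:24s} {total / 1e6:5.2f}M cycles  engine share {engine_share(total):.0%}")

# FFN glue: host cycles per layer (Phase 2) over measured on-fabric cycles.
ffn_glue_speedup = 2.58e6 / 13974
print(f"FFN glue speedup: {ffn_glue_speedup:.1f}x")
```

The engine term never moves; its share rises only because the denominator shrinks, which is the whole argument of the figure in one division.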

Phase 9 — two accelerators on one board, cooperating

Every silicon result so far measured one unit in isolation. The capstone was to put two of them in the same chip and make them hand data to each other: the ternary engine computes the gate and up projections, its integer outputs feed the FFN-glue unit, and the requantized result comes back — the actual data flow of an FFN, end to end, on real hardware.

engine gate/up GEMV: ok (0 row mismatches)
COMBINED_ONBOARD_PASS  (engine -> ffn_glue, 32 h_q bit-exact, ffn-glue 214 cyc)

Bit-exact, end to end, on the board — the first multi-accelerator computation in the project. It also pinned the honest frontier, which is the most important thing this phase has to say:

[Figure: On-chip memory (BRAM) is the wall. The Arty A7-35T has 50 block-RAM tiles; a pair of accelerators fits, all three don't. SoC + engine (~27 tiles) plus FFN-glue (~18 tiles) uses 45 tiles: built, bit-exact on silicon. Adding attention (~18 tiles) would use 63 tiles: over the 50-tile budget, needs tiling or a bigger board.]

The frontier, measured. The ternary engine and the full-width FFN-glue unit, plus the supporting CPU/DRAM system, fit a 35T at 45 of its 50 block-RAM tiles — built and verified. Adding the attention unit's ~18 tiles would need 63, over the budget. The full three-accelerator decode loop wants either FFN tiling or a board with more on-chip memory (a 250-dollar A7-100T, or a Zynq KV260).

The pair fits at 45 of 50 BRAM tiles; adding attention’s ~18 would need 63. So the fully integrated three-accelerator decode loop does not fit a single 35T — it needs either FFN tiling (narrowing the glue/attention to share memory) or a board with more on-chip RAM. I want to be unambiguous about this, because it is the one place the “single 130-dollar board” framing has an asterisk: each accelerator is silicon-proven, a pair co-resides and cooperates on silicon, but the whole loop in one bitstream is the step that needs a bigger board or a tiling pass. The energy argument is built from cycle counts and holds regardless of which board the loop ultimately runs on; what Phase 9 proves is that the accelerators share a die and hand off data for real.
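The fit argument is plain bookkeeping, sketched here in Python. The tile counts are the approximate figures quoted above (~27, ~18, ~18), so treat the results as accurate only to a tile or two; the names are hypothetical labels for this sketch.

```python
BRAM_BUDGET = 50  # block-RAM tiles on the Arty A7-35T

# Approximate tile usage per block, as quoted in the post.
tiles = {
    "soc_and_engine": 27,
    "ffn_glue": 18,
    "attention": 18,
}


def bram_fit(units):
    """Return (tiles_used, fits) for a set of co-resident blocks."""
    used = sum(tiles[u] for u in units)
    return used, used <= BRAM_BUDGET


pair = bram_fit(["soc_and_engine", "ffn_glue"])
all_three = bram_fit(["soc_and_engine", "ffn_glue", "attention"])
shortfall = all_three[0] - BRAM_BUDGET  # tiles a tiling pass would have to free

print(pair, all_three, shortfall)
```

On these numbers a tiling pass would need to free roughly 13 tiles for the full loop to share one 35T bitstream.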


The Evidence Ledger

Hardware projects, like benchmark papers, are easy to oversell. So here is every load-bearing claim in this post, sorted by how I actually know it. Silicon-measured means I read it off the physical board. Derived means I composed it from silicon-measured primitives. Projected means it’s a forward estimate I have not yet built.

| Claim | Value | Evidence |
| --- | --- | --- |
| The ternary “multiply” uses no hardware multipliers | 0 DSP up to datapath width 2048 | silicon-measured (synthesis + on-board) |
| Engine throughput | 1.00 cycle/tile, bit-exact (800 M MAC/s) | silicon-measured |
| DDR3 read bandwidth (the decode bottleneck) | 1,423 MB/s (89% of port peak) | silicon-measured |
| Host-CPU transformer glue | 19.42 M cycles/layer | silicon-measured |
| On-fabric attention | 16,456 cycles/query, bit-exact | silicon-measured |
| On-fabric FFN glue | 13,974 cycles/layer, bit-exact (184×) | silicon-measured |
| Engine + FFN-glue co-resident and cooperating | bit-exact end-to-end, 45/50 BRAM | silicon-measured |
| CPU / GPU energy baselines | 4.62 / 3.67 J/token | measured (RAPL / nvidia-smi) |
| FFN activation sparsity, and that it’s unstructured | 59.8%, 94% data-dependent | measured (model analysis) |
| System energy, host-split (the loss) | 4.32 J/token (1.2× worse) | derived (measured cycles × est. power) |
| System energy, +on-fabric attention | 1.99 J/token (~1.8× under GPU) | derived |
| System energy, +on-fabric FFN glue | 1.62 J/token (~2.3× under GPU) | derived |
| Engine-bound floor (RMSNorm also on fabric) | 1.47 J/token (~2.5× under GPU) | projected |
| Full three-accelerator loop in one bitstream | | projected (doesn’t fit a 35T; needs tiling/bigger board) |
| Relu-fied sparsity upside (10–20× on the FFN) | 85–95% sparse | projected (needs a fine-tune) |

One caveat deserves to be stated loudly, because it touches every energy number: power is the one quantity I did not meter. The 0.489 W I multiply cycles by is Vivado’s vectorless post-route estimate, not a reading from a current probe. So every joule-per-token here is honestly “measured cycle counts times an estimated wattage.” The cycle path is silicon-measured end-to-end; closing the loop with a metered watt is the single highest-value thing left to do, and I’d treat the energy ratios as good-to-~20% until then.

With that said: the load-bearing surprise — that a GPU extracts almost nothing from ternary weights (3.67 J/token, barely beating a CPU) — is directly measured, and the engine differentiator (0 DSP, 1 cycle/tile, sub-watt) is directly measured. The ~2.3× system win is derived from those, not asserted.
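The ratios in the ledger, and what the ~20% caveat does to them, can be reproduced with a few lines of Python. One assumption is baked in and should be read as such: the sketch models the uncertainty as a relative error on the estimated wattage, which scales every FPGA joule figure linearly because energy here is cycles times power. The names are hypothetical.

```python
GPU_J_PER_TOKEN = 3.67  # measured baseline from the ledger

fpga_j_per_token = {
    "host-split": 4.32,
    "+ on-fabric attention": 1.99,
    "+ on-fabric FFN glue": 1.62,
    "engine floor (projected)": 1.47,
}


def ratio_vs_gpu(fpga_joules, power_error=0.0):
    """GPU energy over FPGA energy; > 1 means the FPGA system wins.

    power_error is a relative error on the estimated wattage
    (assumption: energy scales linearly with it).
    """
    return GPU_J_PER_TOKEN / (fpga_joules * (1.0 + power_error))


for name, joules in fpga_j_per_token.items():
    low = ratio_vs_gpu(joules, +0.20)   # true power 20% higher than estimated
    high = ratio_vs_gpu(joules, -0.20)  # true power 20% lower than estimated
    print(f"{name:26s} {ratio_vs_gpu(joules):.2f}x  (range {low:.2f}x to {high:.2f}x)")
```

Under that assumption the headline configuration stays ahead of the GPU even at the pessimistic end of the range, while the host-split configuration stays behind or roughly level.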

What I Learned

  1. Memory is the cost, not arithmetic. Every advantage in this project comes from moving fewer bytes — ternary weights, skipped sparse columns — not from doing math faster. The roofline said batch-1 decode is bandwidth-bound before I wrote a line of RTL, and it was right the whole way down. If you take one thing from this post, let it be that the interesting lever in edge LLM inference is the memory system, and that’s a lever a tiny device can pull.

  2. The honest pivot beat the optimistic projection. The most valuable measurement I made was the one where the FPGA lost — 4.32 J/token, 1.2× worse than the GPU, in Phase 4. A projection would have quietly rounded that away. Measuring it converted “attention should probably be on the fabric” into a quantified necessity and set the agenda for the three phases that produced the actual win. Build the thing that can embarrass you.

  3. Label every number by how you got it. Silicon-measured, derived, projected. Keeping those tiers separate — in the build log, in the README, in this post — is the entire difference between a result and a press release. It also makes the open work obvious: the projected rows above are the to-do list.

  4. A memory’s read and write must live in the same clocked block, or the synthesis tool builds a flip-flop array with an address decoder instead of a block RAM and detonates your resource budget. This bug cost me a wasted synthesis run three separate times — on three different modules — before it finally stuck. Some lessons you learn once; this one I had to learn thrice.

  5. Integration tests earn their keep. Three unit tests passed and all three missed the one hard-coded constant (0xAA55 summing to zero) that made the first on-board run print garbage. The bug only existed at the seam the unit tests didn’t cover. On hardware, where a wrong bitstream is a ten-minute round-trip, the test that exercises the whole path is worth ten that exercise pieces.

  6. Suspect the boring thing first. I lost the better part of a day to “soft-float math is too slow” when the real cause was a single missing line that enables interrupts, stalling the serial port. The exotic explanation is seductive precisely because it’s interesting. Check the plumbing before the theory.

What I’d Build Next

In rough order of value:

Acknowledgments

This project stands on a specific stack of prior work: the BitNet b1.58 line that made ternary LLMs trainable and good, and the FPGA-inference papers — TerEffic, TeLLMe, ProSparse — that established the LUT-based ternary datapath and the sparsity numbers I measured against. The open toolchain did the heavy lifting: LiteX and LiteDRAM for the SoC and the DDR3 controller, VexRiscv for the soft CPU, cocotb and Verilator for test-driven simulation, openFPGALoader for flashing, and AMD’s Vivado for synthesis. None of this would be approachable on a hobbyist budget without them.

The full source is on GitHub under Apache-2.0, including the dated build log that every anecdote here came from, and the reproduction scripts. If you’d argue with any number in this post — especially the energy ratios, until that metered watt exists — open an issue. That’s how the next iteration gets honest.