TeLLMe FPGA Pipeline Accelerator

Updated 20 February 2026
  • TeLLMe FPGA Pipeline is a specialized streaming accelerator architecture that enables low-bit LLM inference on power- and area-constrained FPGAs.
  • It employs a table-lookup ternary matrix multiplication engine with pipelined LUT and URAM techniques to optimize throughput and resource balance.
  • The design integrates fused attention, normalization, and adaptive weight buffering to meet strict power budgets while scaling both prefill and decode operations.

The TeLLMe FPGA pipeline refers to a highly specialized, deeply pipelined streaming accelerator architecture for low-bit quantized LLM inference on power- and area-constrained FPGAs. Its technical innovations center on a ternary-weight (1.58-bit) matrix multiplication engine using table-lookup techniques, combined with bandwidth-optimized, fused attention and normalization units. TeLLMe’s architecture supports both the high-throughput “prefill” and autoregressive “decode” stages of LLM inference within stringent power budgets, leveraging on-chip URAM and LUT logic for efficiency and scalability. The following sections detail its architectural principles, core pipeline modules, quantitative performance, and design trade-offs as reported in (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025).

1. Pipeline Datapath and Streaming Microarchitecture

TeLLMe’s accelerator instantiates an end-to-end streaming pipeline, in which each major transformer operation is mapped to a dedicated stage with minimal buffering between modules. The primary modules are:

  • Q-Norm–Quant/Dequant Unit: Performs RMSNorm followed by 8-bit symmetric “AbsMax” quantization to produce INT8 activations and corresponding scale factors.
  • Ternary MatMul Engine (“TLMM” or “TL-table”): Implements linear layers using a table-lookup dot-product method, exploiting grouped activations and ternary weight indices for efficient data reuse.
  • Attention Engine (Prefill/Decoding): Fuses softmax, masking, and $QK^T$/$SV^T$ computation using a reversed traversal scheme for memory efficiency.
  • MLP Nonlinearities and Output: Fused in-pipeline activation functions (e.g., SiLU), final dequantization, and output token selection.

All inter-module data movement uses AXI4-Stream FIFOs of fixed width (e.g., 256 bits), supporting dataflow with minimal cycle-level stalls. The system is orchestrated by a central controller that manages memory requests and kernel scheduling for both prefill (batch) and decode (autoregressive) operation modes (Qiao et al., 22 Apr 2025, Qiao et al., 3 Oct 2025).

2. Table-Lookup Ternary Matrix Multiplication (TLMM) Engine

The TLMM engine consists of a collection of LUT-centric, fully pipelined matmul units that perform ternary-weighted matrix-vector and matrix-matrix multiply operations as follows:

  • Weight Table Encoding: The ternary weight matrix $W\in\{-1,0,+1\}^{n\times k}$ is partitioned into $G$-sized groups. Each group of $G$ ternaries is pre-encoded offline as an index in $0\dots 3^G-1$, and stored in URAM as bitpacked vectors.
  • Online Activation Grouping: At runtime, each activation group $A_g\in \mathbb{Z}^G$ is used to precompute a TL table of all $3^G$ possible partial sums (linear combinations of $A_g$ with $\{\pm1,0\}$ signs).
  • Index Lookup and Accumulation: For each output position, the relevant group’s index is retrieved from URAM, the corresponding sum is looked up from the TL table, and partial sums are accumulated to form the final result. This is pipelined with initiation interval (II) = 1, such that every cycle, a group of $T$ outputs is produced.
  • Parallelism: $T$ (table vectorization) and $Q$ (number of LUT table reads per cycle) define the parallel throughput; their values are tuned to saturate available LUT and URAM bandwidth.

The design removes DSP dependence for matmul, concentrating compute in LUTs/URAM to preserve limited DSPs for other operations and exploit FPGA-specific resource balances (Qiao et al., 3 Oct 2025, Qiao et al., 22 Apr 2025).
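
The three steps above (offline index encoding, online table construction, lookup and accumulation) can be sketched in software. This is a minimal NumPy illustration, not the authors' hardware design: the function names, the little-endian base-3 index layout, and the group size used in the check are assumptions chosen for clarity.

```python
import numpy as np

def encode_weights(W, G):
    """Offline step: map each group of G ternary weights to an index in [0, 3**G).

    Assumed layout (hypothetical): little-endian base 3, with -1/0/+1 -> digit 0/1/2.
    """
    n, k = W.shape
    assert k % G == 0, "k must be a multiple of the group size"
    digits = (W.astype(np.int64) + 1).reshape(n, k // G, G)
    powers = 3 ** np.arange(G)
    return (digits * powers).sum(axis=-1)          # shape (n, k // G)

def build_tables(a, G):
    """Online step: for every activation group, tabulate all 3**G signed partial sums."""
    groups = a.astype(np.int64).reshape(-1, G)     # shape (k // G, G)
    idx = np.arange(3 ** G)
    # Row i of `signs` is the ternary pattern whose base-3 index is i.
    signs = np.stack([(idx // 3 ** j) % 3 - 1 for j in range(G)], axis=1)
    return groups @ signs.T                        # shape (k // G, 3**G)

def tlmm_matvec(W_idx, a, G):
    """Lookup step: fetch one table entry per (output row, group) and accumulate."""
    tables = build_tables(a, G)
    picked = tables[np.arange(tables.shape[0]), W_idx]   # shape (n, k // G)
    return picked.sum(axis=1)
```

In this sketch the table is built once per activation vector and reused by every output row, which is the data reuse the lookup approach relies on. The multiplications inside `build_tables` are only by -1, 0, or +1, so a hardware version needs adders rather than multipliers.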

3. Streaming Attention: Reversed-Reorder Prefill and Lightweight Decode

TeLLMe employs optimized attention pipelines, distinct for prefill and decoding scenarios:

  • Prefill (“RPA”):
    • Implements a flash-style, blockSize=1 online streaming attention kernel traversing queries in reverse order (from last to first).
    • For each query, all keys and values are streamed sequentially through fused MAC and softmax pipelines. The single pass eliminates the need for $N\times N$ attention matrices, reducing both on-chip SRAM and DRAM bandwidth by up to $2\times$.
    • Outputs per head are accumulated via parallel prefix and softmax units.
  • Decoding (“DA”):
    • Implements a specialized pipeline for single-token inference. Cached $K$ and $V$ memories are streamed from DDR, and attention scores/softmax are computed on-the-fly, with outputs reduced using channel FIFOs, maintaining sub-microsecond kernel cycle counts.

Throughput is maximized by matching cycle-level data movement with streaming compute across all kernels, and by overlapping DRAM memory accesses with FPGA-side computation via fused operator design (Qiao et al., 3 Oct 2025).
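
The prefill description (one key/value pair at a time, queries visited from last to first, no materialized score matrix) matches the general online-softmax technique. The sketch below is a software illustration under that assumption; the function name and the $1/\sqrt{d}$ scaling are not taken from the source, and the reverse query order here only mirrors the described traversal without reproducing its hardware memory benefits.

```python
import numpy as np

def streaming_causal_attention(Q, K, V):
    """Causal attention with a running (online) softmax and block size 1."""
    N, d = Q.shape
    out = np.zeros((N, V.shape[1]))
    scale = 1.0 / np.sqrt(d)
    for i in reversed(range(N)):           # queries from last to first
        m = -np.inf                        # running maximum of the scores seen so far
        l = 0.0                            # running softmax denominator
        acc = np.zeros(V.shape[1])         # running weighted sum of value rows
        for j in range(i + 1):             # causal mask: only keys 0..i are visited
            s = float(Q[i] @ K[j]) * scale
            m_new = max(m, s)
            corr = np.exp(m - m_new)       # rescales earlier partial results
            p = np.exp(s - m_new)
            l = l * corr + p
            acc = acc * corr + p * V[j]
            m = m_new
        out[i] = acc / l
    return out
```

Only the scalars `m` and `l` and one accumulator row are kept per query, so storage does not grow with the sequence length beyond the inputs and outputs themselves.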

4. Adaptive Weight Buffering and Resource Management

A distinctive feature is the fine-grained, analytic management of on-chip URAM via the Weight Buffer Manager (WBMU):

  • URAM Partitioning: Ternary index vectors, stored as bitpacked streams, are partitioned so that each TLMM lane accesses its own URAM bank. Dual-porting and cascade factors ($c_{\mathrm{URAM}}$) are used to maximize simultaneous access.
  • Bandwidth Adaptation: Prefetch and burst control logic decouple external DDR bandwidth from kernel throughput by overlapping data load with compute, hiding memory latencies behind compute pipelines.
  • Resource Formulas: Selection of $(T, Q, G)$ is subject to device constraints, balancing LUT, URAM, BRAM, and DSP counts for best pipeline saturation. For example: TLMM-FUSE module uses 43,137 LUT, 51,894 FF, 5.5 BRAM, 320 DSP, and 0 URAM (for the TLMM kernel itself) on the AMD Kria KV260 platform (Qiao et al., 3 Oct 2025).
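
One way to see how the group size $G$ trades table size against index storage is to count the table entries and the bits needed for each bitpacked index. The helper below is a hypothetical illustration of that arithmetic only; the $G$ values shown are examples, not the configuration reported in the papers.

```python
def group_storage(G):
    """Return (table entries, bits per group index, stored bits per ternary weight)."""
    entries = 3 ** G                       # one table entry per ternary pattern
    bits = (entries - 1).bit_length()      # smallest width that holds indices 0..3**G - 1
    return entries, bits, bits / G

for G in (2, 3, 5):
    entries, bits, per_weight = group_storage(G)
    print(f"G={G}: {entries} table entries, {bits}-bit index, {per_weight:.2f} bits/weight")
```

Larger groups bring the stored cost closer to the 1.58-bit information content of a ternary weight, but the table grows as $3^G$, which is the kind of pressure the $(T, Q, G)$ selection has to balance against LUT and URAM budgets.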

5. Integrated Quantization, Normalization, and Operator Fusion

Normalization and quantization–dequantization are fully integrated as streaming, pipelined modules:

  • RMSNorm and Quantization: Operate in two passes, using a global max and root-mean-square reduction, followed by per-element scaling and truncation (to 8-bit signed). Both steps share multiplier pipelines for area efficiency.
  • Dequantization: Implemented inline in TLMM-FUSE and post-processing, applying stored $\Delta$ scale factors directly in the minor accumulation steps.
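
The two-pass normalize-then-quantize flow can be sketched as follows. This is an assumption-laden illustration: the function name, the `gamma` gain vector, the `eps` term, and the use of round-to-nearest (the text above says truncation) are choices made here for a compact example, not details confirmed by the source.

```python
import numpy as np

def rmsnorm_absmax_quant(x, gamma, eps=1e-6):
    """RMSNorm followed by symmetric AbsMax INT8 quantization of one vector."""
    # Pass 1: two reductions over the vector (mean square and absolute maximum).
    rms = np.sqrt(np.mean(x * x) + eps)
    absmax = np.max(np.abs(x * gamma)) / rms     # largest magnitude after normalization
    # Pass 2: per-element scaling to the signed 8-bit range.
    delta = absmax / 127.0 if absmax > 0 else 1.0
    y = x * gamma / rms
    q = np.clip(np.round(y / delta), -127, 127).astype(np.int8)
    return q, delta                              # dequantized value is q * delta

x = np.array([0.5, -2.0, 1.0, 4.0])
q, delta = rmsnorm_absmax_quant(x, np.ones_like(x))
```

The returned `delta` plays the role of the stored $\Delta$ scale factor: multiplying an integer accumulation result by it recovers the real-valued output.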

Elementwise operators such as GELU/SiLU, RoPE, and channel-wise max/reduce are fused into major matmul and attention units to minimize extra buffering and pipeline stages (Qiao et al., 3 Oct 2025).

6. Quantitative Performance and Resource Metrics

Reported metrics demonstrate the end-to-end efficiency of the pipeline on the AMD Kria KV260 SoC, for BitNet models with 1.58-bit weights and 8-bit activations:

| Mode | Throughput (tk/s) | Power (W) | Efficiency (tk/J) | TTFT (64–128-token prompts) |
|---|---|---|---|---|
| Prefill | 143 | 4.8 | 29.8 | 0.45–0.96 s |
| Decode | 25 | 4.8 | 5.2 | See above |

Overall resource usage can reach 84% LUT, 28% FF, 68% BRAM, 94% URAM, 49% DSP (for the full pipeline at evaluated scale) (Qiao et al., 3 Oct 2025). Prefill is compute-limited, whereas decode is DRAM bandwidth-limited as context lengths increase.
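
The efficiency figures in the table above follow directly from throughput divided by power, since tokens per second over joules per second gives tokens per joule. A quick check using the reported values (the function name is illustrative):

```python
def tokens_per_joule(tokens_per_s, watts):
    """Energy efficiency: (tokens/s) / (J/s) = tokens/J."""
    return tokens_per_s / watts

prefill_eff = tokens_per_joule(143, 4.8)   # rounds to the reported 29.8 tk/J
decode_eff = tokens_per_joule(25, 4.8)     # rounds to the reported 5.2 tk/J
print(round(prefill_eff, 1), round(decode_eff, 1))
```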

7. Design Trade-offs, Bottlenecks, and Scaling Properties

The table-lookup matrix multiplication paradigm (using LUTs rather than DSPs for matmul) allows for efficient exploitation of modern edge-FPGA resource mixes. The bandwidth bottleneck in decoding constrains maximum context lengths or model width before off-chip traffic dominates, a property determined by the $O(Md/\mathrm{BW})$ scaling of decode latency. In contrast, prefill performance scales as $O(d^2/\text{DSP})$.
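
The decode-side scaling can be made concrete with a rough traffic model. The sketch below reads $M$ as the context length and $d$ as the model width, which the text implies but does not define, and it counts only the cached key and value data streamed per generated token. Every parameter name and the factor of two for K plus V are assumptions made for illustration.

```python
def kv_stream_seconds(context_len, d_model, n_layers, bytes_per_elem, bw_bytes_per_s):
    """Lower-bound time to stream the cached K and V tensors once per decoded token."""
    kv_bytes = 2 * context_len * d_model * n_layers * bytes_per_elem   # K and V
    return kv_bytes / bw_bytes_per_s

# Hypothetical configuration, used only to show the linear dependence on context length.
base = kv_stream_seconds(512, 1024, 24, 1, 10e9)
doubled = kv_stream_seconds(1024, 1024, 24, 1, 10e9)
print(doubled / base)
```

Under this model, doubling either the context length or the model width doubles the per-token streaming time, while doubling the available bandwidth halves it, which is the $O(Md/\mathrm{BW})$ behavior described above.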

To maximize end-to-end throughput, resource allocation between prefill (batch, matrix-matrix) and decode (single-token, matrix-vector) is tuned by adjusting parallel TLMM lanes (TT), URAM bank width, and attention head parallelism.

A plausible implication is that the TeLLMe architecture, through its combination of low-bit quantization, streaming depth, and operator fusion, sets a design pattern for future edge-device LLM inference accelerators, achieving fully on-device transformer-based generation with power budgets ($<7$ W) and memory budgets previously impractical for this class of models (Qiao et al., 3 Oct 2025, Qiao et al., 22 Apr 2025).


References:

  • TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs (Qiao et al., 3 Oct 2025)
  • TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs (Qiao et al., 22 Apr 2025)
