HLSTransform: FPGA Accelerator for LLMs
- HLSTransform is a high-level synthesis–based FPGA accelerator that implements the entire transformer inference pipeline for large language models.
- It integrates advanced HLS optimizations such as loop pipelining, loop unrolling, and array partitioning to maximize throughput and hardware efficiency.
- The design employs int8 quantization with minimal accuracy loss, enabling sustainable, low-power inference for edge and embedded applications.
HLSTransform is a high-level synthesis (HLS)–based field-programmable gate array (FPGA) accelerator architecture specifically designed for energy-efficient inference of LLMs, exemplified by Llama 2, at scale. HLSTransform applies HLS to implement the entire transformer inference pipeline on FPGAs, optimizing both hardware resource allocation and algorithmic efficiency. This enables significant reductions in energy consumption and inference latency compared to conventional CPU and GPU platforms, addressing the growing demand for sustainable machine learning deployment, especially in scenarios constrained by energy budgets or system heat dissipation, such as edge and embedded computing environments (He et al., 2024).
1. Architectural Overview
HLSTransform comprises a tightly integrated host/kernel split in which the host software operates on conventional x86 infrastructure (e.g., AWS t2.2xlarge), orchestrating token sampling, prompt management, and Direct Memory Access (DMA) data transfers. The compute-intensive "kernel"—written in C++ and marked with Xilinx HLS pragmas—is synthesized automatically into register-transfer level (RTL) modules for mapping to a Xilinx Virtex UltraScale+ VU9P FPGA.
The FPGA kernel implements a forward pass of the 110M-parameter Llama 2 transformer. Each transformer layer—comprising RMSNorm, QKV projections, rotary-position embedding, scaled-dot-product attention, softmax, context matrix multiplication, output projection, residual additions, feed-forward multilayer perceptron (MLP) with SwiGLU activation, and subsequent residual additions—is structured as a pipeline of HLS functions. Dense linear algebra (matrix-vector/matrix-matrix multiplications) exploits on-chip line buffers and DSP slices.
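The per-layer structure can be illustrated with a skeleton of the kernel's call sequence. The following C++ sketch is illustrative only: function names, signatures, and cache layout are assumptions rather than the repository's actual code, and the stage bodies are omitted.

```cpp
// Illustrative skeleton of one transformer layer as a pipeline of HLS functions.
// Names and signatures are assumptions; stage bodies are omitted for brevity.
#include <cstdint>

constexpr int DIM = 768;   // model dimension of the 110M-parameter Llama 2 model

// Stage prototypes (implementations omitted).
void rmsnorm(const float x[DIM], const float gamma[DIM], float out[DIM]);
void matvec_q8(const int8_t* w, float scale, const float x[DIM], float* out, int rows, int cols);
void rope(float q[DIM], float k[DIM], int pos);
void attention(const float q[DIM], const float* k_cache, const float* v_cache, int pos, float out[DIM]);
void swiglu_ffn(const float x[DIM], const int8_t* w1, const int8_t* w2, const int8_t* w3, float out[DIM]);

void transformer_layer(float x[DIM],
                       const int8_t* wq, const int8_t* wk, const int8_t* wv, const int8_t* wo,
                       const int8_t* w1, const int8_t* w2, const int8_t* w3,
                       const float gamma_attn[DIM], const float gamma_ffn[DIM],
                       float* k_cache, float* v_cache, int pos) {
    float xb[DIM], q[DIM], k[DIM], v[DIM], attn_out[DIM], ffn_out[DIM];

    // Attention block: RMSNorm -> QKV projections -> RoPE -> attention -> output projection.
    rmsnorm(x, gamma_attn, xb);
    matvec_q8(wq, 1.0f, xb, q, DIM, DIM);
    matvec_q8(wk, 1.0f, xb, k, DIM, DIM);
    matvec_q8(wv, 1.0f, xb, v, DIM, DIM);
    rope(q, k, pos);                               // rotary position embedding
    for (int i = 0; i < DIM; i++) {                // append current key/value to the caches
        k_cache[pos * DIM + i] = k[i];
        v_cache[pos * DIM + i] = v[i];
    }
    attention(q, k_cache, v_cache, pos, attn_out); // scaled dot-product attention + softmax
    matvec_q8(wo, 1.0f, attn_out, xb, DIM, DIM);
    for (int i = 0; i < DIM; i++) x[i] += xb[i];   // residual add

    // Feed-forward block: RMSNorm -> SwiGLU MLP -> residual add.
    rmsnorm(x, gamma_ffn, xb);
    swiglu_ffn(xb, w1, w2, w3, ffn_out);
    for (int i = 0; i < DIM; i++) x[i] += ffn_out[i];
}
```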
To maximize throughput and parallelism, key tensors, such as attention weights and MLP parameters, are quantized post-training to int8 and streamed into local on-chip buffers via wide (256-bit, 32-int8s/cycle) AXI4-burst reads. Intermediate results, such as rotary-encoded queries/keys and softmax values, are computed using floating-point or integer HLS operators as appropriate.
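As a concrete illustration of the wide weight loads, the following Vitis HLS fragment (hypothetical, not the repository's actual code) burst-reads a 256-bit AXI4 port and unpacks 32 int8 weights per beat into an on-chip buffer:

```cpp
// Sketch of streaming int8 weights from off-chip memory over a 256-bit AXI4
// interface, unpacking 32 int8 values per 256-bit beat. Names are illustrative.
#include <ap_int.h>
#include <cstdint>

void load_weights(const ap_uint<256>* gmem, int8_t local[], int n_words) {
#pragma HLS INTERFACE m_axi port=gmem offset=slave bundle=gmem0 max_read_burst_length=64
#pragma HLS INTERFACE s_axilite port=n_words
#pragma HLS INTERFACE s_axilite port=return

    // Sequential, pipelined access lets the tool infer a single long AXI burst.
    for (int i = 0; i < n_words; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<256> word = gmem[i];
        for (int j = 0; j < 32; j++) {
#pragma HLS UNROLL
            ap_uint<8> byte = word.range(8 * j + 7, 8 * j);
            local[i * 32 + j] = (int8_t)byte.to_uint();  // reinterpret the raw byte as a signed int8 weight
        }
    }
}
```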
2. HLS-Specific Hardware Optimizations
HLSTransform applies several HLS-level optimizations to improve inference performance and FPGA resource utilization:
- Loop Pipelining: Critical loops in the compute path (e.g., matrix multiplications, attention, rotary embedding) are annotated with `#pragma HLS pipeline II=1`, targeting an initiation interval (II) of 1 where practical. The inner accumulation of a 768×768 matmul achieves II=50, for example, yet end-to-end throughput still approaches one output per clock once the pipeline is filled (a combined sketch of these pragmas follows this list).
- Loop Unrolling: Inner-loop unrolling (e.g., unrolling the 768-element reduction in RMSNorm by a factor of 8) yields up to 8× throughput at the cost of increased LUT/DSP usage, particularly on smaller reduction operations.
- Array Partitioning: `#pragma HLS array_partition variable=A complete dim=1` splits int8 weight buffers across multiple BRAM banks, supporting concurrent reads of four or eight weights per clock cycle. This is essential for parallelizing multi-channel dot products without single-bank bottlenecks.
- AXI4 Burst Transfers / Bus Widening: Wide 256-bit memory interfaces amortize off-chip memory latency, ensuring that the compute pipeline is not I/O bound.
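A minimal sketch, assuming an 8-wide unroll and a per-tensor int8 scale, of how these pragmas combine in a dense matrix-vector product (illustrative code, not the repository's kernel):

```cpp
// Pipelined, unrolled int8 matrix-vector product with partitioned buffers.
// Dimensions, names, and the float accumulator are illustrative; the actual
// kernel may accumulate in int32 as described in Section 3.
#include <cstdint>

#define ROWS 768
#define COLS 768

void dense_matvec_int8(const int8_t w[ROWS][COLS], const float x[COLS],
                       float scale, float out[ROWS]) {
    // Split buffers across BRAM banks so 8 weights/activations can be read per cycle.
#pragma HLS ARRAY_PARTITION variable=w cyclic factor=8 dim=2
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=8 dim=1

row_loop:
    for (int i = 0; i < ROWS; i++) {
        float acc = 0.0f;
    col_loop:
        for (int j = 0; j < COLS; j += 8) {
#pragma HLS PIPELINE II=1
        lane_loop:
            for (int u = 0; u < 8; u++) {
#pragma HLS UNROLL
                acc += (float)w[i][j + u] * x[j + u];  // 8 products per unrolled iteration
            }
        }
        out[i] = acc * scale;  // dequantize with the per-tensor scale
    }
}
```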
These measures enable HLSTransform to sustain token throughput rates of 57.11 tokens/s at 200 MHz on the VU9P FPGA, utilizing approximately 30% of available LUTs, 28% of DSPs, and 35 MB of BRAM.
3. Quantization and Numerical Efficiency
All transformer projection and MLP weights are quantized using the Q8_0 symmetric post-training quantization format, w_q = round(127 * (w / ∥w∥_∞)), with int8 range [–127, +127]. Activations produced within matrix multiplications accumulate in int32 and are requantized or transformed to float for subsequent softmax or SwiGLU layers. RMSNorm parameters (γ, β) are retained in float32 to prevent quantization-induced norm drift.
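The quantize/dequantize flow can be sketched in plain C++ as follows. This sketch uses a single per-tensor scale for brevity, whereas Q8_0 proper applies the scale per small block of weights; the helper names are assumptions.

```cpp
// Symmetric int8 quantization (w_q = round(127 * w / max|w|)) and an
// int32-accumulating dot product. Per-block scaling is omitted; names
// and structure are illustrative.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedTensor {
    std::vector<int8_t> q;  // int8 values in [-127, 127]
    float scale;            // dequantize as w ≈ q * scale
};

QuantizedTensor quantize_q8(const std::vector<float>& w) {
    float max_abs = 0.0f;   // ∥w∥_∞
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

    QuantizedTensor out;
    out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    out.q.reserve(w.size());
    for (float v : w)
        out.q.push_back(static_cast<int8_t>(std::lround(v / out.scale)));
    return out;
}

// Dot product that accumulates in int32 and rescales to float, mirroring the
// requantization flow described above.
float dot_q8(const QuantizedTensor& a, const QuantizedTensor& b) {
    int32_t acc = 0;
    for (size_t i = 0; i < a.q.size(); i++)
        acc += static_cast<int32_t>(a.q[i]) * static_cast<int32_t>(b.q[i]);
    return static_cast<float>(acc) * a.scale * b.scale;
}
```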
Comprehensive evaluation shows that quantization has a negligible effect on model perplexity (<0.04% loss), preserving predictive utility while reducing memory footprint and allowing larger models to fit within on-chip constraints.
4. Performance and Energy Metrics
Performance is characterized across throughput, latency, and energy domains, with comparisons against server-grade CPUs (Intel Xeon Broadwell E5-2686 v4) and consumer GPUs (NVIDIA RTX 3090):
- Throughput: HLSTransform sustains 57.11 tokens/s (batch size 1) on the FPGA, outperforming the CPU by 2.46× and reaching 0.53× the throughput of the RTX 3090. This corresponds to an average per-token latency of 17.51 ms on the FPGA (vs. 43.08 ms on CPU and 9.34 ms on GPU).
- Energy Efficiency: Average power consumption during a 256-token inference is 9.0 W (FPGA), compared to 42.5 W (CPU) and 126.9 W (GPU). The energy cost per token is 0.04 mWh (FPGA), a 12.75× reduction over the CPU (0.51 mWh/token) and an 8.25× reduction over the GPU (0.33 mWh/token); a brief consistency check of these figures follows this list.
- Resource Utilization: On a VU9P FPGA at 200 MHz, usage is ~310,000 LUTs (30%), ~380,000 FFs (35%), 750 DSPs (28%), and ~400 BRAM/URAM blocks (≈300 Mb buffer space).
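These figures are internally consistent if per-token energy is taken as average power divided by sustained throughput (an assumption about how the numbers were derived):

energy/token = 9.0 W ÷ 57.11 tokens/s ≈ 0.158 J/token = 0.158 / 3.6 mWh/token ≈ 0.044 mWh/token ≈ 0.04 mWh/token

The CPU and GPU figures (0.51 and 0.33 mWh/token) follow from the same calculation using their respective power draws and throughputs.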
These reductions are achieved without significant loss of model accuracy or inference fidelity.
5. Trade-Offs, Limitations, and Applicability
Precision reduction (to 8-bit quantization) is a key design lever, yielding substantial memory footprint and resource savings while introducing <0.1% perplexity degradation. More aggressive quantization (e.g., 4-bit or integer-only schemes) could further improve resource utilization, but with unquantified accuracy impact.
Throughput is limited by FPGA clock speed (200–300 MHz) relative to GPUs (1.4 GHz base), yielding lower absolute speed but much higher energy efficiency. On-chip memory restricts model size to ≈110M parameters; scaling to larger LLMs requires sharding, multi-FPGA solutions, or further quantization advances.
HLSTransform is thus well suited for deployment scenarios prioritizing power/cost efficiency over peak throughput, such as battery-operated systems, real-time embedded inference, or sustainability-focused datacenter deployments.
6. Open-Source Ecosystem and Replication
HLSTransform is available as open-source at github.com/HLSTransform/submission. The codebase is organized into:
- host/: C++ kernel driver (XRT API)
- kernel/: C++ HLS modules for all transformer subcomponents (forward, matmul, normalization, softmax, SwiGLU, quantization)
- build/: Vitis HLS project scripts for C-to-RTL synthesis
Replication is documented for AWS f1.2xlarge systems with Vitis 2023.x. The prescribed workflow includes cloning the repository, synthesizing the kernels with vitis_hls, building the host binaries, generating the .xclbin FPGA kernel, and creating the AWS FPGA image (AFI) required for F1 deployment. At runtime, the host allocates device buffers, streams token batches, and reads back logits, as sketched below.
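A minimal host-side sketch using the XRT native C++ API is shown below; the kernel name, argument order, and buffer sizes are assumptions, not the repository's actual interface.

```cpp
// Hypothetical host flow: load the .xclbin, allocate device buffers, stream a
// token batch to the kernel, and read back logits. Kernel name and argument
// order are illustrative.
#include <vector>
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

int main() {
    xrt::device device(0);                                  // open the FPGA device
    auto uuid = device.load_xclbin("hlstransform.xclbin");  // program the bitstream
    xrt::kernel fwd(device, uuid, "forward");               // assumed kernel name

    const int n_tokens = 256, vocab = 32000;                // assumed vocabulary size
    std::vector<int> tokens(n_tokens);
    std::vector<float> logits(static_cast<size_t>(n_tokens) * vocab);

    // Device buffers bound to the kernel's memory banks.
    xrt::bo tok_bo(device, tokens.size() * sizeof(int), fwd.group_id(0));
    xrt::bo out_bo(device, logits.size() * sizeof(float), fwd.group_id(1));

    tok_bo.write(tokens.data());
    tok_bo.sync(XCL_BO_SYNC_BO_TO_DEVICE);                  // DMA host -> FPGA

    auto run = fwd(tok_bo, out_bo, n_tokens);               // launch the kernel
    run.wait();

    out_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);                // DMA FPGA -> host
    out_bo.read(logits.data());
    return 0;
}
```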
7. Prospective Extensions
Potential future directions for HLSTransform include more aggressive quantization (e.g., Q4_0 or fully integer I-BERT-style schemes) to fit larger transformer models on-chip, multi-FPGA pipelining for model sharding and parallelism, optimization for batched inference (batch size >1), flash or sparse attention to reduce bandwidth, and integration with high-level Python-based machine learning workflows (e.g., hls4ml) to further democratize FPGA transformer inference (He et al., 2024).
These directions would expand the utility of the HLSTransform approach, enabling more scalable and flexible deployments of large transformer architectures under tight performance and power envelopes.