hls4ml: Accelerating ML for FPGAs & ASICs
- hls4ml is an open-source library that translates trained ML models into synthesizable HLS code, bridging high-level frameworks with low-level hardware design.
- It supports diverse model types, quantization, and pruning methods to optimize latency, resource, and power usage for real-time, edge, and quantum applications.
- The tool employs a modular three-phase pipeline, extensive operator libraries, and surrogate estimation techniques for rapid design-space exploration and efficient accelerator synthesis.
hls4ml is an open-source, modular software platform that automates the translation of trained ML models from popular frameworks into synthesizable, deeply pipelined dataflow C++ or SystemC for high-level synthesis (HLS), enabling deployment on field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Its core strength lies in bridging the abstraction gap between high-level ML frameworks (Keras, TensorFlow, PyTorch, ONNX) and low-level hardware design, targeting domains with stringent latency, resource, and power constraints such as real-time physics triggers, embedded inference at the edge, and quantum readout systems (Schulte et al., 1 Dec 2025, Fahim et al., 2021).
1. Architecture and Supported Workflows
hls4ml follows a modular, three-phase conversion pipeline (a minimal end-to-end sketch follows this list):
- Front-End Frameworks: Models are parsed from Keras (2/3), PyTorch (via torch.fx or ONNX export), QKeras (for quantized models), Brevitas, and QONNX/HGQ. The model graph is ingested into a common intermediate representation (ModelGraph), which encodes weights, layer types, quantizers, and shape metadata (Schulte et al., 1 Dec 2025).
- IR Optimization: The ModelGraph undergoes optimization passes (precision propagation, batch-norm folding, dead-node removal, quantizer inference), supporting both homogeneous and heterogeneous quantization, along with batch-norm-fused operator instantiation (Fahim et al., 2021, Schulte et al., 1 Dec 2025).
- Backend Code Generation: Target-specific HLS kernel templates are emitted for the chosen flow: Xilinx Vitis/Vivado HLS, Intel oneAPI HLS (with legacy Quartus HLS support), or Siemens Catapult HLS (for ASIC synthesis). Each backend implements tool-specific pragma insertion, fixed-point/integer/floating-point type mapping, dataflow streaming, AXI4/AXIS IO wrappers, and TCL automation (Curzel et al., 2021, Schulte et al., 1 Dec 2025).
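A minimal sketch of this end-to-end flow, assuming a trained Keras model object `model` and keyword arguments as used in recent hls4ml releases (exact signatures may differ between versions):

```python
import hls4ml

# Front end + IR: parse the trained Keras model into the ModelGraph IR and
# derive a baseline configuration for the optimization passes.
config = hls4ml.utils.config_from_keras_model(model, granularity='model')

# Backend: emit HLS C++ and project scripts for the chosen toolchain.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend='Vitis',              # or 'Vivado', 'oneAPI', 'Catapult'
    output_dir='hls4ml_prj',
)

hls_model.compile()               # bit-accurate C-simulation library
hls_model.build(synth=True)       # run the HLS tool to synthesize the design
```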
The configuration interface is Pythonic (YAML or dict), exposing layer- and model-level controls: datatype descriptors, reuse factors (controlling the trade-off between parallelism and resource folding), the HLS “strategy” (latency, resource, or distributed arithmetic), clock period, IO topology (io_stream/io_parallel), and backend target (Schulte et al., 1 Dec 2025, Fahim et al., 2021).
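For illustration, these knobs can be passed as a hand-written configuration dictionary; the key names below follow current hls4ml conventions but should be treated as version-dependent, and the FPGA part number is only an example:

```python
config = {
    'Model': {
        'Precision': 'ap_fixed<16,6>',   # default datatype: 16 bits total, 6 integer bits
        'ReuseFactor': 4,                # fold each multiplier array 4x in time
        'Strategy': 'Latency',           # or 'Resource' (and DA where supported)
    },
}

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    clock_period=5,                      # target clock period in ns (200 MHz)
    io_type='io_parallel',               # or 'io_stream' for layer-to-layer FIFOs
    backend='Vitis',
    part='xcu250-figd2104-2L-e',         # example AMD/Xilinx Alveo U250 part
)
```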
2. Quantization, Pruning, and Model Compression
hls4ml supports an extensive suite of quantization and model compression methodologies:
- Post-Training Quantization (PTQ): Static, layer-wise assignment of fixed-point bit-widths, often expressed as ap_fixed<m,i> (total width, integer bits). hls4ml performs dynamic range scanning to guide bit allocation, with typical Qm.f encoding for fixed-point (Fahim et al., 2021, Aarrestad et al., 2021).
- Quantization-Aware Training (QAT): Integration with QKeras, Brevitas, and QONNX imports quantization operators and per-layer bit-widths directly from the ML training pipeline, achieving high accuracy retention even at low precision (e.g., 6–8 bits, or even binary/ternary for certain tasks); a minimal QKeras sketch appears at the end of this section (Schulte et al., 1 Dec 2025, Guglielmo et al., 24 Jan 2025, Guglielmo et al., 2020).
- Heterogeneous Quantization (HGQ/AutoQKeras): AutoQKeras random/Bayesian search co-optimizes per-layer bit-widths and resource usage for given accuracy targets, often employing a mix of 4–8-bit quantizers (Ghielmetti et al., 2022).
- Pruning and Sparsity: Quantization-aware pruning routines (lottery ticket, magnitude-based) allow aggressive weight sparsity, supporting coordinate list (COO) and unstructured implementations for further resource reduction (Fahim et al., 2021).
In convolutional networks, line-buffered convolution and FIFO-depth optimization are used to minimize BRAM usage for large feature maps (Ghielmetti et al., 2022). For BNNs/TNNs, XNOR and popcount operations and compile-time thresholding eliminate DSP usage entirely (Guglielmo et al., 2020).
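As a concrete QAT illustration, a small QKeras model carries its quantizers with it, and hls4ml can derive matching fixed-point types from them; the 6-bit settings and layer sizes below are chosen purely for illustration:

```python
import hls4ml
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

inputs = Input(shape=(16,))
x = QDense(32,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0))(inputs)
x = QActivation(quantized_relu(6))(x)
outputs = QDense(5,
                 kernel_quantizer=quantized_bits(6, 0, alpha=1),
                 bias_quantizer=quantized_bits(6, 0))(x)
qmodel = Model(inputs, outputs)

# After training, the per-layer quantizers are read from the model itself,
# so the generated HLS types reflect the precision used during training.
config = hls4ml.utils.config_from_keras_model(qmodel, granularity='name')
```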
3. Operator Library, Dataflow Optimizations, and Resource Control
The core operator library in hls4ml consists of C++ templates instantiated with parameters assigned by the front end:
- Supported Layer Types: Fully-connected (Dense), Convolutional (Conv1D/2D), Pooling, Activation (ReLU, sigmoid, tanh, softmax), Normalization (BatchNorm/LayerNorm), LSTM/GRU, multi-head attention (Transformers), upsampling/concatenation (U-Net), and boosted decision trees (BDTs) (Schulte et al., 1 Dec 2025, Jiang et al., 8 Sep 2024, Aarrestad et al., 2021, Summers et al., 2020).
- Pragma-based Dataflow: Extensive use of #pragma HLS PIPELINE II=1, #pragma HLS dataflow, #pragma HLS ARRAY_PARTITION, and #pragma HLS UNROLL enables full pipelining across and within layers, maximizing throughput (often one inference per clock cycle in fully unfolded configurations) (Curzel et al., 2021, Aarrestad et al., 2021, Jiang et al., 8 Sep 2024).
- Reuse Factors/Strategies: The per-layer reuse_factor controls the resource/latency trade-off: lower values yield higher parallelism and lower latency at increased DSP/LUT/FF usage, while higher values fold operations temporally to fit tighter resource budgets at increased latency (latency ∝ reuse, throughput = f_clk / II); see the per-layer configuration sketch at the end of this section (Schulte et al., 1 Dec 2025, Curzel et al., 2021, Shi et al., 2023).
- Mixed-Precision and Back-End Portability: ac_types replaces ap_types for tool-agnostic fixed/floating-point support (self-contained C++), with activation LUTs constructed using constexpr/gcem to support multiple HLS backends (Vivado, Vitis, Bambu, Catapult) (Curzel et al., 2021).
- Distributed Arithmetic (DA): For ultra-low resource mapping, DA emulates MACs with LUT adder trees, eliminating DSP usage and favoring small-to-medium bit-widths (Schulte et al., 1 Dec 2025, Ghielmetti et al., 2022).
The system can fully pipeline even complex constructs such as multi-head attention, layer normalization, and BDT ensembles, partitioning arrays for concurrent access and streaming results through fully dataflowed AXI interfaces (Jiang et al., 8 Sep 2024, Summers et al., 2020).
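For example, reuse factors and strategies can be set per layer on a name-granularity configuration; the layer names ('fc1', 'output') are hypothetical, and the exact configuration keys should be checked against the hls4ml version in use:

```python
import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Large hidden layer: fold 64 multiplications onto each multiplier to save DSPs.
config['LayerName']['fc1']['ReuseFactor'] = 64
config['LayerName']['fc1']['Strategy'] = 'Resource'

# Small output layer: keep fully parallel for minimum latency.
config['LayerName']['output']['ReuseFactor'] = 1
config['LayerName']['output']['Strategy'] = 'Latency'

# Narrow the weight precision of the hidden layer only (heterogeneous precision).
config['LayerName']['fc1']['Precision']['weight'] = 'ap_fixed<8,2>'
```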
4. Supported Model Classes and Application Domains
hls4ml supports a broad range of model architectures:
- Feedforward (Dense/MLP): Used in jet tagging, anomaly detection, and control systems. Typical configurations achieve latencies in the sub-microsecond to 5 μs range, using <10 % of FPGA resources at 6–8-bit quantization (Fahim et al., 2021, Schulte et al., 1 Dec 2025).
- CNNs: Compressed, quantized ENet variants for image segmentation run at 4.8–4.9 ms per frame (semantic segmentation of Cityscapes images), consuming <30 % of ZCU102 resources, with optimal configurations using layer-wise mixed quantization (Ghielmetti et al., 2022).
- RNNs (LSTM/GRU): Fully pipelined, high-throughput inference for top/flavor tagging or QuickDraw sequence classification; latencies in the 1.7–35 μs range for fully parallel implementations, with resource/latency trade-offs tuned via reuse or unrolling (Khoda et al., 2022).
- Transformers: Attention, softmax, and layer-normalization kernels are mapped to pipelined HLS functions; 1.9–3.5 μs end-to-end latency for moderate-length sequences at 8–12 W dynamic power (VU13P), orders of magnitude lower than comparable GPU inference (Jiang et al., 8 Sep 2024).
- BDTs: Fully unrolled comparator and adder trees enable 52–62 ns latency for 100-tree, depth-4 ensembles, with zero DSP usage for fixed-point arithmetic and under 10 % of the LUTs of large FPGAs (Summers et al., 2020).
- Edge Inference and Quantum Readout: On the Arria 10 SoC and Xilinx RFSoC (e.g., the QICK platform), multi-layer networks with tailored quantization/unrolling achieve millisecond-scale (1.57–1.74 ms) and nanosecond-scale (32 ns) latencies, respectively, maintaining high readout fidelity (>96 %) at low resource cost (Shi et al., 2023, Guglielmo et al., 24 Jan 2025).
hls4ml-generated accelerators are validated in production settings covering LHC trigger systems, real-time accelerator control, and embedded quantum measurement.
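Before such deployments, the generated design is typically checked numerically against the floating-point model via C simulation; a minimal sketch, reusing `model` and `hls_model` from the earlier examples and a hypothetical test array `X_test`:

```python
import numpy as np

hls_model.compile()                              # build the C-simulation library

y_float = model.predict(X_test)                  # floating-point reference
y_fixed = hls_model.predict(np.ascontiguousarray(X_test, dtype=np.float32))

# Deviation introduced by fixed-point quantization and HLS implementation choices.
print("max abs deviation:", np.max(np.abs(y_float - y_fixed)))
```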
5. Performance Results, Surrogate Estimation, and Design-Space Exploration
hls4ml has enabled the synthesis, placement, and empirical characterization of hundreds of thousands of designs, driving a public benchmark (wa-hls4ml) and a suite of resource/latency estimation surrogate models (Hawks et al., 6 Nov 2025).
- Performance Highlights: QAT/hybrid quantization and careful reuse selection yield >90 % accuracy with 90–99 % DSP reduction, 5–10× lower dynamic power than GPU inference, and low latency (<5 μs typical for Level-1 trigger applications, <200 ns for binarized MLPs, <5 ms for image segmentation pipelines) (Aarrestad et al., 2021, Ghielmetti et al., 2022, Schulte et al., 1 Dec 2025).
- Resource/Latency Estimation: The wa-hls4ml benchmark comprises over 680,000 network samples with post-synthesis resource/latency numbers; transformer and GNN surrogate predictors achieve SMAPE of about 2.9 % (LUT/FF) and R² > 0.9 for latency/II, allowing rapid iteration over the architecture space without full HLS synthesis (Hawks et al., 6 Nov 2025).
- Optimization Formulas: hls4ml provides first-order resource and latency estimates via the following relations (a small worked sketch follows this list):
- LUTs ≃ α · N_MAC + β · N_ADD + γ · N_ACT (empirically fit)
- Throughput = f_clk / II
- DSPs = ceil(Ops / reuse)
- Memory footprint: RAM_bits = Σ_layers (N_weights × bit_width)
- Design loops are accelerated by integrating these surrogates into the hls4ml CLI and Python API.
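To make these relations concrete, the sketch below computes first-order estimates for a single dense layer; the coefficients α, β, γ are illustrative placeholders rather than values shipped with hls4ml, whose actual estimates come from fitted surrogates and HLS synthesis reports:

```python
import math

def dense_layer_estimates(n_in, n_out, bit_width, reuse, f_clk_mhz,
                          alpha=2.0, beta=1.0, gamma=8.0):
    """First-order resource/latency estimates for one dense layer
    (illustrative coefficients, not hls4ml's fitted values)."""
    n_mac = n_in * n_out                  # multiply-accumulate operations
    n_add = n_out                         # bias additions
    n_act = n_out                         # activation evaluations

    luts = alpha * n_mac + beta * n_add + gamma * n_act  # LUTs ~ a*N_MAC + b*N_ADD + c*N_ACT
    dsps = math.ceil(n_mac / reuse)       # DSPs = ceil(Ops / reuse)
    ii = reuse                            # initiation interval grows with folding
    throughput_hz = f_clk_mhz * 1e6 / ii  # Throughput = f_clk / II
    ram_bits = n_mac * bit_width          # RAM_bits = N_weights x bit_width

    return {'LUT': luts, 'DSP': dsps, 'II': ii,
            'throughput_hz': throughput_hz, 'RAM_bits': ram_bits}

# Example: 64x32 dense layer, 8-bit weights, reuse factor 4, 200 MHz clock.
print(dense_layer_estimates(64, 32, 8, 4, 200))
```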
6. Extensibility, Device Support, and Ecosystem
hls4ml is designed for backend and front-end extensibility:
- Supported HLS Compilers: Vitis HLS (all AMD/Xilinx series), Intel oneAPI/Quartus HLS (Arria, Stratix, Agilex), Catapult HLS (all major ASIC flows), and Bambu (open-source), targeting high-end (UltraScale+, Alveo), embedded (Zynq, Arria 10 SoC), and quantum readout (RFSoC) devices; a backend-selection sketch follows this list (Schulte et al., 1 Dec 2025, Curzel et al., 2021).
- Ecosystem Integration: Compatible with AutoQKeras, MetaML-Pro (NAS and resource-constrained optimization), Coyote v2 (multi-FPGA shells), and planned deep integration with on-chip weight loading and Vitis AI (Schulte et al., 1 Dec 2025, Jiang et al., 8 Sep 2024).
- User Implications: Migration is non-intrusive: existing API calls remain valid, while users gain additional datatype descriptors, tool switching (e.g., hls4ml.configure(hls_tool="bambu")), and more explicit YAML/config options for floating-point and mixed-precision control (Curzel et al., 2021). Reports are enriched with both resource utilization and per-layer expected accuracy deltas.
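At the API level, switching toolchains amounts to selecting a different backend at conversion time; a brief sketch using backend names from recent hls4ml releases (availability and spelling may vary by version):

```python
import hls4ml

# Same model and configuration, two different downstream toolchains.
fpga_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Vitis',
    part='xcu250-figd2104-2L-e')                     # example FPGA target

asic_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Catapult')    # Siemens Catapult ASIC flow
```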
Table 1: Summary of Major hls4ml Capabilities
| Feature | Details / Supported Options |
|---|---|
| ML Frameworks | Keras, PyTorch, ONNX, QKeras, Brevitas, QONNX, HGQ |
| Layer Types | Dense, Conv1D/2D, Pool, Activation, BatchNorm, LSTM, GRU, MHA, BDT |
| Quantization | PTQ, QAT, auto-mixed, BNN/TNN, HGQ per-layer |
| Pruning/Sparsity | Magnitude-based, lottery ticket, coordinate-list (COO) |
| HLS Compilers | Vitis HLS, Vivado HLS, Intel oneAPI/Quartus HLS, Catapult, Bambu |
| Target Devices | Xilinx (Zynq, UltraScale+, Alveo), Intel (Arria, Stratix), ASIC, RFSoC |
| Dataflow/IO | io_stream/FIFO, io_parallel, AXI4-Stream, AXI4-Lite, custom memory-mapped |
| Resource/Latency Trade-off Knobs | reuse_factor, per-layer precision, strategy (latency/resource), DA |
| Surrogate Estimation | wa-hls4ml Transformer/GNN/MLP predictors; resource/latency within a few % |
7. Outlook and Future Directions
hls4ml is advancing toward wider FPGA and ASIC coverage, finer-grained quantization, deeper integration with automated resource-accuracy design (surrogate modeling and AutoML), tool-agnostic code generation, and native support for high-throughput, sparse, or long-sequence models (LLMs, large RNNs) (Schulte et al., 1 Dec 2025, Hawks et al., 6 Nov 2025, Curzel et al., 2021, Jiang et al., 8 Sep 2024). Current development is focused on expanding backend support (e.g., integrating non-Xilinx/Intel FPGAs and alternative HLS flows), supporting dynamic precision and on-chip weight management for LLMs, and tighter coupling with co-design and neural architecture search frameworks. These directions position hls4ml as a foundational layer for scientific, industrial, and quantum edge-acceleration on reconfigurable hardware.