
hls4ml: Hardware-Aware ML Compiler

Updated 11 December 2025
  • hls4ml is an open-source platform that transforms trained machine learning models into hardware-amenable HLS code for FPGAs and ASICs.
  • The framework applies hardware-aware optimizations such as quantization, pruning, and dataflow pipelining to achieve ultra-low latency inference.
  • It enables precise design-space exploration and resource estimation, supporting a wide range of neural network architectures for real-time machine learning applications.

hls4ml is an open-source software platform and hardware generation framework that automates the transformation of trained ML models into high-level synthesis (HLS) code suitable for deployment on field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Architecturally, hls4ml operates as a modular “compiler,” ingesting models from Keras (including QKeras), PyTorch, and ONNX/QONNX, applying hardware-aware optimizations, and emitting hardware-amenable C++ or SystemC ready for vendor HLS tools such as Vivado HLS, Intel oneAPI DPC++, and Catapult HLS. Its core strengths lie in supporting quantized and pruned neural networks, streaming dataflow architectures, sub-microsecond inference latencies, and the ability to target diverse platforms, making it a key tool for domains demanding extreme resource efficiency and real-time inference—most notably scientific instrumentation and edge computing applications (Schulte et al., 1 Dec 2025, Fahim et al., 2021).

1. System Architecture and Software Workflow

hls4ml is architected around a multi-stage compilation flow: front ends for parsing ML models, a sequence of hardware-oriented optimizations, and back ends for code generation targeting major HLS toolchains. The frontend parsers support Keras (all TensorFlow versions, QKeras quantizers), PyTorch (FX, Brevitas, HAWQ), and ONNX/QONNX, extracting weights, quantization parameters, graph topology, and layer configuration into a device-agnostic internal representation (“ModelGraph”).
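As a concrete illustration of the front end, the following minimal sketch parses a trained Keras model into the internal ModelGraph and runs a bit-accurate software emulation. It assumes a trained Keras model bound to the name `model`; keyword names may differ slightly across hls4ml releases.

```python
import hls4ml

# Generate a configuration template from the trained Keras model;
# granularity='name' creates one entry per named layer so that precision
# and reuse factor can later be tuned layer by layer.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Parse the Keras graph into hls4ml's device-agnostic ModelGraph and
# attach the configuration; no vendor HLS tool is invoked at this stage.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls4ml_prj',
)

# Bit-accurate emulation of the fixed-point design in software.
hls_model.compile()
```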

Subsequent optimizer flows perform transformations such as affine+BatchNorm fusion, precision propagation, channel format transpositions, per-layer bitwidth inference, constant folding, streaming FIFO depth minimization (via RTL-level occupancy profiling), and the splitting of graphs for multi-component hardware synthesis. User configuration is provided as Python dictionaries or YAML files, setting parameters such as per-layer precision, reuse factors, dataflow vs. parallel I/O, distribution of resources, and synthesis strategies (‘Latency’, ‘Resource’, or ‘Distributed Arithmetic’).
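A sketch of how such a configuration might be expressed in Python is shown below. The dictionary keys ('Model', 'LayerName', 'Precision', 'ReuseFactor') follow hls4ml's documented configuration schema; the layer name 'dense_1' is a placeholder, and the exact shape of the per-layer precision entries can vary between releases.

```python
# Global defaults: 16-bit fixed point with 6 integer bits, fully unrolled.
config['Model']['Precision'] = 'ap_fixed<16,6>'
config['Model']['ReuseFactor'] = 1

# Per-layer overrides: a wide hidden layer is time-multiplexed and given
# narrower weights to save DSPs and LUTs.
config['LayerName']['dense_1']['Precision']['weight'] = 'ap_fixed<8,2>'
config['LayerName']['dense_1']['ReuseFactor'] = 8
```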

The backend maps the IR to hand-optimized C++/SystemC templates for each supported layer, emits HLS pragmas (e.g., #pragma HLS dataflow, #pragma HLS pipeline II=1, #pragma HLS array_partition), generates vendor toolchain scripts, and writes test benches for simulation/cosimulation. hls4ml supports direct flow for FPGAs via Vivado/Vitis (Xilinx), Intel oneAPI/Quartus, and Catapult HLS (Siemens) for both FPGA and hierarchical ASIC flows (Schulte et al., 1 Dec 2025, Fahim et al., 2021, Curzel et al., 2021).
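The backend is driven from Python as well; a sketch follows, reusing the `hls_model` object from the earlier conversion example. The `build` flags shown are those of the Vivado/Vitis backends, and `read_vivado_report` applies to those backends; other backends accept a similar but not identical set of options.

```python
# Run C simulation and C synthesis through the vendor HLS tool wrapped by
# the selected backend; cosim would additionally run RTL co-simulation
# against the generated test bench.
hls_model.build(csim=True, synth=True, cosim=False, export=False)

# Parse latency and resource figures from the vendor reports.
hls4ml.report.read_vivado_report('hls4ml_prj')
```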

2. Hardware Mapping Strategies: Dataflow, Pipelining, and Quantization

The hardware designs produced by hls4ml are deeply pipelined, dataflow-oriented accelerators that maintain all network parameters and activations in on-chip memory—completely obviating DRAM access in real-time applications. Each major operator (Dense, Conv2D, GRU, LSTM, MultiHeadAttention, Pooling, BatchNorm, etc.) is instantiated as an HLS module, connected via streaming interfaces (e.g., hls::stream). Layer-to-layer streaming and pipelining are orchestrated with HLS dataflow pragmas to parallelize both intra- and inter-layer computation.
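Whether layers are connected through fully parallel interfaces or streaming FIFOs is chosen at conversion time via the `io_type` option; a sketch, reusing `model` and `config` from the earlier examples:

```python
# io_type='io_stream' connects layers through hls::stream FIFOs and is the
# usual choice for CNNs and larger models; 'io_parallel' exposes flat
# arrays and suits small, ultra-low-latency MLPs.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    io_type='io_stream',
    output_dir='hls4ml_prj_stream',
)
```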

Precision across weights, activations, and accumulators is flexibly controlled at the per-layer level through fixed-point (ap_fixed<W,I>), binary, ternary, or even custom floating-point datatypes via QKeras, HGQ, or ONNX/QONNX quantizer metadata (Schulte et al., 1 Dec 2025, Guglielmo et al., 2020, Campos et al., 2023). hls4ml supports explicit configuration of accumulator widths using analytical formulas incorporating input/output shapes and quantizer bit-widths to guarantee no overflow during MAC operations.
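The underlying bound is the standard one for a fixed-point dot product: summing n products of a Wx-bit input and a Ww-bit weight needs roughly Wx + Ww + ceil(log2(n)) accumulator bits. The helper below is an illustrative sketch of that bound only; hls4ml's internal precision propagation additionally tracks integer/fractional splits and signedness.

```python
import math

def conservative_acc_bits(w_input: int, w_weight: int, n_accum: int) -> int:
    """Bit width that cannot overflow when summing n_accum products of a
    w_input-bit value and a w_weight-bit value (worst-case fixed point)."""
    return w_input + w_weight + math.ceil(math.log2(n_accum))

# Example: 8-bit activations, 6-bit weights, a Dense layer with 64 inputs.
print(conservative_acc_bits(8, 6, 64))  # -> 20
```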

Key backend strategies include (a configuration sketch follows the list):

  • Latency strategy: maximum parallelization/unrolling (Reuse factor RF=1, II=1), minimized inference latency, maximal instantaneous resource usage (Schulte et al., 1 Dec 2025).
  • Resource strategy: increased reuse factor, time-multiplexed multiplication pipeline, II>1, reduced DSP/LUT/BRAM consumption.
  • Distributed Arithmetic (DA): via the integrated da4ml algorithm, replacing constant-matrix–vector multiplies with shift-and-add LUT logic, trading off DSPs for LUTs and enabling resource reduction up to 30% for highly quantized models (Sun et al., 6 Jul 2025).
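A sketch of selecting among these strategies in the Python configuration: the 'Latency' and 'Resource' strings follow hls4ml's documented schema, while the exact option name for the distributed-arithmetic strategy, and whether it is set globally or per layer, depends on the installed hls4ml/da4ml version.

```python
# Fully unrolled: lowest latency, highest instantaneous resource use.
config['Model']['Strategy'] = 'Latency'
config['Model']['ReuseFactor'] = 1

# Alternatively, time-multiplex the MAC units (II > 1) to save DSPs/LUTs:
# config['Model']['Strategy'] = 'Resource'
# config['Model']['ReuseFactor'] = 16

# Or trade DSPs for shift-and-add LUT logic on heavily quantized layers
# (exact option name is version-dependent):
# config['Model']['Strategy'] = 'Distributed Arithmetic'

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, output_dir='hls4ml_prj_strategy'
)
```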

3. Neural Network Types and Supported Model Classes

hls4ml supports a broad spectrum of neural architectures:

  • Dense/MLP: Fully connected networks with per-layer quantization and pruning—binary/ternary and hybrid quantization are natively supported with automatic resource elimination of DSPs when possible (Guglielmo et al., 2020).
  • Convolutional Neural Networks (CNNs): Streaming 1D/2D/3D convolutions using hardware-efficient shift-register line buffers, supporting compressed and quantized kernels with automated BRAM/LUT balancing; see semantic segmentation with ENet on ZCU102 at sub-5 ms latencies (Ghielmetti et al., 2022, Aarrestad et al., 2021).
  • Recurrent Neural Networks: LSTM and GRU with user-controlled static/non-static modes and reuse factors, enabling latency-resource trade-off and achieving <2 μs inference for jet tagging and other scientific workloads (Khoda et al., 2022).
  • Graph Neural Networks: Automated lowering of Keras/ONNX GNN topologies, including message-passing–style interaction networks for charged particle track reconstruction (<1 μs latency on Kintex UltraScale) (Heintz et al., 2020).
  • Transformer Architectures: Multi-head attention, softmax, and (optionally) layer normalization, with specialized LUT-based implementations for softmax and exp/inverse, enabling sub-2 μs inference on UltraScale+; arbitrary Keras/ONNX transformers can be mapped by adjusting per-layer configuration (Jiang et al., 1 Feb 2024, Jiang et al., 8 Sep 2024).
  • Boosted Decision Trees: Direct support for BDTs from scikit-learn/XGBoost/TMVA, instantiated as parallel combinational logic trees, with <100 ns latency for Level-1 triggers (Summers et al., 2020).

4. Model Compression, Quantization, and Design-Space Exploration

hls4ml is tightly coupled with quantization- and pruning-aware training workflows (a QKeras example follows the summary table below):

  • QAT/PTQ: Integration with QKeras, HGQ, Brevitas, and ONNX/QONNX allows both post-training and quantization-aware training, including asymmetric, uniform, heterogeneous, or Hessian-aware mixed-precision schemes (Campos et al., 2023, Schulte et al., 1 Dec 2025, Guglielmo et al., 2020).
  • Pruning: Model-optimizer passes support structured and unstructured sparsity. Parameters with magnitude below threshold are pruned and corresponding hardware computation is dropped (Fahim et al., 2021, Weitz et al., 9 Jan 2025).
  • Automated Design Sweeps: Python/YAML configuration variables allow global and per-layer sweeps of bit-width, reuse-factor (“parallelism” vs. “latency/resource”), precision, and resourcing strategies.
  • Surrogate Modeling: wa-hls4ml and rule4ml provide neural-network and regression surrogate models for near-instantaneous prediction of LUT, DSP, FF, BRAM, and latency. Trained on >600,000 hls4ml-synthesized networks, they predict cycle counts and resource usage to within roughly 10–30% for prospective designs, enabling rapid architecture-hardware co-design (Hawks et al., 6 Nov 2025, Rahimifar et al., 9 Aug 2024); the table below summarizes their reported accuracy.
Surrogate Tool   Model Type        Metrics Predicted             R² Range     sMAPE Range
wa-hls4ml        GNN/Transformer   LUT, DSP, FF, BRAM, cycles    0.89–0.95    2.9–15.7% (test set)
rule4ml          MLP               LUT, DSP, FF, BRAM, cycles    0.80–0.98    10–30%
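A sketch of the quantization-aware-training path noted above, using QKeras quantizers that hls4ml reads directly from the layer metadata; the layer sizes are placeholders and the training loop is elided.

```python
from tensorflow.keras.models import Sequential
from qkeras import QDense, QActivation, quantized_bits, quantized_relu
import hls4ml

# A small MLP whose weights and activations are constrained to 6 bits
# during training.
qmodel = Sequential([
    QDense(32, input_shape=(16,),
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
    QActivation(quantized_relu(6)),
    QDense(5, kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
])
qmodel.compile(optimizer='adam', loss='categorical_crossentropy')
# ... quantization-aware training on the target dataset goes here ...

# hls4ml picks up the QKeras quantizers and derives per-layer precisions.
config = hls4ml.utils.config_from_keras_model(qmodel, granularity='name')
hls_model = hls4ml.converters.convert_from_keras_model(
    qmodel, hls_config=config, output_dir='hls4ml_prj_qkeras'
)
hls_model.compile()
```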

5. Device Backends, Performance, and Supported Applications

hls4ml supports FPGAs (Xilinx: Vivado HLS, Vitis HLS; Intel: oneAPI/Quartus HLS) and ASIC flows (Siemens Catapult HLS). Each backend supplies tailored HLS templates for dataflow-pipelined inference, with interface generators for both parallel and streaming protocols (AXI4-Stream, oneAPI pipes, custom memory-mapped buffers). By combining platform-specific pragmas (inserted automatically by each backend) with portable precision types (ap_fixed, ac_fixed), device support remains extensible (e.g., Alveo, Zynq, UltraScale+, Arria/Agilex, and ASICs) (Schulte et al., 1 Dec 2025, Curzel et al., 2021).

Reported performance spans the application classes summarized above: sub-microsecond dense and graph-network inference for trigger-level jet tagging and charged-particle track reconstruction, <2 μs recurrent inference, sub-2 μs transformer inference on UltraScale+ devices, <100 ns boosted-decision-tree evaluation for Level-1 triggers, and sub-5 ms ENet semantic segmentation on the ZCU102. These figures underpin the scientific deployments hls4ml has enabled, from collider trigger systems to embedded instrumentation and edge applications (Schulte et al., 1 Dec 2025, Fahim et al., 2021).

6. Extensibility, Portability, and Research Ecosystem

hls4ml is designed for extensibility. New layer types, quantizers, and optimization passes can be added via an Extension API at the IR and backend-template level (Schulte et al., 1 Dec 2025). Conversion routines and resource-estimation flows are developed alongside higher-level ecosystem tools such as wa-hls4ml (benchmarking and surrogate prediction), rule4ml (fast regression predictors), and da4ml (distributed arithmetic), together with ongoing improvements to backend portability (constexpr-based LUTs, custom floating-point support, ac_fixed/ac_int, and multi-vendor pragma abstraction) (Sun et al., 6 Jul 2025, Curzel et al., 2021, Hawks et al., 6 Nov 2025, Rahimifar et al., 9 Aug 2024).

hls4ml has demonstrated robust support for a wide range of real-world ML workloads under severe hardware and latency constraints, facilitating the deployment of compressed, quantized, and low-power ML on reconfigurable and custom hardware in scientific and industrial contexts. Its contributions are documented in open-source repositories and benchmarked across contemporary accelerator toolchains (Schulte et al., 1 Dec 2025, Fahim et al., 2021, Hawks et al., 6 Nov 2025).

7. Limitations, Ongoing Development, and Outlook

Key limitations historically identified include backend specialization (Vivado-only pragmas, Xilinx-specific datatypes), limited support for custom floating-point types, and the HLS code size and synthesis memory footprint of large models. Recent work addresses these limitations by “de-specializing” the codebase, introducing portable LUT generators, ac_types-based datatypes, and hierarchical graph partitioning for large designs (Curzel et al., 2021, Schulte et al., 1 Dec 2025). True architectural extrapolation in surrogate predictors remains an open challenge: GNN and transformer surrogates outperform simple MLP baselines, but generalization to unseen topologies is an active research area (Hawks et al., 6 Nov 2025, Rahimifar et al., 9 Aug 2024).

Future work is expected to further automate optimal architecture-platform co-design, expand ASIC and embedded software integration, support more complex models (e.g., larger transformers, SSMs), and foster closer integration with quantization/pruning-aware training and neural architecture search workflows (Weitz et al., 9 Jan 2025, Schulte et al., 1 Dec 2025).

