hls4ml: Hardware-Aware ML Compiler
- hls4ml is an open-source platform that transforms trained machine learning models into hardware-amenable HLS code for FPGAs and ASICs.
- The framework applies hardware-aware optimizations such as quantization, pruning, and dataflow pipelining to achieve ultra-low latency inference.
- It enables rapid design-space exploration and resource estimation, supporting a wide range of neural network architectures for real-time machine learning applications.
hls4ml is an open-source software platform and hardware generation framework that automates the transformation of trained ML models into high-level synthesis (HLS) code suitable for deployment on field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Architecturally, hls4ml operates as a modular “compiler,” ingesting models from Keras (including QKeras), PyTorch, and ONNX/QONNX, applying hardware-aware optimizations, and emitting hardware-amenable C++ or SystemC ready for vendor HLS tools such as Vivado HLS, Intel oneAPI DPC++, and Catapult HLS. Its core strengths lie in supporting quantized and pruned neural networks, streaming dataflow architectures, sub-microsecond inference latencies, and the ability to target diverse platforms, making it a key tool for domains demanding extreme resource efficiency and real-time inference—most notably scientific instrumentation and edge computing applications (Schulte et al., 1 Dec 2025, Fahim et al., 2021).
1. System Architecture and Software Workflow
hls4ml is architected around a multi-stage compilation flow: front ends for parsing ML models, a sequence of hardware-oriented optimizations, and back ends for code generation targeting major HLS toolchains. The frontend parsers support Keras (all TensorFlow versions, QKeras quantizers), PyTorch (FX, Brevitas, HAWQ), and ONNX/QONNX, extracting weights, quantization parameters, graph topology, and layer configuration into a device-agnostic internal representation (“ModelGraph”).
Subsequent optimizer flows perform transformations such as affine+BatchNorm fusion, precision propagation, channel format transpositions, per-layer bitwidth inference, constant folding, streaming FIFO depth minimization (via RTL-level occupancy profiling), and the splitting of graphs for multi-component hardware synthesis. User configuration is provided as Python dictionaries or YAML files, setting parameters such as per-layer precision, reuse factors, dataflow vs. parallel I/O, distribution of resources, and synthesis strategies (‘Latency’, ‘Resource’, or ‘Distributed Arithmetic’).
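A minimal configuration sketch of this Python-dictionary interface, assuming a recent hls4ml release and a toy Keras model; the layer names, bit-widths, and reuse factors below are illustrative, not values from the cited works:

```python
import hls4ml
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Illustrative toy model; any trained Keras model is handled the same way.
model = Sequential([
    Dense(32, activation='relu', input_shape=(16,)),
    Dense(5, activation='softmax'),
])

# Auto-generate a configuration with per-layer ("name") granularity.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Global defaults: fixed-point precision, reuse factor, and synthesis strategy.
config['Model']['Precision'] = 'ap_fixed<16,6>'
config['Model']['ReuseFactor'] = 1
config['Model']['Strategy'] = 'Latency'

# Per-layer override ('dense' is whatever Keras happened to name the first layer);
# precision can likewise be overridden layer by layer.
config['LayerName']['dense']['ReuseFactor'] = 4
```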
The backend maps the IR to hand-optimized C++/SystemC templates for each supported layer, emits HLS pragmas (e.g., #pragma HLS dataflow, #pragma HLS pipeline II=1, #pragma HLS array_partition), generates vendor toolchain scripts, and writes test benches for simulation/cosimulation. hls4ml supports direct flow for FPGAs via Vivado/Vitis (Xilinx), Intel oneAPI/Quartus, and Catapult HLS (Siemens) for both FPGA and hierarchical ASIC flows (Schulte et al., 1 Dec 2025, Fahim et al., 2021, Curzel et al., 2021).
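Continuing the configuration sketch above, conversion, emulation, and synthesis are driven from Python; the backend name, FPGA part number, and output directory here are placeholder choices for illustration:

```python
import hls4ml  # `model` and `config` come from the configuration sketch above

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend='Vitis',              # or e.g. 'Vivado', 'Quartus', 'oneAPI', 'Catapult'
    io_type='io_stream',          # streaming dataflow I/O; 'io_parallel' for fully parallel
    part='xcu250-figd2104-2L-e',  # illustrative FPGA part number
    output_dir='hls4ml_prj',
)

hls_model.compile()                   # bit-accurate software emulation of the fixed-point design
# y_hls = hls_model.predict(X_test)   # compare against the floating-point Keras model

hls_model.build(synth=True)           # invoke the vendor HLS tool; reports latency and resources
```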
2. Hardware Mapping Strategies: Dataflow, Pipelining, and Quantization
The hardware designs produced by hls4ml are deeply pipelined, dataflow-oriented accelerators that maintain all network parameters and activations in on-chip memory—completely obviating DRAM access in real-time applications. Each major operator (Dense, Conv2D, GRU, LSTM, MultiHeadAttention, Pooling, BatchNorm, etc.) is instantiated as an HLS module, connected via streaming interfaces (e.g., hls::stream). Layer-to-layer streaming and pipelining are orchestrated with HLS dataflow pragmas to parallelize both intra- and inter-layer computation.
Precision across weights, activations, and accumulators is flexibly controlled at the per-layer level through fixed-point (ap_fixed<W,I>), binary, ternary, or even custom floating-point datatypes via QKeras, HGQ, or ONNX/QONNX quantizer metadata (Schulte et al., 1 Dec 2025, Guglielmo et al., 2020, Campos et al., 2023). hls4ml supports explicit configuration of accumulator widths using analytical formulas incorporating input/output shapes and quantizer bit-widths to guarantee no overflow during MAC operations.
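The no-overflow accumulator sizing mentioned above follows the standard bit-growth bound for a sum of products; a small sketch of that bound (the generic formula, not necessarily the exact expression used internally by hls4ml):

```python
import math

def accum_bits(w_data: int, w_weight: int, n_terms: int) -> int:
    """Upper bound on the accumulator width needed to sum n_terms products
    of w_data-bit and w_weight-bit fixed-point operands without overflow."""
    return w_data + w_weight + math.ceil(math.log2(n_terms))

# e.g. a dense layer with 64 inputs, 8-bit activations, 6-bit weights:
print(accum_bits(8, 6, 64))  # -> 20 bits suffice for the MAC accumulator
```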
Key backend strategies include:
- Latency strategy: maximum parallelization/unrolling (Reuse factor RF=1, II=1), minimized inference latency, maximal instantaneous resource usage (Schulte et al., 1 Dec 2025).
- Resource strategy: increased reuse factor, time-multiplexed multiplication pipeline, II>1, reduced DSP/LUT/BRAM consumption.
- Distributed Arithmetic (DA): via the integrated da4ml algorithm, replacing constant-matrix–vector multiplies with shift-and-add LUT logic, trading off DSPs for LUTs and enabling resource reduction up to 30% for highly quantized models (Sun et al., 6 Jul 2025).
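The DSP-to-LUT trade behind distributed arithmetic can be seen in toy form: multiplication by a compile-time constant reduces to shifts and adds, which map onto LUT/carry logic rather than DSP blocks. A conceptual sketch only, not the da4ml algorithm itself, which operates on whole constant matrices with common-subexpression sharing:

```python
def const_mult_shift_add(x: int, c: int) -> int:
    """Multiply x by a known constant c using only shifts and adds,
    standing in for the adder trees that distributed arithmetic emits in hardware."""
    acc, bit = 0, 0
    while c:
        if c & 1:
            acc += x << bit  # one adder per set bit of the constant
        c >>= 1
        bit += 1
    return acc

assert const_mult_shift_add(7, 13) == 7 * 13  # 13 = 0b1101 -> three adders, no multiplier
```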
3. Neural Network Types and Supported Model Classes
hls4ml supports a broad spectrum of neural architectures:
- Dense/MLP: Fully connected networks with per-layer quantization and pruning—binary/ternary and hybrid quantization are natively supported with automatic resource elimination of DSPs when possible (Guglielmo et al., 2020).
- Convolutional Neural Networks (CNNs): Streaming 1D/2D/3D convolutions using hardware-efficient shift-register line buffers, supporting compressed and quantized kernels with automated BRAM/LUT balancing; see semantic segmentation with ENet on ZCU102 at sub-5 ms latencies (Ghielmetti et al., 2022, Aarrestad et al., 2021).
- Recurrent Neural Networks: LSTM and GRU with user-controlled static/non-static modes and reuse factors, enabling latency-resource trade-off and achieving <2 μs inference for jet tagging and other scientific workloads (Khoda et al., 2022).
- Graph Neural Networks: Automated lowering of Keras/ONNX GNN topologies, including message-passing–style interaction networks for charged particle track reconstruction (<1 μs latency on Kintex UltraScale) (Heintz et al., 2020).
- Transformer Architectures: Multi-head attention, softmax, and (optionally) layer normalization, with specialized LUT-based implementations for softmax and exp/inverse, enabling sub-2 μs inference on UltraScale+; arbitrary Keras/ONNX transformers can be mapped by adjusting per-layer configuration (Jiang et al., 1 Feb 2024, Jiang et al., 8 Sep 2024).
- Boosted Decision Trees: Direct support for BDTs from scikit-learn/XGBoost/TMVA, instantiated as parallel combinational logic trees, with <100 ns latency for Level-1 triggers (Summers et al., 2020).
4. Model Compression, Quantization, and Design-Space Exploration
hls4ml is tightly coupled with quantization- and pruning-aware training workflows:
- QAT/PTQ: Integration with QKeras, HGQ, Brevitas, and ONNX/QONNX allows both post-training quantization and quantization-aware training, including asymmetric, uniform, heterogeneous, or Hessian-aware mixed-precision schemes (Campos et al., 2023, Schulte et al., 1 Dec 2025, Guglielmo et al., 2020); a minimal sketch of this path appears after the table below.
- Pruning: Model-optimizer passes support structured and unstructured sparsity. Parameters with magnitude below threshold are pruned and corresponding hardware computation is dropped (Fahim et al., 2021, Weitz et al., 9 Jan 2025).
- Automated Design Sweeps: Python/YAML configuration variables allow global and per-layer sweeps of bit-width, reuse-factor (“parallelism” vs. “latency/resource”), precision, and resourcing strategies.
- Surrogate Modeling: wa-hls4ml and rule4ml provide neural and regression surrogate models for instantaneous prediction of LUT, DSP, FF, BRAM, and latency. Trained on more than 600,000 synthesized hls4ml networks, they predict cycle counts and resource usage to within roughly 10–30% for prospective designs, enabling rapid architecture-hardware co-design (Hawks et al., 6 Nov 2025, Rahimifar et al., 9 Aug 2024).
| Surrogate Tool | Model Type | Metrics Predicted | R² Range | sMAPE Range |
|---|---|---|---|---|
| wa-hls4ml | GNN/Transformer | LUT, DSP, FF, BRAM, cycles | 0.89–0.95 | 2.9–15.7% (test set) |
| rule4ml | MLP | LUT, DSP, FF, BRAM, cycles | 0.8–0.98 | 10–30% |
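The QAT path referenced above can be sketched as follows, assuming QKeras is installed; the layer sizes and bit-widths are illustrative, not taken from the cited works:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

# Quantization-aware model: 6-bit weights and activations learned during training.
qmodel = Sequential([
    QDense(64, input_shape=(16,),
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0)),
    QActivation(quantized_relu(6)),
    QDense(5,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0)),
    Activation('softmax'),
])
qmodel.compile(optimizer='adam', loss='categorical_crossentropy')
# ... train as usual; when converting, hls4ml reads the quantizer metadata so the
# generated fixed-point datatypes match the bit-widths learned during QAT.
```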
5. Device Backends, Performance, and Supported Applications
hls4ml supports FPGAs (Xilinx: Vivado HLS, Vitis HLS; Intel: oneAPI/Quartus HLS), and ASIC flows (Siemens Catapult HLS). Each backend supplies tailored HLS templates for dataflow-pipelined inference, with interface generators for both parallel and streaming protocols (AXI4-Stream, OneAPI pipes, custom memory-mapped buffers). By leveraging platform-specific pragmas (automatic in backends) and portable precision types (ap_fixed, ac_fixed), device support is extensible (e.g., Alveo, Zynq, UltraScale+, Arria/Agilex, and ASICs) (Schulte et al., 1 Dec 2025, Curzel et al., 2021).
Reported performance metrics include:
- Resource-efficiency: Dense MLPs (5–100k parameters) at II=1 with 0–1% of chip DSPs via full latency-unrolling and distributed arithmetic (DA) (Sun et al., 6 Jul 2025, Schulte et al., 1 Dec 2025).
- Ultra-low latency: Sub-2 μs CNNs/Transformers; 10–100 ns for small MLPs, and <1 ms end-to-end for complex sensor-fusion architectures such as U-Net on Arria 10 (Aarrestad et al., 2021, Jiang et al., 8 Sep 2024, Shi et al., 2023).
- Energy/Throughput: O(10 ns) latency and roughly 200 M inferences/s for small MLPs at O(1 μJ) per inference on FPGAs; nanosecond-range latencies and nJ-scale energies on 28–65 nm ASICs (Fahim et al., 2021, Schulte et al., 1 Dec 2025).
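These throughput and latency figures follow directly from the pipeline parameters; a back-of-the-envelope check, assuming a 200 MHz clock and a 10-stage pipeline (both assumed for illustration, not figures from the cited works):

```python
f_clk = 200e6        # assumed clock frequency [Hz]
ii = 1               # initiation interval in clock cycles (Latency strategy)
pipeline_depth = 10  # illustrative number of pipeline stages

throughput = f_clk / ii           # 2.0e8 -> 200 M inferences per second
latency = pipeline_depth / f_clk  # 50 ns from input to output
print(f"{throughput:.2e} inf/s, {latency * 1e9:.0f} ns latency")
```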
hls4ml has enabled a variety of scientific deployments:
- Trigger-level physics ML: Level-1 collider triggers (jet tagging, muon momentum, anomaly detection) at 40 MHz, sub-100 ns budget (Summers et al., 2020, Shi et al., 2023, Campos et al., 2023).
- Edge control and sensor fusion: Real-time beam-loss tracking on Arria 10 (575 fps, 1.7 ms), multi-channel qubit readout (32 ns latency) (Shi et al., 2023, Guglielmo et al., 24 Jan 2025).
- Autonomous perception: Real-time multi-class semantic segmentation, e.g., ENet on ZCU102 at 4.8 ms/image, 3 W dynamic power (Ghielmetti et al., 2022).
6. Extensibility, Portability, and Research Ecosystem
hls4ml is designed for extensibility. New layer types, quantizers, and optimization passes can be added via an Extension API at the IR and backend template level (Schulte et al., 1 Dec 2025). Conversion routines and resource estimation flows are continually developed alongside higher-level ecosystem tools, such as wa-hls4ml (benchmark and surrogate prediction), rule4ml (fast regression predictors), da4ml (distributed arithmetic), and experimental improvements in backend portability (constexpr-based LUTs, custom floating-point support, ac_fixed/ac_int, and multi-vendor pragma abstraction) (Sun et al., 6 Jul 2025, Curzel et al., 2021, Hawks et al., 6 Nov 2025, Rahimifar et al., 9 Aug 2024).
hls4ml has demonstrated robust support for a wide range of real-world ML workloads under severe hardware and latency constraints, facilitating the deployment of compressed, quantized, and low-power ML on reconfigurable and custom hardware in scientific and industrial contexts. Its contributions are documented in open-source repositories and benchmarked across contemporary accelerator toolchains (Schulte et al., 1 Dec 2025, Fahim et al., 2021, Hawks et al., 6 Nov 2025).
7. Limitations, Ongoing Development, and Outlook
Key limitations historically identified include backend specialization (Vivado-only pragmas, Xilinx-specific types), limited support for custom FP types, and code size/memory for large models. Recent work addresses these limitations by “de-specializing” the codebase, introducing portable LUT generators, ac_types, and hierarchical graph partitioning for large designs (Curzel et al., 2021, Schulte et al., 1 Dec 2025). True architectural extrapolation in surrogate predictors remains an open challenge; GNN and transformer surrogates outperform simple MLP baselines, but generalization to unseen topologies is actively researched (Hawks et al., 6 Nov 2025, Rahimifar et al., 9 Aug 2024).
Future work is expected to further automate optimal architecture-platform co-design, expand ASIC and embedded software integration, support more complex models (e.g., larger transformers, SSMs), and foster closer integration with quantization/pruning-aware training and neural architecture search workflows (Weitz et al., 9 Jan 2025, Schulte et al., 1 Dec 2025).
References:
- (Schulte et al., 1 Dec 2025) "hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware"
- (Sun et al., 6 Jul 2025) "da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs"
- (Guglielmo et al., 2020) "Compressing deep neural networks on FPGAs to binary and ternary precision with HLS4ML"
- (Shi et al., 2023) "ML-based Real-Time Control at the Edge: An Approach Using hls4ml"
- (Ghielmetti et al., 2022) "Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml"
- (Khoda et al., 2022) "Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml"
- (Jiang et al., 1 Feb 2024) "Ultra Fast Transformers on FPGAs for Particle Physics Experiments"
- (Heintz et al., 2020) "Accelerated Charged Particle Tracking with Graph Neural Networks on FPGAs"
- (Aarrestad et al., 2021) "Fast convolutional neural networks on FPGAs with hls4ml"
- (Fahim et al., 2021) "hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices"
- (Guglielmo et al., 24 Jan 2025) "End-to-end workflow for machine learning-based qubit readout with QICK and hls4ml"
- (Weitz et al., 9 Jan 2025) "Neural Architecture Codesign for Fast Physics Applications"
- (Hawks et al., 6 Nov 2025) "wa-hls4ml: A Benchmark and Surrogate Models for hls4ml Resource and Latency Estimation"
- (Curzel et al., 2021) "De-specializing an HLS library for Deep Neural Networks: improvements upon hls4ml"
- (Rahimifar et al., 9 Aug 2024) "rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA"
- (Campos et al., 2023) "End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs"
- (Summers et al., 2020) "Fast inference of Boosted Decision Trees in FPGAs for particle physics"
- (Jiang et al., 8 Sep 2024) "Low Latency Transformer Inference on FPGAs for Physics Applications with hls4ml"