hls4ml Platform for FPGA ML Acceleration
- hls4ml is an open-source platform that transforms diverse ML models into high-level synthesis code, enabling FPGA and ASIC hardware acceleration.
- The platform supports multiple ML frameworks and toolchains, incorporating graph-level optimization and pipelined code generation for minimal latency and efficient resource usage.
- By leveraging techniques such as quantization, pruning, and precision tailoring, hls4ml enables scalable, low-power, high-performance inference for scientific and industrial applications.
hls4ml is an open-source platform designed to automate the translation of trained ML models, including deep neural networks (DNNs), boosted decision trees (BDTs), convolutional architectures, and transformers, into synthesizable high-level synthesis (HLS) code optimized for deployment on field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). Featuring a Python-fronted workflow, hls4ml enables practitioners from various scientific and engineering domains to convert high-level models—trained in frameworks such as TensorFlow, Keras, PyTorch, scikit-learn, and XGBoost—into deeply pipelined, low-latency hardware accelerators. The platform supports multiple toolchains and hardware vendors, including Xilinx Vivado/Vitis HLS, Intel Quartus HLS/oneAPI, and Siemens Catapult HLS, and features extensive controls over precision, parallelism, resource usage, and integration style, making it suitable for stringent latency, throughput, and power constraints encountered in scientific triggering, edge inference, and real-time control (Schulte et al., 1 Dec 2025, Summers et al., 2020, Fahim et al., 2021, Aarrestad et al., 2021, Khoda et al., 2022, Jiang et al., 8 Sep 2024).
1. Architecture and End-to-End Workflow
hls4ml is structured as a model compiler, supporting the following staged workflow; a minimal Python sketch of the end-to-end flow is given after the list:
- Front-End Model Import: Trained models are parsed from supported frameworks (Keras, QKeras, PyTorch, ONNX, scikit-learn, XGBoost, TMVA) into an intermediate representation (IR). Layer topology, weights, activations, and quantization metadata are extracted by framework-specific handlers (Schulte et al., 1 Dec 2025, Khoda et al., 2022, Summers et al., 2020).
- Graph-Level Optimization: The IR undergoes optimizer passes such as precision propagation, batch normalization fusion, data layout transformation (e.g., channels-last enforcement), and FIFO-depth optimization. BatchNorm and activation layers can be fused to minimize arithmetic and I/O overhead (Fahim et al., 2021, Schulte et al., 1 Dec 2025).
- Back-End Code Generation: The optimized IR is lowered to HLS C++ (or SystemC) kernels using layer-specific templates annotated with vendor pragmas (e.g., #pragma HLS PIPELINE II=1, #pragma HLS DATAFLOW, #pragma HLS ARRAY_PARTITION) that direct pipelining, parallelization, and memory mapping. For large models, hls4ml supports partitioning into subgraphs for parallel HLS synthesis (Schulte et al., 1 Dec 2025, Aarrestad et al., 2021, Curzel et al., 2021).
- HLS Synthesis & FPGA Integration: The generated HLS project is synthesized by the target vendor toolchain (Vivado, Vitis, Quartus, Catapult), producing an RTL netlist and IP core. The IP is wrapped with AXI-Lite or AXI-Stream interfaces for integration into full designs and SoC shells (Schulte et al., 1 Dec 2025, Shi et al., 2023, Giri et al., 2020).
- Post-HLS Evaluation: Vivado, Quartus, or Catapult produce detailed reports on initiation interval (II), latency, resource utilization (LUTs, FFs, DSPs, BRAMs), and timing. These can be correlated with the user's configuration parameters to enable iterative optimization (Summers et al., 2020, Fahim et al., 2021).
This workflow is extensible to ASIC through a Catapult HLS backend, enabling direct power, area, and timing closure in digital flows (Fahim et al., 2021).
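The steps above can be driven entirely from Python. The following is a minimal, hedged sketch of the flow for a trained Keras model; the part number, file names, and configuration values are illustrative, and exact keyword arguments may vary between hls4ml versions.

```python
import hls4ml
from tensorflow import keras

# Load a previously trained model (hypothetical file name).
model = keras.models.load_model("my_model.h5")

# Derive a baseline configuration: per-layer precision, reuse factor, strategy.
config = hls4ml.utils.config_from_keras_model(model, granularity="name")
config["Model"]["Precision"] = "ap_fixed<16,6>"  # global default fixed-point type
config["Model"]["ReuseFactor"] = 1               # fully parallel, targets II=1

# Lower the model to an HLS project for the chosen backend and device.
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend="Vitis",               # alternatives: 'Vivado', 'Quartus', 'Catapult', ...
    part="xcvu13p-flga2577-2-e",   # illustrative UltraScale+ part
    output_dir="my_model_hls4ml",
)

hls_model.compile()                       # build a bit-accurate C emulation library
# y_hls = hls_model.predict(x_test)       # cross-check against the Keras model
hls_model.build(csim=False, synth=True)   # invoke the vendor HLS toolchain
hls4ml.report.read_vivado_report("my_model_hls4ml")  # latency, II, resource summary
```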
2. Supported Model Types and Algorithmic Features
hls4ml accommodates a wide spectrum of ML models:
- Fully Connected Networks (MLPs): Layer templates implement matrix-vector multiplication and activation functions, enabling inference latencies as low as 10–50 ns for small networks on UltraScale+ FPGAs (Schulte et al., 1 Dec 2025, Fahim et al., 2021).
- Convolutional Neural Networks (CNNs): Streaming convolutional layers use line-buffered architectures, fully pipelined to II=1, with support for aggressive quantization and pruning; example ENet models achieve <5 ms per image at <30% resource usage on a ZCU102 (Aarrestad et al., 2021, Ghielmetti et al., 2022). A streaming-conversion sketch follows this list.
- Boosted Decision Trees: hls4ml performs tree-wise conversion to unrolled comparator logic, LUT-based leaf scoring, and ensemble summation via balanced binary adder trees. Benchmarks demonstrate <100 ns latencies and <10% LUT usage for 100-tree classifiers (Summers et al., 2020).
- Recurrent Neural Networks (LSTM, GRU): Gate equations and state updates are realized using pipelined matrix-vector kernels and lookup-table activations. Both static (minimal resource) and non-static (parallel timestep) modes are available (Khoda et al., 2022).
- Transformer Architectures: Multi-head scaled dot-product attention, softmax, and layer normalization are mapped into pipeline stages using per-layer fixed-point arithmetic, yielding sub-2 µs latencies for moderate sequence lengths on a VU13P (Jiang et al., 8 Sep 2024).
- Binary and Ternary Networks: Bit-packed arithmetic, XNOR-popcount computation, and thresholded batch norm are supported, reducing DSP usage to zero for many models with modest accuracy loss (Guglielmo et al., 2020).
Layer libraries include Dense, Conv1D/2D, Pooling, BatchNorm, LayerNorm, MultiHeadAttention, Einsum, and custom extension APIs (Schulte et al., 1 Dec 2025, Jiang et al., 8 Sep 2024).
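As an illustration of the streaming CNN path described above, the hedged sketch below converts a trained Keras CNN with io_type='io_stream' so that convolutions are implemented with line buffers and streaming interfaces; the device, strategy, and reuse-factor choices are illustrative.

```python
import hls4ml
from tensorflow import keras

cnn = keras.models.load_model("small_cnn.h5")  # hypothetical trained CNN

config = hls4ml.utils.config_from_keras_model(cnn, granularity="name")
config["Model"]["Precision"] = "ap_fixed<16,6>"
config["Model"]["Strategy"] = "Resource"   # time-multiplex multipliers via reuse
config["Model"]["ReuseFactor"] = 8

hls_cnn = hls4ml.converters.convert_from_keras_model(
    cnn,
    hls_config=config,
    io_type="io_stream",            # streaming interfaces + line-buffered convolutions
    backend="Vitis",
    part="xczu9eg-ffvb1156-2-e",    # illustrative ZCU102 device
    output_dir="cnn_hls4ml",
)
hls_cnn.compile()
```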
3. Precision, Quantization, and Compression Mechanisms
hls4ml provides advanced support for model compression and precision tailoring:
- Fixed-Point Arithmetic: Arbitrary bit-widths (ap_fixed<total, integer>) are configured globally or per-layer. Precision propagation uses interval analysis and profiling to avoid overflow (Aarrestad et al., 2021, Schulte et al., 1 Dec 2025).
- Quantization-Aware Training (QAT): Integration with QKeras, HGQ, Brevitas, and AutoQKeras enables post-training and in-training quantization, deploying models with as few as 2–8 bits per parameter without significant accuracy degradation (Fahim et al., 2021, Ghielmetti et al., 2022, Guglielmo et al., 24 Jan 2025). A QKeras sketch follows this list.
- Pruning: Magnitude-based and lottery ticket-style structured pruning set weights to zero, enabling logic and DSP savings commensurate with the degree of sparsity (Fahim et al., 2021, Aarrestad et al., 2021).
- Binary/Ternary Models: XNOR-popcount implementations for binary networks eliminate DSP usage, using LUTs and threshold-based batch norm fusion. Hybrid models (partial full-precision) optimize trade-offs between accuracy and resources (Guglielmo et al., 2020).
- Heterogeneous Quantization: AutoQKeras allows per-block or per-layer bit-width assignment, leveraging Bayesian optimization to maximize mIoU and minimize power (Ghielmetti et al., 2022).
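A minimal QAT sketch with QKeras is shown below; the layer sizes and bit-widths are illustrative, and the quantizer metadata is picked up automatically when the trained model is passed through the usual hls4ml conversion call.

```python
from tensorflow import keras
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

# Quantization-aware model: 6-bit weights and activations (illustrative widths).
model = keras.Sequential([
    keras.layers.Input(shape=(16,)),
    QDense(32,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
    QActivation(quantized_relu(6)),
    QDense(5,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1)),
    keras.layers.Activation("softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# ...train as usual, then convert with hls4ml.converters.convert_from_keras_model(...)
```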
Best practice is to start from 16–18-bit fixed-point precision to broadly preserve accuracy, then apply pruning and per-layer precision scans to reduce resource usage, as in the sketch below.
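The hedged sketch below illustrates such per-layer overrides on a configuration produced with granularity='name'; the layer names and the exact nesting of precision keys are illustrative and may differ between hls4ml versions.

```python
# Assumes: config = hls4ml.utils.config_from_keras_model(model, granularity="name")
config["Model"]["Precision"] = "ap_fixed<16,6>"          # conservative starting point

# After profiling, tighten selected layers ('dense_1', 'output' are illustrative names).
config["LayerName"]["dense_1"]["Precision"]["weight"] = "ap_fixed<8,2>"
config["LayerName"]["dense_1"]["Precision"]["result"] = "ap_fixed<12,4>"
config["LayerName"]["dense_1"]["ReuseFactor"] = 4        # trade extra latency for fewer DSPs

config["LayerName"]["output"]["Precision"] = "ap_fixed<16,6>"  # keep the output head wide
```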
4. Performance, Resource Utilization, and Scalability
hls4ml achieves ultra-low latency and tunable resource allocation:
- Latency: Inference times range from 10 ns for small fully connected networks to microseconds for moderate CNNs and transformers, depending on reuse factor and model parallelism (Schulte et al., 1 Dec 2025, Summers et al., 2020, Jiang et al., 8 Sep 2024, Aarrestad et al., 2021).
- Initiation Interval (II): Designs are aggressively pipelined (typically II=1) using pipeline and dataflow pragmas. Reuse factor trades off parallelism for reduced resource consumption at higher latency (Fahim et al., 2021, Aarrestad et al., 2021).
- Resource Scaling Laws: LUT, DSP, and BRAM usage scales with layerwise MAC counts, bit-width, and reuse factor; empirical fits (e.g., for BDTs) guide design sweeps (Summers et al., 2020). A back-of-the-envelope sketch follows this list.
- Device Targets: Supported FPGAs include Xilinx UltraScale+, Kintex, Alveo, Zynq, Intel Arria/Agilex, and experimental ASICs via Catapult (Schulte et al., 1 Dec 2025, Fahim et al., 2021, Khoda et al., 2022, Giri et al., 2020).
- Memory Optimization: FIFO depths are trimmed by post-synthesis simulation, line-buffered convolution reduces BRAM, and full streaming ensures all processing is on-chip (Ghielmetti et al., 2022, Aarrestad et al., 2021).
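The hedged sketch below gives a back-of-the-envelope estimate of the reuse-factor trade-off for a single dense layer; the formulas are simplified first-order approximations, and actual figures depend on bit-width, device, and the HLS scheduler.

```python
def dense_layer_estimate(n_in, n_out, reuse_factor, pipeline_depth=5):
    """Rough first-order model: each multiplier is shared reuse_factor times."""
    multiplications = n_in * n_out
    dsps = multiplications // reuse_factor           # multipliers instantiated in parallel
    latency_cycles = reuse_factor + pipeline_depth   # serialization plus pipeline fill
    return dsps, latency_cycles

for rf in (1, 2, 4, 8):
    dsps, cycles = dense_layer_estimate(n_in=64, n_out=32, reuse_factor=rf)
    print(f"RF={rf}: ~{dsps} DSPs, ~{cycles} cycles (~{cycles * 5} ns at 200 MHz)")
```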
5. Integration, Portability, and Ecosystem
hls4ml is modular and extensible, supporting diverse deployment models:
- Tool Integration: Python APIs enable model import, configuration, and resource/latency profiling. Codegen supports multiple HLS backends with macro-wrapped pragmas for vendor agnosticism (Curzel et al., 2021, Schulte et al., 1 Dec 2025).
- System-Level Design: With ESP4ML, hls4ml-generated accelerators can be embedded in heterogeneous SoCs, equipped with DMA/P2P interfaces, and orchestrated under Linux runtimes for multi-tile, energy-efficient pipelines (Giri et al., 2020).
- Custom Extensions: API provisions for user-defined layer templates, resource-aware pruning, integration into custom SoC shells, and per-layer overrides for device adaptation (Fahim et al., 2021, Schulte et al., 1 Dec 2025, Curzel et al., 2021).
- Deployment Modalities: AXI-Lite, AXI-Stream, BRAM, and memory-mapped interfaces facilitate integration into high-throughput, real-time, and edge environments (Shi et al., 2023, Guglielmo et al., 24 Jan 2025, Ghielmetti et al., 2022). An accelerator-backend sketch follows this list.
- Visualization and Profiling: Utilities for bit-width assignment, model graph visualization, and weight distribution histograms support performance–area–power exploration (Fahim et al., 2021).
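The hedged sketch below targets the VivadoAccelerator backend, which wraps the generated IP with AXI interfaces and a board-specific shell; the board name, file name, and build options are illustrative and may differ between hls4ml versions.

```python
import hls4ml
from tensorflow import keras

model = keras.models.load_model("my_model.h5")   # hypothetical trained model

config = hls4ml.utils.config_from_keras_model(model, granularity="model")

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    backend="VivadoAccelerator",   # emits an AXI-wrapped IP plus a board integration flow
    board="pynq-z2",               # illustrative Zynq SoC target
    io_type="io_stream",
    output_dir="accel_hls4ml",
)
hls_model.build(csim=False, export=True, bitfile=True)  # export IP and build a bitstream
```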
6. Scientific and Industrial Applications
hls4ml is extensively validated in physics and commercial domains:
- High-Energy Physics: LHC Level-1 triggers implement jet tagging, muon momentum regression, and convolutional autoencoder data compression at MHz rates and sub-μs latencies (Schulte et al., 1 Dec 2025, Fahim et al., 2021, Summers et al., 2020).
- Quantum Computing: QICK integration for qubit readout achieves 32 ns latency and 96% single-shot fidelity on UltraScale+ RFSoC (Guglielmo et al., 24 Jan 2025).
- Autonomous Systems: Real-time semantic segmentation at 3–4.9 ms/image for vehicle perception, with sub-30% device utilization (Ghielmetti et al., 2022).
- Edge and IoT: Wildlife filtering, industrial vision, cloud infrastructure, and cell-sorting exploit low-power, deeply pipelined inference architectures (Fahim et al., 2021, Schulte et al., 1 Dec 2025).
- Co-Design Pipelines: ESP4ML automates platform-based SoC synthesis with integrated ML accelerators, supporting both ML and classical DSP kernels in a heterogeneous tile-based network-on-chip (Giri et al., 2020).
7. Limitations and Future Directions
hls4ml’s established strengths include rapid design space exploration, compatibility with multiple frameworks and toolchains, and deep integration with scientific workflows. Identified limitations include:
- Scalability: Very large models may exceed on-chip memory and synthesis capacities; partitioning strategies and HBM2 support are under development (Curzel et al., 2021).
- Precision Support: Precision handling is primarily fixed-point; support for custom floating-point formats has been proposed to cover domains requiring high dynamic range (Curzel et al., 2021).
- Vendor Lock-In: While de-specialization efforts improve portability, certain activation and memory interfaces still rely on vendor-specific pragmas (Curzel et al., 2021, Schulte et al., 1 Dec 2025).
- Sparsity and Structured Pruning: Further architectural support for zero-skipping and hardware-aware sparsification is in development for more aggressive resource reduction (Fahim et al., 2021).
Ongoing work targets extension to transformer variants, support for causal masking in attention, dynamic per-layer bit-width search, and runtime reconfiguration, aiming to preserve hls4ml’s usability while broadening its applicability across the rapidly evolving landscape of accelerator design (Schulte et al., 1 Dec 2025, Jiang et al., 8 Sep 2024, Curzel et al., 2021).