
ESP4ML: Automated SoC and Sparse ML PIM

Updated 11 December 2025
  • ESP4ML is a design platform that automates SoC assembly by integrating ML accelerators and sparse PIM to deliver energy-efficient, low-latency inference.
  • It employs hardware–software co-design with HLS4ML compilation, NoC orchestration, and dynamic pipeline assembly to optimize throughput and reduce DRAM energy.
  • FPGA prototypes and sparse transformer inference validate ESP4ML’s approach, achieving up to 127× speedup over GPUs and substantial energy savings.

ESP4ML refers to both: (1) a system-level design platform for automating the construction of systems-on-chip (SoCs) that integrate ML accelerators for embedded workloads, and (2) a set of hardware methodologies for efficient processing-in-memory (PIM) acceleration of sparse ML inference. Its core contributions span automated design flows, hardware–software co-design, network-on-chip orchestration, and fine-grained architectural optimizations for energy-efficient, low-latency ML computation. ESP4ML artifacts have been applied to FPGA prototype deployments and to the design of PIM accelerators targeting high-sparsity transformer inference. The acronym is also occasionally used to refer to end-to-end ML data-analysis toolkits (e.g., for EUSO-SPB2 cosmic ray data), but in hardware research, it most commonly denotes the open-source SoC design flow and sparse PIM innovations for ML acceleration (Giri et al., 2020; He et al., 6 Apr 2024; Filippatos et al., 2023).

1. Automated SoC Design Flow and Platform Integration

The ESP4ML design flow vertically integrates high-level ML models, hardware synthesis, NoC integration, embedded Linux software, and runtime APIs to produce heterogeneous SoCs for ML and signal processing (Giri et al., 2020). The workflow proceeds as follows:

  • Model Specification and Training: Users specify a Keras/PyTorch network and train the model externally.
  • HLS4ML Compilation: The trained model (JSON topology and HDF5 weights) is compiled by HLS4ML into parameterized C++.
  • Automatic Hardware Wrapping: ESP4ML generates a two-part wrapper: (i) HLS C++ (top function, LOAD/COMPUTE/STORE loops with DMA handshake), and (ii) an RTL adapter interfacing with the ESP accelerator-tile protocol via memory-mapped configuration and shallow FIFOs.
  • SoC Assembly: The ESP configuration GUI maps accelerator, processor (Ariane RISC-V with Linux), and memory tiles onto a 2D-mesh NoC and produces the bitstream.
  • Linux Image and Runtime: A prebuilt embedded Linux and ESP4ML runtime library give the designer a bootable SoC supporting the original ML application mapped to hardware accelerators.

With a single command, developers obtain a bootable, integrated SoC capable of running the ML workload on tightly coupled accelerators.

2. SoC Architecture, Data Movement, and Accelerator Integration

ESP4ML SoCs implement a 2D mesh of tiles linked by a packet-switched NoC with multiple orthogonal planes (32/64 bits per plane) (Giri et al., 2020). Each tile is one of:

  • Processor Tile: RISC-V Ariane CPU running Linux.
  • Memory Tile: Controller for off-chip DRAM.
  • Accelerator Tile: Holds custom accelerator logic, on-chip SRAM (privateBuffer), DMA engines, interrupt logic, and configuration registers.

Accelerators generated by HLS4ML are wrapped by a parameterized interface that bridges HLS interfaces (ap_fifo, control) and the ESP NoC protocol ("ap2p"). Each supports direct DMA to/from DRAM via dedicated NoC planes. For efficient dataflow, unused DMA queues are repurposed for point-to-point (p2p) transfers: an upstream tile forwards data only in response to a load request from the downstream tile, which guarantees downstream consumption before upstream forwarding (the "consumption assumption"). This throttling controls NoC congestion and enables dynamic pipeline assembly of accelerator stages.
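
A toy software model of this pull-based handshake, assuming a single upstream/downstream pair and a four-entry downstream FIFO (both illustrative choices, not ESP4ML's actual RTL):

#include <cstddef>
#include <cstdio>
#include <queue>

// Toy model of the "consumption assumption" on the p2p plane: the upstream
// tile forwards a token only when the downstream FIFO has room, i.e. only
// in response to a downstream load request. Depth and rates are invented.
int main() {
    std::queue<int> downstream_fifo;        // shallow FIFO in the next tile
    const std::size_t depth = 4;
    int forwarded = 0;

    for (int cycle = 0; cycle < 16; ++cycle) {
        if (downstream_fifo.size() < depth)     // load request pending
            downstream_fifo.push(forwarded++);  // upstream forwards a token
        if (cycle % 3 == 0 && !downstream_fifo.empty())
            downstream_fifo.pop();              // downstream consumes slowly
    }
    std::printf("forwarded %d tokens, %zu still buffered\n",
                forwarded, downstream_fifo.size());
}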

Hardware–Software Co-Design

HLS4ML-generated accelerators expose a configurable "reuse factor" $r$ (a multiplier-sharing trade-off) and are wrapped via an HLS template:

void TOP(word *out, word *in1, unsigned conf_size,
         dma_info_t *load_ctrl, dma_info_t *store_ctrl) {
  word _inbuff[IN_BUF];   // on-chip input buffer (accelerator-private SRAM)
  word _outbuff[OUT_BUF]; // on-chip output buffer
  const unsigned n_chunks = conf_size / IN_BUF; // assumed: chunk count derived
                                                // from the run-time size config
  for (unsigned i = 0; i < n_chunks; i++) {
    LOAD(_inbuff, in1, i, load_ctrl);     // DMA-read handshake with the NoC
    COMPUTE(_inbuff, _outbuff);           // HLS4ML-generated kernel
    STORE(_outbuff, out, i, store_ctrl, conf_size); // DMA-write handshake
  }
}

Resource usage and latency scale as $\mathrm{DSPs} \approx N_j / r$, $\mathrm{LUTs} \approx \beta \cdot r$, and $L_j(r) = \lceil N_j / r \rceil \cdot II + \alpha$, where $N_j$ is the number of multiplications in layer $j$, $II$ is the pipeline initiation interval, and $\alpha$, $\beta$ are implementation constants.
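
A quick numeric instance of these scaling laws; the values of $N_j$, $II$, $\alpha$, and $\beta$ below are placeholders, not numbers from the paper:

#include <cstdio>

// Sweep the reuse factor r and evaluate the resource/latency model above.
int main() {
    const unsigned N_j = 4096, II = 1, alpha = 10; // illustrative values
    const double beta = 12.0;
    for (unsigned r : {1u, 2u, 4u, 8u, 16u}) {
        unsigned dsps = (N_j + r - 1) / r;              // DSPs ~ N_j / r
        double   luts = beta * r;                       // LUTs ~ beta * r
        unsigned lat  = ((N_j + r - 1) / r) * II + alpha; // L_j(r)
        std::printf("r=%2u  DSPs~%4u  LUTs~%6.0f  L_j=%5u cycles\n",
                    r, dsps, luts, lat);
    }
}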

Embedded Software APIs

ESP4ML provides a Linux-kernel driver and user-space library. Key APIs:

API Function | Description
esp_alloc    | Allocate a DMA-coherent buffer
esp_run      | Launch a dataflow of N accelerator invocations; configures DMA/p2p
esp_cleanup  | Free all allocated resources

Technically, esp_run supports configuration of communication mode (DMA or p2p), transfer parameters, and pipeline topology. Multithreaded launch and hardware-level synchronization minimize software overhead for p2p, while DMA-only mode requires pthread barriers.
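
A hedged user-space sketch built around these three calls. The descriptor layout, device names, and signatures are assumptions (the source gives only the API names), and the trivial stubs stand in for the real ESP4ML library so the sketch compiles on its own:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct acc_cfg {             // hypothetical per-invocation descriptor
    const char *devname;     // accelerator device node
    bool        p2p_in;      // receive input over the p2p NoC plane
    bool        p2p_out;     // forward output directly to the next tile
    std::size_t nbytes;      // transfer size
};

void *esp_alloc(std::size_t n) { return std::malloc(n); }  // stub
void  esp_run(acc_cfg cfg[], unsigned n) {                 // stub
    for (unsigned i = 0; i < n; ++i)
        std::printf("run %s (p2p_in=%d, p2p_out=%d, %zu bytes)\n",
                    cfg[i].devname, cfg[i].p2p_in, cfg[i].p2p_out,
                    cfg[i].nbytes);
}
void  esp_cleanup() {}                                     // stub

int main() {
    const std::size_t frame = 32 * 32 * sizeof(std::uint16_t);
    void *buf = esp_alloc(2 * frame);  // DMA-coherent in the real library

    // Two-stage pipeline: stage 0 reads from DRAM, stage 1 consumes stage
    // 0's output over p2p and stores its result back to DRAM.
    acc_cfg pipeline[2] = {
        {"/dev/denoiser.0",   false, true,  frame},
        {"/dev/classifier.0", true,  false, frame},
    };
    esp_run(pipeline, 2);
    esp_cleanup();
    (void)buf;
}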

3. Network-on-Chip Traffic Shaping and Reconfigurable Pipelines

ESP4ML dynamically shapes NoC data traffic to maximize throughput and minimize DRAM accesses (Giri et al., 2020). For each pipeline, the runtime configures accelerator tiles by writing the P2P_REG, which encodes mode bits and source tile coordinates. Transfers are streamed over NoC DMA planes; p2p mode routes packets directly between accelerator tiles, whereas DMA mode goes tile → memory → tile.
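
For concreteness, a bit-packing sketch of such a configuration word; the field widths and positions below are invented for illustration, since the source states only that P2P_REG carries mode bits plus source tile coordinates:

#include <cstdint>
#include <cstdio>

// Hypothetical P2P_REG layout: one p2p-enable mode bit and 4-bit mesh
// coordinates of the source tile. Not the actual register encoding.
constexpr std::uint32_t p2p_reg(bool p2p_enable,
                                unsigned src_x, unsigned src_y) {
    return (std::uint32_t(p2p_enable) << 8)  // mode bit: p2p vs. DMA
         | ((src_y & 0xF) << 4)              // source tile row
         |  (src_x & 0xF);                   // source tile column
}

int main() {
    // Consume input directly from the accelerator tile at mesh (2, 1).
    std::printf("P2P_REG = 0x%08x\n", (unsigned)p2p_reg(true, 2, 1));
}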

The effective throughput of an $L$-stage pipeline is $T_{\mathrm{e2e}} \simeq \min_i T_i$, with per-stage throughput $T_i = W_{\mathrm{plane}} \cdot f_{\mathrm{clk}} / (H_i \cdot \delta)$, where $H_i$ is the Manhattan hop count and $\delta$ the per-hop NoC link delay. Empirically, enabling p2p reduces $H_i$ to 1–2 and achieves a 2–3× reduction in DRAM accesses, translating to proportional energy savings $E_{\mathrm{drop}} \simeq E_{\mathrm{mem\_access}} \cdot \Delta A_{\mathrm{mem}}$.
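
The model is easy to evaluate numerically; the sketch below uses the 78 MHz clock from Section 5 but assumes a 64-bit plane, per-hop delay $\delta = 2$ cycles, and hop counts, all illustrative rather than measured:

#include <algorithm>
#include <cstdio>

// Worked instance of T_i = W_plane * f_clk / (H_i * delta), with the
// end-to-end throughput taken as the minimum over pipeline stages.
int main() {
    const double w_plane = 64.0;        // bits per NoC plane (assumed)
    const double f_clk   = 78e6;        // FPGA clock from Section 5
    const double delta   = 2.0;         // cycles per hop (assumed)
    const int hops[]     = {2, 1, 4};   // Manhattan hop counts per stage

    double t_e2e = 1e300;
    for (int h : hops) {
        double t_i = w_plane * f_clk / (h * delta); // bits per second
        t_e2e = std::min(t_e2e, t_i);
    }
    std::printf("bottleneck throughput: %.2f Gbit/s\n", t_e2e / 1e9);
}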

Reconfigurable pipelines can use multi-threaded launch, with measured throughput improvements of 4–8× versus serial execution for ML benchmarks.

4. Efficient Sparse Processing-in-Memory for ML Inference

ESP4ML’s PIM contributions address the bandwidth, load-balancing, and control challenges imposed by high sparsity in transformer-style matrix-vector products (He et al., 6 Apr 2024). The system targets unstructured sparsity (80–90%) seen in pruned LLMs and operates within DRAM constraints (e.g., 16 banks, 256-bit bus, HBM2E-like timing).
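
For reference, the computation being accelerated is a sparse matrix-vector product; a minimal CSR implementation (plain software, not the PIM datapath) shows why only stored nonzeros cost MACs and memory traffic:

#include <cstddef>
#include <cstdio>
#include <vector>

// Sparse matrix-vector product in CSR form: one MAC per stored nonzero.
std::vector<float> spmv_csr(const std::vector<int> &rowptr,
                            const std::vector<int> &colidx,
                            const std::vector<float> &val,
                            const std::vector<float> &x) {
    std::vector<float> y(rowptr.size() - 1, 0.0f);
    for (std::size_t r = 0; r + 1 < rowptr.size(); ++r)
        for (int k = rowptr[r]; k < rowptr[r + 1]; ++k)
            y[r] += val[k] * x[colidx[k]];
    return y;
}

int main() {
    // 3x3 matrix with 3 nonzeros: [[2,0,0],[0,0,1],[0,3,0]]
    std::vector<int>   rowptr{0, 1, 2, 3}, colidx{0, 2, 1};
    std::vector<float> val{2, 1, 3}, x{1, 1, 1};
    for (float v : spmv_csr(rowptr, colidx, val, x))
        std::printf("%g ", v);                  // prints: 2 1 3
    std::printf("\n");
}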

Key Techniques

  • Fine-Grained Interleaving: Matrix nonzeros from $k$ consecutive rows are interleaved in DRAM, so each broadcast slice is reused $k$ times (in contrast to the coarse-grained Newton design). With $k=16$ and sparsity $s=0.9$, broadcast demand per column-read drops from 10 to 0.625; see the arithmetic sketch after this list.
  • Static Data-Dependent Scheduling (SDDS): Offline, data-dependent schedule generation orchestrates broadcasts, column-reads, and any necessary stall insertion. Schedules are merged across banks into a single global command stream, avoiding complex on-chip dynamic control.
  • Index/Value Decoupling and Prefetching: Index and value words are stored in separate DRAM rows. Index-only reads prime per-MAC "iFIFO" buffers (depth 8), and vector elements are prefetched into per-MAC "eFIFO" buffers. SDDS inserts stalls to prevent FIFO overflow or underflow.
  • Simplified Switch and Conflict Reduction: Instead of 16 × 16 crossbars, ESP4ML uses time-multiplexed $4 \times k$ switches per MAC, exploiting the $t_{\mathrm{CCD}} \geq 4$ cycle constraint. SDDS reorders nonzeros to minimize group conflicts.
  • Load Balancing: Rows are sorted by density; densest are paired with sparsest across banks, adopting the scheme from SparTen.
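
As referenced in the first bullet, the broadcast-demand arithmetic can be checked directly; the sketch assumes a 16-value DRAM column-read and 16-element broadcast slices, an interpretation consistent with the numbers quoted above:

#include <cstdio>

// Back-of-the-envelope check of the interleaving arithmetic.
int main() {
    const double s = 0.9;      // unstructured sparsity
    const int    k = 16;       // rows interleaved per layout group
    const int    col_read = 16; // nonzeros returned per DRAM column-read
    const int    slice = 16;    // vector elements per broadcast

    // Coarse-grained (one row per column-read): 16 nonzeros span roughly
    // col_read / (1 - s) = 160 columns, i.e. 10 distinct slices.
    double coarse = (col_read / (1.0 - s)) / slice;

    // Fine-grained: the same slice serves all k interleaved rows.
    double fine = coarse / k;

    std::printf("broadcasts per column-read: coarse=%.3f, fine=%.3f\n",
                coarse, fine);  // 10.000 and 0.625
}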

Performance and Efficiency

On simulated LLaMA-7B MV kernels (4096 × 4096, 4096 × 11008), ESP4ML achieves:

  • 2× speedup (up to 4.2×) over Newton dense PIM at 90% sparsity
  • 127× speedup over GPU on full LLaMA-7B benchmark
  • 34% average (up to 63%) DRAM+compute energy saving over Newton
  • Area overhead < 5% (sparse) to < 12% (hybrid) over Newton PIM

Ablations isolate the contributions of prefetching and conflict reduction: prefetching adds a 30% speedup at 80% sparsity, and group reordering cuts stall cycles by 40% at 90% sparsity.

5. FPGA Prototype, Applications, and Evaluation

ESP4ML SoCs were prototyped on Xilinx UltraScale+ FPGA at 78 MHz (Giri et al., 2020). Example pipelines include:

Design                | LUT (%) | FF (%) | BRAM (%) | Power (W) | Frames/s | Frames/J (F_E)
Night-Vision          | 48      | 24     | 57       | 1.70      | 35,572   | 20,925
Denoiser              | 48      | 24     | 57       | 1.70      | 5,220    | 3,070
Multi-tile Classifier | 19      | 11     | 21       | 0.98      | 28,376   | 28,900

Compared to Intel i7 and Jetson TX1, ESP4ML SoCs are up to two orders of magnitude more energy efficient on these ML benchmarks.

Enabling pipeline reconfiguration boosts throughput 4–8×, and p2p communication reduces DRAM traffic by a further 2–3×.

6. Extensions, Limitations, and Future Directions

ESP4ML’s automated flow, hardware-efficient architectures, and dataflow APIs constitute significant advances in embedded ML accelerator design.

  • Strengths: End-to-end automation, energy efficiency, tight NoC integration, runtime pipeline management, flexibility for both dense and sparse workloads, and quantitative gains on real ML tasks.
  • Limitations: Resource allocation (e.g., reuse factor selection), NoC scale, and hardware parameter tuning require further automation or manual adjustment. Public documentation of all acceleration templates is ongoing.
  • Directions: Systematic hyperparameter search (AutoML), geometry-aware or graph-CNN cores, multi-task network support, and end-to-end deep learning (from raw input to physics parameters) are planned (Filippatos et al., 2023; Giri et al., 2020).
  • Sparse PIM: Further area optimization of switch logic, deeper pipeline scaling, and balancing of energy vs. throughput under real LLM workloads remain open topics (He et al., 6 Apr 2024).

ESP4ML’s integration of automated SoC design, efficient accelerator orchestration, and advanced sparse PIM demonstrates the viability of scalable, high-throughput, low-energy ML accelerators across embedded and datacenter settings.
