ESP4ML: Automated SoC and Sparse ML PIM
- ESP4ML is a design platform that automates SoC assembly by integrating ML accelerators and sparse PIM to deliver energy-efficient, low-latency inference.
- It employs hardware–software co-design with HLS4ML compilation, NoC orchestration, and dynamic pipeline assembly to optimize throughput and reduce DRAM energy.
- FPGA prototypes and sparse transformer inference validate ESP4ML’s approach, achieving up to 127× speedup over GPUs and substantial energy savings.
ESP4ML refers to both: (1) a system-level design platform for automating the construction of systems-on-chip (SoCs) that integrate ML accelerators for embedded workloads, and (2) a set of hardware methodologies for efficient processing-in-memory (PIM) acceleration of sparse ML inference. Its core contributions span automated design flows, hardware–software co-design, network-on-chip orchestration, and fine-grained architectural optimizations for energy-efficient, low-latency ML computation. ESP4ML artifacts have been applied to FPGA prototype deployments and to the design of PIM accelerators targeting high-sparsity transformer inference. The acronym is also occasionally used to refer to end-to-end ML data-analysis toolkits (e.g., for EUSO-SPB2 cosmic ray data), but in hardware research, it most commonly denotes the open-source SoC design flow and sparse PIM innovations for ML acceleration (Giri et al., 2020, He et al., 6 Apr 2024, Filippatos et al., 2023).
1. Automated SoC Design Flow and Platform Integration
The ESP4ML design flow vertically integrates high-level ML models, hardware synthesis, NoC integration, embedded Linux software, and runtime APIs to produce heterogeneous SoCs for ML and signal processing (Giri et al., 2020). The workflow proceeds as follows:
- Model Specification and Training: Users specify a network in Keras or PyTorch and train it externally.
- HLS4ML Compilation: The trained model (JSON topology and HDF5 weights) is compiled by HLS4ML into parameterized C++.
- Automatic Hardware Wrapping: ESP4ML generates a two-part wrapper: (i) HLS C++ (top function, LOAD/COMPUTE/STORE loops with DMA handshake), and (ii) an RTL adapter interfacing with the ESP accelerator-tile protocol via memory-mapped configuration and shallow FIFOs.
- SoC Assembly: The ESP configuration GUI maps accelerator, processor (Ariane RISC-V with Linux), and memory tiles onto a 2D-mesh NoC and produces the bitstream.
- Linux Image and Runtime: A prebuilt embedded Linux and ESP4ML runtime library give the designer a bootable SoC supporting the original ML application mapped to hardware accelerators.
With a single command, developers obtain a bootable, integrated SoC capable of running the ML workload on tightly-coupled accelerators.
2. SoC Architecture, Data Movement, and Accelerator Integration
ESP4ML SoCs implement a 2D mesh of tiles linked by a packet-switched NoC with multiple orthogonal planes (32/64 bits per plane) (Giri et al., 2020). Each tile is one of:
- Processor Tile: RISC-V Ariane CPU running Linux.
- Memory Tile: Controller for off-chip DRAM.
- Accelerator Tile: Holds custom accelerator logic, on-chip SRAM (privateBuffer), DMA engines, interrupt logic, and configuration registers.
Accelerators generated by HLS4ML are wrapped by a parameterized interface that bridges the HLS-side interfaces (ap_fifo, control) to the ESP NoC protocol ("ap2p"). Each accelerator supports direct DMA to and from DRAM via dedicated NoC planes. For efficient dataflow, unused DMA queues are repurposed for point-to-point (p2p) transfers: a load request from the downstream tile triggers the upstream transfer, so data is forwarded only once the consumer is ready to accept it (the "consumption assumption"). This limits NoC congestion and enables dynamic assembly of accelerator stages into pipelines.
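To make the data-movement trade-off concrete, the following minimal C++ sketch compares NoC hop counts for the two transfer modes on a 2D mesh. The tile coordinates and three-tile placement are illustrative assumptions, not values from the paper:

```cpp
#include <cstdlib>
#include <iostream>

// Illustrative tile placement on a 2D-mesh NoC; coordinates are assumptions.
struct Tile { int x, y; };

// Manhattan distance = number of NoC hops between two tiles.
int hops(Tile a, Tile b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

int main() {
    Tile producer{0, 0}, memory{2, 1}, consumer{1, 0};

    // DMA mode: producer writes results to DRAM, consumer reads them back,
    // so the payload crosses the NoC twice and touches DRAM twice.
    int dma_hops = hops(producer, memory) + hops(memory, consumer);

    // p2p mode: the consumer's load request pulls data straight from the
    // producer tile; no DRAM round trip.
    int p2p_hops = hops(producer, consumer);

    std::cout << "DMA hops: " << dma_hops    // 3 + 2 = 5
              << ", p2p hops: " << p2p_hops  // 1
              << '\n';
}
```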
Hardware–Software Co-Design
HLS4ML-generated accelerators expose a configurable "reuse factor" (a multiplier-sharing trade-off) and are wrapped via an HLS template:
```cpp
TOP(word *out, word *in1, unsigned conf_size,
    dma_info_t *load_ctrl, dma_info_t *store_ctrl) {
  word _inbuff[IN_BUF];
  word _outbuff[OUT_BUF];
  for (unsigned i = 0; i < n_chunks; i++) {
    LOAD(_inbuff, in1, i, load_ctrl);
    COMPUTE(_inbuff, _outbuff); // HLS4ML kernel
    STORE(_outbuff, out, i, store_ctrl, conf_size);
  }
}
```
Resource usage and latency scale with the reuse factor $R$: the multiplier count shrinks as $N_{\text{mult}} \approx N_{\text{weights}}/R$, while latency and initiation interval grow proportionally to $R$.
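A worked illustration of this scaling, assuming the linear HLS4ML reuse-factor model and an arbitrary example layer size:

```cpp
#include <iostream>

// HLS4ML reuse-factor trade-off: R multiplications share one multiplier.
// Multiplier count shrinks as 1/R while the initiation interval grows as R.
void reuse_tradeoff(unsigned n_weights, unsigned R) {
    unsigned multipliers = (n_weights + R - 1) / R;  // ceil(n_weights / R)
    std::cout << "R=" << R << ": " << multipliers
              << " multipliers, initiation interval ~" << R << " cycles\n";
}

int main() {
    const unsigned n_weights = 4096;  // e.g., a 64x64 dense layer (example)
    for (unsigned R : {1u, 4u, 16u, 64u})
        reuse_tradeoff(n_weights, R); // R=1: 4096 mults ... R=64: 64 mults
}
```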
Embedded Software APIs
ESP4ML provides a Linux-kernel driver and user-space library. Key APIs:
| API Function | Description |
|---|---|
| esp_alloc | Allocate DMA-coherent buffer |
| esp_run | Launch a dataflow of N accelerator invocations; configures DMA/p2p |
| esp_cleanup | Free all allocated resources |
esp_run configures the communication mode (DMA or p2p), transfer parameters, and pipeline topology for each invocation. Multithreaded launch with hardware-level synchronization minimizes software overhead in p2p mode, whereas DMA-only mode requires pthread barriers.
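A minimal usage sketch of these APIs follows. The three entry points are those listed in the table; the configuration struct and its fields, and the accelerator names, are hypothetical simplifications (the real ESP descriptors carry more state), and the program would link against the ESP4ML runtime library:

```cpp
#include <cstddef>

// Hypothetical descriptor for one accelerator invocation; the actual ESP
// structs carry additional fields (tile coordinates, transfer sizes, etc.).
struct acc_cfg {
    const char *devname;   // accelerator device to invoke (illustrative name)
    void *src, *dst;       // DMA-coherent input/output buffers
    bool  p2p;             // stream from the previous stage instead of DRAM
};

// API surface from the table above (signatures simplified for illustration).
extern "C" void *esp_alloc(size_t size);
extern "C" void  esp_run(acc_cfg *pipeline, unsigned n_stages);
extern "C" void  esp_cleanup();

int main() {
    // One DMA-coherent buffer for pipeline input, one for its final output.
    void *in  = esp_alloc(1 << 20);
    void *out = esp_alloc(1 << 20);

    // Two-stage pipeline: a preprocessing filter feeding an HLS4ML
    // classifier, chained over p2p so intermediate data never touches DRAM.
    acc_cfg pipeline[] = {
        {"filter.0",     in,      nullptr, /*p2p=*/false},
        {"classifier.0", nullptr, out,     /*p2p=*/true},
    };
    esp_run(pipeline, 2);  // launches both stages, configures DMA/p2p

    esp_cleanup();         // releases every esp_alloc'd buffer
}
```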
3. Network-on-Chip Traffic Shaping and Reconfigurable Pipelines
ESP4ML dynamically shapes NoC data traffic to maximize throughput and minimize DRAM accesses (Giri et al., 2020). For each pipeline, the runtime configures accelerator tiles by writing the P2P_REG, which encodes mode bits and source tile coordinates. Transfers are streamed over NoC DMA planes; p2p mode routes packets directly between accelerator tiles, whereas DMA mode goes tile → memory → tile.
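Conceptually, the runtime's register write packs a small bitfield. The layout below (field positions and widths) is a hypothetical illustration, not the documented ESP encoding:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical P2P_REG layout: one enable (mode) bit plus source-tile mesh
// coordinates. Field widths/positions are assumptions for illustration.
constexpr uint32_t P2P_ENABLE_BIT = 1u << 31;

uint32_t encode_p2p_reg(bool p2p_mode, uint8_t src_x, uint8_t src_y) {
    uint32_t reg = 0;
    if (p2p_mode) reg |= P2P_ENABLE_BIT;    // mode bit: p2p vs DMA
    reg |= (uint32_t(src_y) & 0xF) << 4;    // source tile Y (4 bits assumed)
    reg |= (uint32_t(src_x) & 0xF);         // source tile X (4 bits assumed)
    return reg;
}

int main() {
    // Consumer tile pulls from a producer at mesh coordinates (1, 0).
    std::printf("P2P_REG = 0x%08x\n",
                encode_p2p_reg(true, 1, 0));  // 0x80000001
}
```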
The effective throughput of an $n$-stage pipeline is $T_{\text{pipe}} = \min_i T_i$, with per-stage throughput $T_i \propto 1/(h_i \cdot d)$, where $h_i$ is the Manhattan hop count for stage $i$'s transfers and $d$ is the per-hop NoC link delay. Empirically, enabling p2p reduces $h_i$ to 1–2 and achieves a 2–3× reduction in DRAM accesses, translating to proportional energy savings.
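Instantiating the bottleneck model directly (hop counts and link delay below are arbitrary example values, not measurements):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Effective pipeline throughput T = min_i T_i with T_i ~ 1 / (h_i * d):
// the stage with the most NoC hops bounds the whole pipeline.
int main() {
    const double d = 2.0;               // per-hop link delay (cycles, example)
    std::vector<int> hops = {1, 4, 2};  // Manhattan hops per stage (example)

    double T = 1e9;
    for (int h : hops)
        T = std::min(T, 1.0 / (h * d)); // per-stage throughput T_i

    std::printf("bottleneck throughput: %.3f transfers/cycle\n", T);  // 0.125
}
```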
Reconfigurable pipelines can use multithreaded launch, with measured throughput improvements of 4–8× over serial execution on ML benchmarks.
4. Efficient Sparse Processing-in-Memory for ML Inference
ESP4ML’s PIM contributions address the bandwidth, load-balancing, and control challenges imposed by high sparsity in transformer-style matrix-vector products (He et al., 6 Apr 2024). The system targets unstructured sparsity (80–90%) seen in pruned LLMs and operates within DRAM constraints (e.g., 16 banks, 256-bit bus, HBM2E-like timing).
Key Techniques
- Fine-Grained Interleaving: Matrix nonzeros from consecutive rows are interleaved in DRAM, so each broadcast slice is reused $k$ times, in contrast to the coarse-grained Newton design. With $k = 16$ and 90% sparsity, broadcast demand per column-read drops from 10 to 0.625 (see the interleaving sketch after this list).
- Static Data-Dependent Scheduling (SDDS): Offline, data-dependent schedule generation orchestrates broadcasts, column-reads, and necessary stall insertion. Schedules are merged across banks into a single global command stream, avoiding complex on-chip dynamic control.
- Index/Value Decoupling and Prefetching: Index and value words are stored in separate DRAM rows. Index-only reads prime per-MAC "iFIFO" buffers (depth 8), and vector elements are prefetched into per-MAC "eFIFO" buffers. SDDS inserts stalls to prevent FIFO overflow or underflow.
- Simplified Switch and Conflict Reduction: Instead of 16 × 16 crossbars, ESP4ML uses small time-multiplexed switches (four inputs per MAC), exploiting the multi-cycle interval between successive column reads. SDDS reorders nonzeros to minimize group conflicts.
- Load Balancing: Rows are sorted by density, and the densest are paired with the sparsest across banks, adopting the scheme from SparTen (see the pairing sketch below).
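To illustrate the fine-grained interleaving, here is a minimal sketch (an illustrative ordering, not the paper's exact DRAM mapping): within each group of k consecutive rows, nonzeros are stored in ascending column order, so a single broadcast of vector element x[col] serves every row in the group that needs it before the next element is fetched:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// One nonzero of the sparse matrix in (row, col, value) form.
struct NZ { int row, col; float val; };

// Order nonzeros for DRAM: row groups of size k in sequence; within a group,
// column-major, so nonzeros sharing a column consume one broadcast of x[col].
std::vector<NZ> interleave(std::vector<NZ> nzs, int k) {
    std::sort(nzs.begin(), nzs.end(), [k](const NZ &a, const NZ &b) {
        int ga = a.row / k, gb = b.row / k;        // row-group index
        if (ga != gb) return ga < gb;              // groups in order...
        if (a.col != b.col) return a.col < b.col;  // ...columns interleaved
        return a.row < b.row;
    });
    return nzs;  // DRAM storage order
}

int main() {
    // 4 rows, k = 2: rows 0-1 form one group, rows 2-3 the next.
    std::vector<NZ> m = {{0, 5, 1.f}, {1, 5, 2.f}, {0, 9, 3.f},
                         {1, 7, 4.f}, {2, 5, 5.f}, {3, 9, 6.f}};
    for (const NZ &nz : interleave(m, 2))
        std::printf("(r%d,c%d) ", nz.row, nz.col);
    std::printf("\n");  // (r0,c5) (r1,c5) (r1,c7) (r0,c9) (r2,c5) (r3,c9)
}
```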
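And a sketch of the density-based pairing, a simplified rendering of the SparTen-style scheme (row granularity and the greedy pairing rule are assumptions):

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// Sort rows by nonzero count, then pair the densest remaining row with the
// sparsest so each pair (mapped across banks) carries a near-equal workload.
std::vector<std::pair<int, int>> balance(const std::vector<int> &nnz_per_row) {
    std::vector<int> idx(nnz_per_row.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return nnz_per_row[a] > nnz_per_row[b]; });

    std::vector<std::pair<int, int>> pairs;
    if (idx.size() < 2) return pairs;
    for (size_t lo = 0, hi = idx.size() - 1; lo < hi; ++lo, --hi)
        pairs.push_back({idx[lo], idx[hi]});  // densest with sparsest
    return pairs;
}

int main() {
    std::vector<int> nnz = {50, 5, 30, 10};  // nonzeros per row (example)
    for (auto [dense, sparse] : balance(nnz))
        std::printf("pair: row %d (%d nz) + row %d (%d nz)\n",
                    dense, nnz[dense], sparse, nnz[sparse]);
    // pair: row 0 (50 nz) + row 1 (5 nz)
    // pair: row 2 (30 nz) + row 3 (10 nz)
}
```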
Performance and Efficiency
On simulated LLaMA-7B MV kernels (4096 × 4096, 4096 × 11008), ESP4ML achieves:
- 2× speedup (up to 4.2×) over Newton dense PIM at 90% sparsity
- 127× speedup over GPU on full LLaMA-7B benchmark
- 34% average (up to 63%) DRAM+compute energy saving over Newton
- Area overhead < 5% (sparse) to < 12% (hybrid) over Newton PIM
Ablation studies isolate the contributions of prefetching and conflict reduction: prefetching adds 30% speedup at 80% sparsity, and group reordering cuts stall cycles by 40% at 90% sparsity.
5. FPGA Prototype, Applications, and Evaluation
ESP4ML SoCs were prototyped on Xilinx UltraScale+ FPGA at 78 MHz (Giri et al., 2020). Example pipelines include:
| Design | LUT (%) | FF (%) | BRAM (%) | Power (W) | Frames/s | Frames/J |
|---|---|---|---|---|---|---|
| Night-Vision | 48 | 24 | 57 | 1.70 | 35,572 | 20,925 |
| Denoiser | 48 | 24 | 57 | 1.70 | 5,220 | 3,070 |
| Multi-tile Classifier | 19 | 11 | 21 | 0.98 | 28,376 | 28,900 |
Compared to Intel i7 and Jetson TX1, ESP4ML SoCs are up to two orders of magnitude more energy efficient on these ML benchmarks.
Enabling pipeline reconfiguration boosts throughput 4–8×, and p2p communication reduces DRAM traffic by a further 2–3×.
6. Extensions, Limitations, and Future Directions
ESP4ML’s automated flow, hardware-efficient architectures, and dataflow APIs constitute significant advances in embedded ML accelerator design.
- Strengths: End-to-end automation, energy efficiency, tight NoC integration, runtime pipeline management, flexibility for both dense and sparse workloads, and quantitative gains on real ML tasks.
- Limitations: Resource allocation (e.g., reuse-factor selection), NoC scale, and hardware parameter tuning still require manual adjustment or further automation. Public documentation of all accelerator templates is ongoing.
- Directions: Systematic hyperparameter search (AutoML), geometry-aware or graph-CNN cores, multi-task network support, and end-to-end deep learning (from raw input to physics parameters) are planned (Filippatos et al., 2023, Giri et al., 2020).
- Sparse PIM: Further area optimization of switch logic, deeper pipeline scaling, and balancing of energy vs. throughput under real LLM workloads remain open topics (He et al., 6 Apr 2024).
ESP4ML’s integration of automated SoC design, efficient accelerator orchestration, and advanced sparse PIM demonstrates the viability of scalable, high-throughput, low-energy ML accelerators across embedded and datacenter settings.