SATAY: FPGA Architecture for Accelerating YOLO
- SATAY is a streaming architectural toolflow that maps various YOLO models (v3–v8) onto FPGA platforms for real-time, edge-based inference.
- It implements every computational layer as a dedicated hardware block in a fully pipelined on-chip design, reducing off-chip memory dependency.
- An automated design-space exploration engine identifies Pareto-optimal trade-offs in latency, resource use, and energy efficiency for scalable deployments.
SATAY (Streaming Architecture Toolflow for Accelerating YOLO) is an end-to-end toolflow specifically designed to map a wide spectrum of YOLO object detection networks—including versions 3 through 8 and their compressed variants—onto FPGA platforms for real-time, ultra-low-latency edge inference. It achieves this via a deeply pipelined, fully on-chip streaming dataflow architecture and an automated design-space exploration (DSE) engine, yielding Pareto-optimal trade-offs in latency, resource usage, and energy efficiency. SATAY implements every computational layer as a dedicated, parameterizable hardware block, minimizes off-chip memory dependency, and leverages specialized modules and dynamic buffer allocation strategies to support the full YOLO pipeline on FPGAs (Montgomerie-Corcoran et al., 2023).
1. Streaming Dataflow Architecture and Rationale
SATAY's architecture rests on the premise that bottlenecks in FPGA-based DNN acceleration arise predominantly from high off-chip bandwidth requirements and suboptimal utilization of on-chip parallelism. All layer parameters, including weights, are stored in on-chip memories (URAM/BRAM), so inference incurs no off-chip weight fetches. Each ONNX operation (convolution, pooling, resize, activation, split/concat, addition, detection head) is realized as a dedicated hardware module, and the modules are interconnected via ready/valid FIFO channels.
In this design, fine-grained pipelining across consecutive layers achieves an initiation interval (II) close to 1, so theoretical peak throughput is dictated only by operating frequency. Back-to-back hardware blocks exploit both inter- and intra-layer parallelism, minimizing overall latency and maximizing resource utilization.
Custom hardware modules introduced in SATAY include:
- Convolution Units: Employ sliding-window line buffers and DSP arrays, with on-chip weight storage for locality.
- Max-Pooling Units: Implement sliding windows and comparator trees.
- Resize Units: Cache row data; perform pixel duplication using data-dependent MUX trees.
- Activation Units: Include Leaky ReLU (constant multiplier, sign-based MUX) and HardSwish (piecewise SiLU approximation via adders, comparators, multipliers).
- Split/Concat/Add Units: Implement feature fusion by channel-wise routing with minimal buffering.
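The two activation units can be modelled in software with a few lines. The sketch below is a behavioural reference only, not SATAY's hardware description: the fixed-point slope (13/128, roughly 0.1) and the function names are assumptions chosen for illustration, and HardSwish follows its standard definition, x · clamp(x + 3, 0, 6) / 6.

```python
def leaky_relu_fixed(x: int, slope_num: int = 13, slope_shift: int = 7) -> int:
    """Leaky ReLU on a signed integer activation.

    The negative branch multiplies by the constant slope_num / 2**slope_shift
    (an assumed value of about 0.1). The sign of x selects which branch is
    output, mirroring a sign-controlled multiplexer.
    """
    scaled = (x * slope_num) >> slope_shift  # constant-multiplier path
    return x if x >= 0 else scaled           # sign-based select


def hardswish(x: float) -> float:
    """HardSwish: x * clamp(x + 3, 0, 6) / 6, a piecewise stand-in for SiLU."""
    gate = min(max(x + 3.0, 0.0), 6.0)  # one adder, then two comparators
    return x * gate / 6.0               # multiplier and constant scaling


positive = leaky_relu_fixed(5)     # passes through unchanged
negative = leaky_relu_fixed(-128)  # scaled by roughly 0.1
saturated = hardswish(3.0)         # gate is fully open, output equals input
```

Both functions need only a multiplier, an adder, and comparisons, which is why they map onto small dedicated blocks rather than lookup tables.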
2. Performance Modeling and Pipelining
The SATAY performance model allows analytical reasoning about per-layer and end-to-end latency and throughput:
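- Convolution Latency (node $n$, parallelism $p_n$): $l_n(p_n) = \dfrac{H_n W_n C_n F_n K_n^2}{p_n}$ cycles, for an $H_n \times W_n$ feature map with $C_n$ input channels, $F_n$ filters, and a $K_n \times K_n$ kernel.
- Other Operations (pooling, activation, etc.): $l_n(p_n) = \dfrac{H_n W_n C_n}{p_n}$ cycles.
- Pipeline Latency: $L = \max_n l_n(p_n) + \sum_n d_n$,
where $d_n$ is the hardware pipeline depth of node $n$ (in cycles).
- Peak Throughput: $T = f_{\text{clk}} / \max_n l_n(p_n)$ frames per second, for clock frequency $f_{\text{clk}}$.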
By mapping all computations as hardware pipelines, SATAY’s architecture supports highly concurrent processing, matching the streaming paradigm's throughput upper bounds.
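The model can be exercised with a small calculation. The sketch below assumes that each node's latency is its per-frame workload divided by its parallelism, and that end-to-end latency is the slowest node plus the sum of the pipeline depths. The `Node` class, the two-node example, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    workload: int     # operations per frame, e.g. H*W*C*F*K*K for a convolution
    parallelism: int  # parallelism factor of this node
    depth: int        # hardware pipeline depth in cycles

    def latency_cycles(self) -> float:
        # Assumed model: workload spread evenly over the parallel units.
        return self.workload / self.parallelism


def pipeline_latency_cycles(nodes) -> float:
    # The slowest node bounds the streaming rate; pipeline depths add fill time.
    return max(n.latency_cycles() for n in nodes) + sum(n.depth for n in nodes)


def peak_fps(nodes, f_clk_hz: float) -> float:
    # Steady-state frame rate is set by the slowest node alone.
    return f_clk_hz / max(n.latency_cycles() for n in nodes)


nodes = [
    Node("conv", workload=64 * 64 * 16 * 32 * 9, parallelism=64, depth=100),
    Node("pool", workload=64 * 64 * 32, parallelism=1, depth=20),
]
total_cycles = pipeline_latency_cycles(nodes)  # 294912 + 120 = 295032
latency_ms = 1e3 * total_cycles / 200e6        # at an assumed 200 MHz clock
fps = peak_fps(nodes, 200e6)
```

In this toy case the convolution dominates even at a parallelism of 64, so any further speed-up has to come from raising that node's parallelism rather than the pooling node's.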
3. Buffering: On-Chip and Off-Chip Trade-Offs
In streaming pipelines with skip connections, feature maps may need to be buffered until downstream blocks are ready. On-chip RAM accommodates weights (50–89% of BRAM/URAM), sliding window line buffers (4–20%), and skip-connection FIFOs (7–30%). To mitigate cases where skip FIFOs exceed on-chip availability, SATAY dynamically moves the largest FIFOs off-chip, employing a software FIFO on the ARM Processing System (PS) with burst DMA.
Buffer bandwidth and sizing are quantified as:
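- Off-Chip Buffer Bandwidth (for buffer $b$): $BW_b = \dfrac{N_b \, w_a}{L}$,
where $N_b$ is the number of activation words transferred through buffer $b$ per inference, $w_a$ is the activation bit-width, and $L$ is the total latency.
- On-Chip Buffer Size: $S_b = D_b \, w_a$ bits for a FIFO of depth $D_b$ words (if allocated on-chip).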
A greedy buffer allocation algorithm sorts buffers by size and iteratively moves the largest off-chip until total on-chip allocation fits resource constraints. Empirical evaluation on YOLOv5n (640×640) deployed on ZCU104 shows that moving the five largest skip FIFOs off-chip reduces buffer RAM by 56% (17% of on-chip memory), increasing bandwidth occupancy only to 2.15 Gbps (from 1.6 Gbps), well below typical DDR limits (135 Gbps).
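The greedy step is short enough to sketch directly. The version below is an illustration under stated assumptions, not the published Algorithm 2: buffer sizes are given in bits, the budget is whatever on-chip RAM remains for skip FIFOs, and the function and buffer names are hypothetical.

```python
def allocate_buffers(buffer_bits: dict, on_chip_budget_bits: int):
    """Move the largest buffers off-chip until the rest fit the budget.

    Returns (on_chip, off_chip): a dict of buffers kept on-chip and a list
    of buffer names relocated off-chip, largest first.
    """
    on_chip = dict(buffer_bits)
    off_chip = []
    # Visit buffers from largest to smallest.
    for name in sorted(buffer_bits, key=buffer_bits.get, reverse=True):
        if sum(on_chip.values()) <= on_chip_budget_bits:
            break  # the remaining buffers already fit
        off_chip.append(name)
        del on_chip[name]
    return on_chip, off_chip


skip_fifos = {"skip0": 800, "skip1": 500, "skip2": 200, "skip3": 100}
kept, moved = allocate_buffers(skip_fifos, on_chip_budget_bits=400)
# kept  -> {"skip2": 200, "skip3": 100}
# moved -> ["skip0", "skip1"]
```

Moving the largest buffers first frees the most RAM per relocated FIFO, which keeps the number of off-chip DMA streams small.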
4. Automated Toolflow and Design-Space Exploration
SATAY automates the end-to-end compilation of YOLO models to FPGA bitstreams via three phases:
- Parsing: Converts a floating-point ONNX model into an intermediate computational graph and simulates post-training quantization with a per-layer scale and zero-point.
- Design-Space Exploration (DSE):
  - Quantization Sweep: Searches over weight bit-widths, alongside the activation bit-width, so that the mAP drop stays within a target bound for all YOLO variants.
  - DSP Allocation Algorithm (Algorithm 1): Greedily increments the parallelism factor $p_n$ of the layer that yields the largest latency reduction per DSP, until the device's DSP budget is exhausted.
  - Buffer Allocation Algorithm (Algorithm 2): Relocates the largest skip FIFOs off-chip as needed to fit on-chip memory.
- Generation: Emits parameterized HLS/Vivado blocks, connects nodes via ready/valid FIFOs, inserts DMA interfaces for off-chip FIFOs, and performs place-and-route on the target FPGA.
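The DSP allocation step can be sketched as a simple greedy loop. This is a minimal illustration rather than the published Algorithm 1. It assumes each layer's latency is its workload divided by its parallelism, that one unit of parallelism costs a fixed number of DSPs, and that "latency reduction" refers to the layer's own latency. It also ignores any divisibility constraints that real hardware places on the parallelism factor. All names and numbers are hypothetical.

```python
def latency_gain(workload: float, p: int) -> float:
    """Cycles saved in one layer by raising its parallelism from p to p + 1."""
    return workload / p - workload / (p + 1)


def allocate_dsps(workloads: dict, dsp_budget: int, dsp_per_unit: int = 1) -> dict:
    """Greedily assign parallelism to the layer with the best gain per DSP."""
    parallelism = {name: 1 for name in workloads}
    used = dsp_per_unit * len(workloads)
    if used > dsp_budget:
        raise ValueError("budget cannot cover one unit of parallelism per layer")
    while used + dsp_per_unit <= dsp_budget:
        # Pick the layer whose next increment saves the most cycles per DSP.
        best = max(
            workloads,
            key=lambda n: latency_gain(workloads[n], parallelism[n]) / dsp_per_unit,
        )
        parallelism[best] += 1
        used += dsp_per_unit
    return parallelism


layers = {"conv0": 1000, "conv1": 600, "conv2": 100}
result = allocate_dsps(layers, dsp_budget=8)
# result -> {"conv0": 4, "conv1": 3, "conv2": 1}
```

Because the gain shrinks as a layer's parallelism grows, the loop naturally spreads DSPs toward whichever layer is currently slowest, which tends to balance the pipeline.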
Exposed top-level parameters include the weight and activation bit-widths ($w_w$, $w_a$), the per-layer parallelism $p_n$, the placement of each buffer (on-chip or off-chip), and the clock frequency (200–270 MHz typical).
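The bit-width parameters tie back to the scale and zero-point quantization simulated during parsing. The sketch below shows generic affine post-training quantization of one layer's values to a chosen bit-width. It is an assumed, textbook formulation and not necessarily the exact scheme SATAY uses; the function names and sample values are hypothetical.

```python
def quantize(values, bits: int = 8):
    """Affine quantization of a list of floats to signed integers."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Include zero in the range so that 0.0 stays representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale: float, zero_point: int):
    """Map quantized integers back to approximate real values."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights, bits=8)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Sweeping `bits` and measuring the resulting accuracy is the software analogue of the quantization sweep: fewer bits shrink on-chip weight storage but increase `max_error`.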
5. Empirical Performance and Comparisons
SATAY demonstrates competitive performance and energy characteristics across various FPGA devices and GPU baselines. Experimental results include:
| Model (input size) | FPGA device (clock) | Latency (ms) | Power (W) | Throughput (GOP/s) | GOP/s/DSP |
|---|---|---|---|---|---|
| YOLOv3-Tiny (416×416) | VCU118 (255 MHz) | 6.8 | 42.9 | 875.7 | 0.13 |
| YOLOv5s (640×640) | VCU118 (270 MHz) | 14.9 | 67.0 | 1219.8 | 0.24 |
| YOLOv8s (640×640) | VCU118 (240 MHz) | 24.5 | 57.4 | 1244.0 | 0.18 |
In terms of GOP/s/DSP, SATAY outperforms contemporary FPGA accelerators: it reaches up to 0.24 for YOLOv3-Tiny (vs. 0.07–0.18 for prior work) and 0.22–0.24 for YOLOv5s (vs. 0.03).
Energy efficiency analysis for YOLOv5n (320×320) yields a latency of 9.83 ms on the ZCU104 FPGA (14.8 W, 142.6 mJ per inference) compared to 10.73 ms on the Jetson TX2 (6.6 W, 70.7 mJ per inference). The FPGA is approximately 1.1× faster but draws more power, and so consumes about twice the energy per inference. Across YOLO models, SATAY accelerators are on average 79× faster than ARM Cortex-A72 CPU inference and 3.6× faster than the Jetson TX2 GPU.
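The energy and speed-up figures follow from simple arithmetic, since energy per inference is power multiplied by latency. The short check below recomputes them from the rounded latency and power values quoted above. The recomputed energies land within a few percent of the reported ones, presumably because the published figures were derived from unrounded measurements (an assumption).

```python
def energy_mj(latency_ms: float, power_w: float) -> float:
    """Energy per inference in millijoules (ms * W = mJ)."""
    return latency_ms * power_w


fpga_mj = energy_mj(9.83, 14.8)   # ZCU104, reported as 142.6 mJ
gpu_mj = energy_mj(10.73, 6.6)    # Jetson TX2, reported as 70.7 mJ
speedup = 10.73 / 9.83            # FPGA latency advantage, about 1.1x
energy_ratio = 142.6 / 70.7       # FPGA energy relative to GPU, about 2x

print(f"FPGA {fpga_mj:.1f} mJ, GPU {gpu_mj:.1f} mJ")
print(f"speed-up {speedup:.2f}x, energy ratio {energy_ratio:.2f}x")
```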
6. Design Trade-Offs and Scalability
SATAY supports broad YOLO model coverage by instantiating or reconfiguring hardware blocks and adjusting DSE parameters, enabling rapid adaptation to YOLOv3, v5, and v8, as well as "nano" and "small" variants. Resource/latency trade-offs follow from parallelism scaling: for example, YOLOv5s with per-layer parallelism $p_n$ as high as 16 reaches 14.9 ms on VCU118, while lower $p_n$ accommodates smaller FPGAs at the cost of increased latency.
Quantization to 8-bit weights keeps the mAP loss on COCO-Val2017 within the quantization sweep's bound for all YOLO versions. Off-chip FIFO placement reduces on-chip RAM consumption by up to 17%, with off-chip bandwidth of about 2 Gbps (2.15 Gbps in the YOLOv5n example), well below the bandwidth of available DDR channels.
A plausible implication is that the toolflow’s parameterization and algorithmic buffer placement enable efficient scalability from resource-constrained edge FPGAs to high-end accelerator platforms with minimal model-specific engineering.
7. Role in Edge-Based Real-Time Deployment
SATAY addresses the challenge of deploying computationally demanding object detection models in latency-sensitive edge scenarios such as autonomous vehicles and medical imaging. By combining streaming principles, hardware specialization, and quantitative design-space search, SATAY enables real-time YOLO inference on widely available FPGAs, with latencies of roughly 7–25 ms in the configurations reported above, matching or exceeding GPU baselines under comparable operational constraints (Montgomerie-Corcoran et al., 2023).