SATAY: FPGA Architecture for Accelerating YOLO

Updated 31 March 2026

SATAY is a streaming architectural toolflow that maps various YOLO models (v3–v8) onto FPGA platforms for real-time, edge-based inference.
It implements every computational layer as a dedicated hardware block in a fully pipelined on-chip design, reducing off-chip memory dependency.
An automated design-space exploration engine identifies Pareto-optimal trade-offs in latency, resource use, and energy efficiency for scalable deployments.

SATAY (Streaming Architecture Toolflow for Accelerating YOLO) is an end-to-end toolflow specifically designed to map a wide spectrum of YOLO object detection networks—including versions 3 through 8 and their compressed variants—onto FPGA platforms for real-time, ultra-low-latency edge inference. It achieves this via a deeply pipelined, fully on-chip streaming dataflow architecture and an automated design-space exploration (DSE) engine, yielding Pareto-optimal trade-offs in latency, resource usage, and energy efficiency. SATAY implements every computational layer as a dedicated, parameterizable hardware block, minimizes off-chip memory dependency, and leverages specialized modules and dynamic buffer allocation strategies to support the full YOLO pipeline on FPGAs (Montgomerie-Corcoran et al., 2023).

1. Streaming Dataflow Architecture and Rationale

SATAY’s architecture rests on the premise that bottlenecks in FPGA-based DNN acceleration arise mainly from high off-chip bandwidth requirements and poor utilization of on-chip parallelism. All layer parameters, including weights, are stored in on-chip memories (URAM/BRAM), so inference incurs no off-chip weight fetches. Each ONNX operation (convolution, pooling, resize, activation, split/concat, addition, detection head) is realized as a dedicated hardware module, and modules are interconnected via ready/valid FIFO channels.

In this design, fine-grained pipelining across consecutive layers achieves an initiation interval (II) close to 1, so theoretical peak throughput $T_{throughput} \approx f_{clk}/II$ is dictated only by operating frequency. Back-to-back hardware blocks exploit both inter- and intra-layer parallelism, minimizing overall latency and maximizing resource utilization.
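The effect of the initiation interval on a chain of ready/valid-connected stages can be illustrated with a toy cycle-level model. This is a minimal sketch, not SATAY's hardware: the function name `simulate_stream`, the FIFO depth, and the stage configuration are all hypothetical choices made for illustration.

```python
from collections import deque

def simulate_stream(num_items, stage_iis, fifo_depth=2):
    """Toy cycle-level model of stages linked by bounded ready/valid FIFOs.

    stage_iis[i] is the initiation interval (cycles per item) of stage i.
    Returns the cycle at which the last item leaves the final stage.
    """
    n = len(stage_iis)
    # fifos[i] feeds stage i; fifos[n] is an unbounded output sink.
    fifos = [deque() for _ in range(n + 1)]
    fifos[0].extend(range(num_items))   # the source already holds every input
    busy = [0] * n                      # cycles left before stage i may fire again
    cycle = 0
    while len(fifos[n]) < num_items:
        cycle += 1
        # Visit stages last-to-first so a slot freed this cycle is seen upstream.
        for i in reversed(range(n)):
            if busy[i] > 0:
                busy[i] -= 1
                continue
            valid = len(fifos[i]) > 0                               # upstream has data
            ready = (i + 1 == n) or len(fifos[i + 1]) < fifo_depth  # downstream has room
            if valid and ready:
                fifos[i + 1].append(fifos[i].popleft())
                busy[i] = stage_iis[i] - 1
    return cycle

# With II = 1 everywhere the chain sustains one item per cycle after fill-up;
# a single stage with II = 2 roughly halves the steady-state rate.
print(simulate_stream(100, [1, 1, 1]))
print(simulate_stream(100, [1, 2, 1]))
```

In this model the slowest stage sets the steady-state rate, which is why the toolflow's parallelism allocation targets the layers that dominate latency.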

Custom hardware modules introduced in SATAY include:

Convolution Units: Employ sliding-window line buffers and $K \times K$ DSP arrays, with on-chip weight storage for locality.
Max-Pooling Units: Implement sliding windows and comparator trees.
Resize Units: Cache row data; perform pixel duplication using data-dependent MUX trees.
Activation Units: Include Leaky ReLU (constant multiplier, sign-based MUX) and HardSwish (piecewise SiLU approximation via adders, comparators, multipliers).
Split/Concat/Add Units: Implement feature fusion by channel-wise routing with minimal buffering.
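The two activation functions listed above can be written as short reference models. The sketch below uses floating point for readability, whereas the hardware units operate on fixed-point data; the Leaky ReLU slope of 0.1 is an assumed default, and HardSwish is given in its standard form $x \cdot \mathrm{ReLU6}(x+3)/6$.

```python
import math

def leaky_relu(x, alpha=0.1):
    """Sign-based select between x and a constant multiple of x (alpha is assumed)."""
    return x if x >= 0 else alpha * x

def hard_swish(x):
    """Standard HardSwish: x * clamp(x + 3, 0, 6) / 6 (adder, comparators, multiplier)."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0

def silu(x):
    """Exact SiLU, shown only to compare against the piecewise approximation."""
    return x / (1.0 + math.exp(-x))

# Largest gap between HardSwish and SiLU on a grid over [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(hard_swish(x) - silu(x)) for x in grid)
print(f"max |hard_swish - silu| on [-6, 6]: {max_gap:.3f}")
```

The piecewise form avoids the exponential entirely, which is what makes it cheap to build from adders, comparators, and multipliers.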

2. Performance Modeling and Pipelining

The SATAY performance model allows analytical reasoning about per-layer and end-to-end latency and throughput:

Convolution Latency (node $n$, parallelism $p_n$):

$l(n, p) = \frac{H_n \cdot W_n \cdot C_n \cdot F_n}{p_n \cdot f_{clk}}$

Other Operations (pooling, activation, etc.):

$l(n, p) = \frac{H_n \cdot W_n \cdot C_n}{p_n \cdot f_{clk}}$

Pipeline Latency:

$L(p) = \max_n l(n, p) + \sum_n \frac{d_n}{f_{clk}}$

where $d_n$ is the hardware pipeline depth of node $n$ (in cycles).

Peak Throughput: $T_{throughput} = f_{clk} / II$, for $II \approx 1$.

By mapping all computations as hardware pipelines, SATAY’s architecture supports highly concurrent processing, matching the streaming paradigm's throughput upper bounds.
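The latency expressions above translate directly into a few helper functions. This is a minimal sketch of the analytical model as stated in this section; the function names, layer shapes, and pipeline depths in the example are hypothetical values, not figures from the paper.

```python
def conv_latency(H, W, C, F, p, f_clk):
    """Convolution node latency in seconds: H*W*C*F / (p * f_clk)."""
    return (H * W * C * F) / (p * f_clk)

def other_latency(H, W, C, p, f_clk):
    """Latency of pooling, activation and similar nodes: H*W*C / (p * f_clk)."""
    return (H * W * C) / (p * f_clk)

def pipeline_latency(node_latencies, depths, f_clk):
    """L(p) = slowest node latency + sum of pipeline depths (cycles) / f_clk."""
    return max(node_latencies) + sum(d / f_clk for d in depths)

def peak_throughput(f_clk, ii=1):
    """Peak rate f_clk / II."""
    return f_clk / ii

# Hypothetical two-node pipeline at 200 MHz.
f_clk = 200e6
latencies = [
    conv_latency(80, 80, 64, 128, p=16, f_clk=f_clk),   # convolution node
    other_latency(80, 80, 128, p=1, f_clk=f_clk),       # activation node
]
depths = [50, 10]                                        # assumed pipeline depths in cycles
total = pipeline_latency(latencies, depths, f_clk)
print(f"bottleneck: {max(latencies) * 1e3:.3f} ms, total: {total * 1e3:.4f} ms")
```

Because the total is dominated by the slowest node, raising parallelism anywhere other than the bottleneck layer leaves $L(p)$ almost unchanged.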

3. Buffering: On-Chip and Off-Chip Trade-Offs

In streaming pipelines with skip connections, feature maps may need to be buffered until downstream blocks are ready. On-chip RAM accommodates weights (50–89% of BRAM/URAM), sliding window line buffers (4–20%), and skip-connection FIFOs (7–30%). To mitigate cases where skip FIFOs exceed on-chip availability, SATAY dynamically moves the largest FIFOs off-chip, employing a software FIFO on the ARM Processing System (PS) with burst DMA.

Buffer bandwidth and sizing are quantified as:

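Off-Chip Buffer Bandwidth (for buffer $n \to m$):

$b_{buf} = \frac{2 \cdot (H_{nm} \cdot W_{nm} \cdot C_{nm} \cdot w_a)}{L}$

where $H_{nm} \cdot W_{nm} \cdot C_{nm}$ is the size of the feature map carried on the edge from node $n$ to node $m$, $w_a$ is the activation bit-width, and $L$ is the total pipeline latency.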

On-Chip Buffer Size: $s_{buf} = q_{nm} \cdot w_a$ (if allocated on-chip).

A greedy buffer allocation algorithm sorts buffers by size and iteratively moves the largest off-chip until total on-chip allocation fits resource constraints. Empirical evaluation on YOLOv5n (640×640) deployed on ZCU104 shows that moving the five largest skip FIFOs off-chip reduces buffer RAM by 56% (17% of on-chip memory), increasing bandwidth occupancy only to 2.15 Gbps (from 1.6 Gbps), well below typical DDR limits (135 Gbps).
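The bandwidth formula and the greedy placement described above can be sketched as follows. This is an illustrative reading of the text, not the paper's Algorithm 2: the function names are hypothetical, the budget is assumed to be the on-chip memory left for skip FIFOs after weights and line buffers, and the factor of 2 is interpreted as one write plus one read per word.

```python
def offchip_bandwidth(H, W, C, w_a, latency_s):
    """Bits per second for one off-chip skip buffer.

    The factor 2 is read here as one write and one read of each word (assumption).
    """
    return 2 * H * W * C * w_a / latency_s

def allocate_buffers(buffer_bits, onchip_budget_bits):
    """Greedy placement: move the largest buffers off-chip until the rest fits.

    buffer_bits maps a buffer name to its size in bits.
    Returns (on_chip, off_chip) as a dict and a list of names.
    """
    on_chip = dict(buffer_bits)
    off_chip = []
    for name in sorted(buffer_bits, key=buffer_bits.get, reverse=True):
        if sum(on_chip.values()) <= onchip_budget_bits:
            break
        off_chip.append(name)
        del on_chip[name]
    return on_chip, off_chip

# Hypothetical skip buffers (sizes in bits) and a hypothetical budget.
buffers = {"a": 100, "b": 50, "c": 30, "d": 10}
on_chip, off_chip = allocate_buffers(buffers, onchip_budget_bits=45)
print(on_chip, off_chip)
```

Sorting by size first means each relocation frees the most memory per added DMA channel, which keeps the number of off-chip buffers small.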

4. Automated Toolflow and Design-Space Exploration

SATAY automates the end-to-end compilation of YOLO models to FPGA bitstreams via three phases:

Parsing: Converts a floating-point ONNX model into an intermediate computational graph and simulates post-training quantization with a per-layer scale and zero-point (scale $S = (w_{max}-w_{min})/(2^L-1)$ and zero-point $Z = \mathrm{round}(w_{min} \cdot S) + 2^{L-1}$, where $L$ here denotes the quantization bit-width rather than the pipeline latency).
Design-Space Exploration (DSE):
- Quantization Sweep: Searches weight bit-widths $w_w \in \{4,8,16\}$ (activations $w_a=16$ ) to ensure $<1\%$ mAP drop at $w_w \geq 8$ for all YOLO variants.
- DSP Allocation Algorithm (Algorithm 1): Greedily increments per-layer parallelism factor $p_n$ in the layer yielding maximal latency reduction per DSP until the device’s DSP budget $R^{DSP}$ is exhausted.
- Buffer Allocation Algorithm (Algorithm 2): Relocates largest skip FIFOs off-chip as needed for resource balance.
Generation: Emits parameterized HLS/Vivado blocks, connects nodes via ready/valid FIFOs, inserts DMA interfaces for off-chip FIFOs, and performs place-and-route on the target FPGA.
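The post-training quantization simulated in the Parsing phase can be illustrated with a conventional affine (asymmetric) quantizer. The scale below matches the expression for $S$ given above; the zero-point uses the common textbook convention for an unsigned code range, which is an assumption here and may differ in detail from the toolflow's own definition. All function names are hypothetical.

```python
def quant_params(weights, bits):
    """Per-layer scale and zero-point for an unsigned range [0, 2^bits - 1]."""
    w_min, w_max = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0  # guard constant tensors
    zero_point = round(-w_min / scale)   # textbook convention (assumption)
    return scale, zero_point

def quantize(w, scale, zero_point, bits):
    """Map a real value to an integer code, clamped to the representable range."""
    q = round(w / scale) + zero_point
    return min(max(q, 0), (1 << bits) - 1)

def dequantize(q, scale, zero_point):
    """Map an integer code back to the real value it represents."""
    return scale * (q - zero_point)

# Simulated quantization of a small, made-up weight vector at 8 bits.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zp = quant_params(weights, bits=8)
restored = [dequantize(quantize(w, scale, zp, 8), scale, zp) for w in weights]
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f}, zero_point={zp}, max error={max_err:.5f}")
```

Running the model through such a quantize/dequantize round trip is one way to estimate the accuracy impact of a bit-width before any hardware is generated.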

Exposed top-level parameters include weight and activation bit-widths ($w_w$, $w_a$), per-layer parallelism $p_n$, buffer placement $t_{nm}$, and clock frequency $f_{clk}$ (200–270 MHz typical).
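The greedy DSP allocation described in the DSE phase can be sketched in a few lines. This is a simplified reading of the description, not the paper's Algorithm 1: it assumes each unit of parallelism in layer $n$ costs a fixed number of DSPs, ignores any constraint that $p_n$ must divide the channel dimensions, and scores candidates by per-layer latency reduction per DSP. The function and argument names are hypothetical.

```python
def allocate_dsps(workloads, dsp_per_unit, dsp_budget, f_clk):
    """Greedy per-layer parallelism allocation.

    workloads[n]    : operations per frame for layer n (for example H*W*C*F)
    dsp_per_unit[n] : DSPs consumed by each unit of parallelism in layer n (assumed)
    Returns (p, used): the parallelism factors and the total DSPs consumed.
    """
    p = [1] * len(workloads)
    used = sum(dsp_per_unit)            # every layer starts with p_n = 1
    while True:
        best, best_gain = None, 0.0
        for n, (work, cost) in enumerate(zip(workloads, dsp_per_unit)):
            if used + cost > dsp_budget:
                continue                # this increment would exceed the budget
            # Latency saved by going from p_n to p_n + 1, per DSP spent.
            gain = (work / p[n] - work / (p[n] + 1)) / (f_clk * cost)
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            break                       # budget exhausted or nothing left to gain
        p[best] += 1
        used += dsp_per_unit[best]
    return p, used

# Hypothetical two-layer network: the heavy layer should receive most of the DSPs.
p, used = allocate_dsps([1000, 100], [1, 1], dsp_budget=10, f_clk=1.0)
print(p, used)
```

The diminishing return of each extra unit (the gain shrinks as $1/p_n - 1/(p_n+1)$) is what eventually shifts DSPs from the heaviest layer to lighter ones.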

5. Empirical Performance and Comparisons

SATAY demonstrates competitive performance and energy characteristics across various FPGA devices and GPU baselines. Experimental results include:

Model (input size)	FPGA (clock frequency)	Latency (ms)	Power (W)	Throughput (GOP/s)	GOP/s/DSP
YOLOv3-Tiny (416×416)	VCU118 (255 MHz)	6.8	42.9	875.7	0.13
YOLOv5s (640×640)	VCU118 (270 MHz)	14.9	67.0	1219.8	0.24
YOLOv8s (640×640)	VCU118 (240 MHz)	24.5	57.4	1244.0	0.18

In terms of GOP/s/DSP, SATAY outperforms contemporary FPGA accelerators: for YOLOv3-Tiny, achieving 0.24 (vs. 0.07–0.18); for YOLOv5s, 0.22–0.24 (vs. 0.03).

Energy efficiency analysis for YOLOv5n (320×320) on ZCU104 yields a latency of 9.83 ms (FPGA, 14.8 W, 142.6 mJ per inference) compared to 10.73 ms on the Jetson TX2 (6.6 W, 70.7 mJ per inference). The FPGA is therefore approximately 1.1× faster, but it draws more power and consumes roughly twice the energy per inference in this configuration. Across YOLO models, SATAY accelerators average 79× faster than ARM Cortex-A72 CPU inference and 3.6× faster than the Jetson TX2 GPU.
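The speed and energy ratios for the YOLOv5n comparison follow directly from the reported figures. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
# Reported YOLOv5n (320x320) figures: latency in ms, power in W, energy in mJ.
fpga = {"latency_ms": 9.83, "power_w": 14.8, "energy_mj": 142.6}
tx2 = {"latency_ms": 10.73, "power_w": 6.6, "energy_mj": 70.7}

speedup = tx2["latency_ms"] / fpga["latency_ms"]       # > 1 means the FPGA is faster
energy_ratio = fpga["energy_mj"] / tx2["energy_mj"]    # > 1 means the FPGA uses more energy
power_ratio = fpga["power_w"] / tx2["power_w"]

print(f"speedup: {speedup:.2f}x, energy: {energy_ratio:.2f}x, power: {power_ratio:.2f}x")
```

The latency advantage in this single-image comparison is small, so the choice between the two platforms here depends mostly on whether latency or energy per inference is the binding constraint.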

6. Design Trade-Offs and Scalability

SATAY supports broad YOLO model coverage by instantiating or reconfiguring hardware blocks and adjusting DSE parameters, enabling rapid adaptation to YOLOv3, v5, and v8, as well as "nano" and "small" variants. Resource/latency trade-offs follow from parallelism scaling: for example, YOLOv5s with per-layer $p_n$ as high as 16 reaches 14.9 ms on VCU118 ($R^{DSP}=6815$); lower $p_n$ accommodates smaller FPGAs at the cost of increased latency.

Quantization to 8-bit weights ($w_w=8$, $w_a=16$) incurs $<1\%$ mAP loss on COCO-Val2017 for all YOLO versions. Off-chip FIFO placement reduces on-chip RAM consumption by up to 17%, with off-chip bandwidth remaining at roughly 2 Gbps (2.15 Gbps in the YOLOv5n example on ZCU104), far below the bandwidth available from DDR.

A plausible implication is that the toolflow’s parameterization and algorithmic buffer placement enable efficient scalability from resource-constrained edge FPGAs to high-end accelerator platforms with minimal model-specific engineering.

7. Role in Edge-Based Real-Time Deployment

SATAY addresses the challenge of deploying computationally demanding object detection models in latency-sensitive edge scenarios such as autonomous vehicles and medical imaging. By combining streaming principles, hardware specialization, and quantitative design-space search, SATAY enables real-time YOLO inference with latencies of roughly 7 to 25 ms in the reported results, matching or exceeding GPU baselines in speed under comparable operational constraints (Montgomerie-Corcoran et al., 2023).

References (1)

SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices (2023)
