Double-Buffering-Aware Interconnect
- Double-Buffering-Aware Interconnect is an architecture that employs two alternating buffers to decouple producer and consumer operations, enabling concurrent data transfer and computation.
- It enhances system performance in many-core, CGRA, and real-time SoC designs by overlapping memory transfers with computation and minimizing latency.
- Recent implementations demonstrate that techniques such as triple-buffering, FIFO insertion, and credit-based arbitration can significantly boost throughput while balancing area and power constraints.
A double-buffering-aware interconnect is an interconnection architecture or generation methodology that explicitly supports and exploits double-buffering in digital systems, particularly in many-core platforms, coarse-grained reconfigurable arrays (CGRAs), and real-time system-on-chip (SoC) designs. Double-buffering—the use of two alternating buffers per data stream to decouple producer and consumer operations—enables the overlap of computation and communication, thus maximizing throughput and minimizing resource idling due to memory or communication latency. Increasing system integration, deeper memory hierarchies, and more heterogeneous computational elements have prompted significant research into interconnects that are aware of and optimized for double-buffered workloads, as reflected in recent work across many-core CNN accelerators (2006.12274), CGRA interconnect generators (2211.17207), and real-time SoC arbitration mechanisms (2311.09662).
1. Principles of Double-Buffering in Digital Architectures
Double-buffering refers to maintaining two buffers so that one can be filled (written) by a producer while the other is processed (read) by a consumer, thus decoupling their operations. In interconnect design, being "double-buffering-aware" entails providing both micro-architectural and protocol-level support to ensure that the timing and control of data movement between buffers and across the interconnect does not stall pipelines, even in the presence of variable memory or peer latency.
The practical purpose of double-buffering in these contexts is to:
- Overlap memory transfers (e.g., DMA) with computation, avoiding compute stalls.
- Reduce effective memory and interconnect latency as seen by processing elements.
- Smooth or hide the variance from intermittent high-latency accesses or communication congestion.
In advanced designs, triple buffering may be used to further decouple prefetch, compute, and write-back phases, but the architectural considerations remain closely related.
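To make the basic ping-pong idea concrete, the following minimal sketch (illustrative only, not taken from the cited papers) overlaps a transfer and a compute step using two alternating buffers; `dma_load()` and `compute()` are assumed placeholder callables for an asynchronous transfer and a processing step.

```python
import threading

def process_stream(tiles, dma_load, compute):
    """Double-buffered ping-pong loop: compute on one buffer while filling the other."""
    if not tiles:
        return
    buffers = [None, None]                    # the two alternating buffers
    buffers[0] = dma_load(tiles[0])           # prefetch the first tile

    def prefetch(slot, tile):                 # producer side: fill a buffer
        buffers[slot] = dma_load(tile)

    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        worker = None
        if i + 1 < len(tiles):                # start fetching tile i+1 ...
            worker = threading.Thread(target=prefetch, args=(nxt, tiles[i + 1]))
            worker.start()
        yield compute(buffers[cur])           # ... while computing on tile i
        if worker is not None:
            worker.join()                     # stall only if the transfer is slower
```

The consumer stalls only when the transfer takes longer than the compute step, which is exactly the behavior a double-buffering-aware interconnect tries to preserve under variable latency.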
2. Mapping and Optimization for Double-Buffering-Aware NoCs and Many-Core Systems
In "Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect" (2006.12274), double- and triple-buffering are central to achieving efficient dataflow mapping for CNN workloads.
Key aspects of their mapping strategy include:
- Single-core optimizations: Tiling and unrolling parameters are selected so that the working set per tile fits on-chip, with the inner-loop runtime being constrained by the maximum of compute cycles and DRAM transfer cycles. Buffer constraints require that all tile data, weights, and intermediate results (such as partial sums) are stored in fast local SRAM, using triple-buffering to ensure continuous operation.
- Many-core extensions: Layers are sliced along feature map and channel dimensions, with each slice assigned to a processing core. Interconnect support for rapid data movement between DRAM and core-local SRAM is critical, and double/triple-buffering at the on-chip interconnect ensures that data prefetch (for the next tile), compute (on the current tile), and store (for the previous tile) can happen concurrently.
The formalization of runtime, DRAM access, and buffer allocation constraints (see Eqs. 13–25, notably Eq. 23, in (2006.12274)) directly incorporates the number of buffer slots per data class. The empirical findings demonstrate that, with sufficient buffering and intelligent interconnect mapping, core utilization remains high except when limited by DRAM bandwidth or NoC contention.
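The flavor of these constraints can be sketched as follows (an illustrative simplification, not the exact formulation of (2006.12274)): the per-tile inner-loop runtime is bounded by whichever is slower, compute or DRAM transfer, and the on-chip buffer allocation must fit all double-/triple-buffered data classes in local SRAM.

```python
def tile_runtime_cycles(macs_per_tile, macs_per_cycle,
                        bytes_per_tile, dram_bytes_per_cycle):
    """Inner-loop runtime with full overlap: the shorter phase is hidden."""
    compute_cycles = macs_per_tile / macs_per_cycle
    transfer_cycles = bytes_per_tile / dram_bytes_per_cycle
    return max(compute_cycles, transfer_cycles)

def sram_fits(tile_bytes_per_class, buffer_slots_per_class, sram_bytes):
    """Buffer constraint: every data class reserves one slot per buffer stage.

    Example (hypothetical sizes): tile_bytes_per_class = {"ifmap": 16_384,
    "weights": 4_096, "psum": 8_192} with 3 slots each for triple buffering.
    """
    footprint = sum(tile_bytes_per_class[c] * buffer_slots_per_class[c]
                    for c in tile_bytes_per_class)
    return footprint <= sram_bytes
```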
A summary of optimized system parameters is provided in the following table:
| Level | Parameters Optimized | Constraints | Buffering |
|---|---|---|---|
| Single Core | Tiling, unrolling | SRAM size, HW parallelism | Triple buffer |
| Many Core | Slice, core assignment | Load balance, NoC/DRAM contention | Triple buffer |
Triple-buffering supports out-of-order DMA transactions and hides DRAM latency, essential for high-throughput CNN layers.
3. Programmable Interconnects for Double-Buffering in CGRAs
"Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays" (2211.17207) addresses double-buffering at the interconnect generation and hardware synthesis stage. In Canal:
- Graph-based IR: The interconnect is specified as a directed graph where nodes can represent switch boxes, connection boxes, and explicitly controlled buffer elements (such as FIFO buffers). For double-buffering, buffer nodes annotated with depth 2 are inserted between producer-consumer pairs or along critical edges.
Example Canal eDSL pseudocode (a minimal sketch of inserting a depth-2 FIFO between two tiles):

```python
# Insert a depth-2 FIFO node at grid coordinates (1, 1) to double-buffer
# the link between tile1's output and tile2's input.
fifo_node = Node(type="FIFO", depth=2, x=1, y=1)
tile1_output.add_edge(fifo_node)   # producer -> FIFO
fifo_node.add_edge(tile2_input)    # FIFO -> consumer
```
- Ready-Valid Signaling: The interconnect's ready/valid protocol is extended to handle dynamic flow control—depth-2 FIFOs or distributed split FIFOs (registers chained across multiple switch boxes) provide buffer slots for double buffering, efficiently handling two in-flight data elements.
- Switch box topology, routing tracks, and buffer placement: The number and location of buffered (double-buffered) connections are key design parameters during interconnect synthesis. More flexible switch boxes support dynamic rerouting in the case of buffer backpressure. Increasing routing track counts supports higher in-flight data rates to fully exploit double-buffered bandwidth.
Empirical findings include area/performance trade-offs: adding depth-2 FIFOs to all switch boxes increases area by 54%, but Canal's split FIFO optimization reduces overhead to 32%. The methodology enables programmatic exploration of buffering strategies and the resulting effect on throughput, latency, and area.
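To illustrate the ready-valid buffering described above, the following behavioral model (an assumption-laden sketch, not Canal's generated RTL) shows how a depth-2 FIFO allows two in-flight data elements while applying backpressure to the producer.

```python
from collections import deque

class DepthTwoFifo:
    """Behavioral model of a depth-2 FIFO with ready/valid flow control."""

    def __init__(self):
        self._slots = deque()

    @property
    def ready(self):                 # ready: FIFO can accept a new element
        return len(self._slots) < 2

    @property
    def valid(self):                 # valid: FIFO has an element to hand on
        return len(self._slots) > 0

    def push(self, data):            # producer side, legal only when ready
        assert self.ready, "backpressure: producer must stall"
        self._slots.append(data)

    def pop(self):                   # consumer side, legal only when valid
        assert self.valid, "underflow: consumer must wait"
        return self._slots.popleft()
```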
4. Arbitration for Double-Buffering-Aware Interconnects in Real-Time SoCs
"AXI-REALM: A Lightweight and Modular Interconnect Extension for Traffic Regulation and Monitoring of Heterogeneous Real-Time SoCs" (2311.09662) provides fine-grained bandwidth partitioning and latency minimization for double-buffered traffic in modern SoCs.
Key features include:
- Credit-Based Bandwidth Regulation: Each manager (e.g., CPU core, DMA engine) receives a periodic bandwidth budget B over a reservation interval T, enforced per interconnect region. This restricts the amount of traffic a double-buffered engine can inject, preventing starvation of latency-sensitive actors. When a transaction of size n is issued, the remaining budget b is decremented by n; if b < n for the next pending transaction, that transaction is stalled until the budget is replenished to B at the start of the following interval (a behavioral sketch of this scheme follows the list below).
- Burst Fragmentation and Buffering Minimization: Long, bursty memory transfers characteristic of double-buffered DMA are fragmented into short segments, ensuring fair interleaving and low-latency service for all managers. The write buffer holds only enough data to cover the current fragmented burst, minimizing area and latency overhead (area overhead measured at 2.45%).
- Monitoring and Tuning: A dedicated unit collects per-manager, per-region traffic and latency statistics. This helps diagnose contention, adjust budgets, and optimize performance.
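The sketch below (illustrative only; the class name, parameters, and interface are assumptions, not AXI-REALM's actual RTL or API) combines the two mechanisms above: a per-manager credit regulator that fragments long double-buffered bursts and forwards a fragment only when enough budget remains, replenishing the budget every reservation interval.

```python
class CreditRegulator:
    """Toy model of credit-based regulation with burst fragmentation."""

    def __init__(self, budget_beats, period_cycles, fragment_beats):
        assert budget_beats >= fragment_beats, "budget must cover one fragment"
        self.budget = budget_beats            # B: beats allowed per interval
        self.period = period_cycles           # T: reservation interval in cycles
        self.fragment = fragment_beats        # max beats forwarded at once
        self.remaining = budget_beats
        self.cycle = 0

    def tick(self, cycles=1):
        self.cycle += cycles
        if self.cycle >= self.period:         # replenish at the interval boundary
            self.cycle -= self.period
            self.remaining = self.budget

    def issue(self, burst_beats):
        """Split a long burst into fragments; yield only what the budget allows."""
        while burst_beats > 0:
            beats = min(burst_beats, self.fragment)
            while self.remaining < beats:     # stall until budget is replenished
                self.tick()
            self.remaining -= beats
            burst_beats -= beats
            yield beats                       # forward one short fragment

# Example (hypothetical numbers): a 256-beat double-buffered DMA burst,
# 16-beat fragments, 64 beats of budget per 1024-cycle interval.
# regulator = CreditRegulator(budget_beats=64, period_cycles=1024, fragment_beats=16)
# fragments = list(regulator.issue(256))
```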
Empirical evidence demonstrates the impact: in a system with a DMA issuing 256-beat bursts (double-buffered transfer), the CPU's performance under contention is only 0.7% of its isolated maximum; introducing AXI-REALM's burst fragmentation and crediting recovers CPU throughput to 68.2% (for equal bandwidth) and over 95% (when configured in favor of the CPU), while worst-case memory access latency drops below 8 cycles.
5. Design Trade-Offs and System-Level Implications
Double-buffering-aware interconnect architectures must navigate several trade-offs:
- Buffer Placement and Depth: Sufficient buffering is required to decouple computation and communication, but excessive buffering increases area and can complicate timing closure.
- Empirical results show that judicious use of double or triple buffering (rather than indiscriminate insertion) achieves high throughput with manageable area impact (2211.17207, 2006.12274).
- Bandwidth Partitioning versus Utilization: Credit-based schemes (2311.09662) enforce determinism and prevent starvation, but setting overly stringent budgets can leave resources underutilized. Dynamic monitoring can mitigate this by adjusting parameters at runtime.
- Interconnect Topology and Flexibility: Highly flexible switch box topologies and increased track counts (2211.17207) enable robust routing around full buffers but come at a cost in area and complexity.
- Scalability Limits: While double-buffering-aware designs can maximize per-core throughput and hide memory latency up to a point, global bandwidth limitations (e.g., DRAM interface width, NoC bisection) set a ceiling on achievable scaling (2006.12274). In many-core CNN accelerators, performance saturates beyond 14–16 cores for real network layers due to DRAM bottlenecks.
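The saturation point can be made concrete with a simple roofline-style bound (illustrative, not a formula from (2006.12274)): even with perfect double-buffering, aggregate throughput cannot exceed what the shared DRAM interface can feed.

```python
def max_throughput_ops_per_s(n_cores, ops_per_core_per_s,
                             dram_bytes_per_s, bytes_per_op):
    """Upper bound on system throughput: compute-bound or DRAM-bandwidth-bound."""
    compute_bound = n_cores * ops_per_core_per_s
    bandwidth_bound = dram_bytes_per_s / bytes_per_op
    return min(compute_bound, bandwidth_bound)   # scaling saturates at the DRAM bound
```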
6. Challenges and Directions
While double-buffering-aware interconnects offer improved throughput and latency tolerance, several persistent challenges remain:
- Deadlock and Livelock: Complex interconnect topologies with distributed flow control and cyclic dependencies can deadlock if buffer management is not correctly architected (2211.17207).
- Verification of Timing Guarantees: Especially in mixed-criticality systems, careful credit assignment, buffer sizing, and arbitration policy selection are required to ensure timing predictability with minimal area overhead (2311.09662).
- Area and Power Optimization: Shared FIFO designs, split buffers, and careful topological and buffer-planning are necessary to keep area and power growth contained as buffering is adopted system-wide (2211.17207).
A plausible implication is that future design methodologies will tightly couple buffer management, interconnect generation, and application mapping to co-optimize throughput, area, and determinism in the presence of double-buffered workloads.
7. Comparative Features and Practical Summary
| Interconnect/Approach | Buffering Mechanism | Flow Control / Arbitration | Impact |
|---|---|---|---|
| Many-core CNN NoC (2006.12274) | Triple buffer (SRAM ofmaps) | Dataflow scheduling, NoC-aware | Hides latency, enables concurrency, limited by DRAM bandwidth |
| Canal CGRA interconnect (2211.17207) | Inserted FIFOs (depth 2), split FIFO | Ready-valid handshake, graph IR | Flexible, tunable buffering, area/performance trade-off |
| AXI-REALM (2311.09662) | Burst fragmentation, minimal buffer | Credit-based, periodic budget | Predictable latency, starvation protection, low area overhead |
These approaches demonstrate that double-buffering awareness is now a key architectural feature of high-performance, scalable, and predictable interconnects in modern accelerator and SoC design. The explicit modeling and optimization of buffering, flow control, and arbitration are essential to unlock high throughput and latency hiding in the face of growing heterogeneity and integration density.