
Fused-layer Dataflow in DNN Accelerators

Updated 18 November 2025
  • Fused-layer dataflow is an architectural approach that groups consecutive DNN layers to retain intermediate activations on-chip, thus reducing redundant DRAM transfers.
  • It optimizes energy, latency, and memory bandwidth by tailoring tiling, recomputation, and parallel scheduling to hardware constraints.
  • Empirical studies demonstrate significant improvements, such as memory cycles reduced to 30.6% of a baseline and up to 5.5× lower energy in modern DNN accelerators.

Fused-layer dataflow refers to architectural and compilation strategies in deep learning accelerators that group multiple adjacent computational layers—commonly convolutions, but increasingly including other operator families—into a single, unified execution schedule or macro-kernel. This approach eliminates redundant off-chip data transfers and maximizes on-chip reuse by retaining intermediate activations locally across the fused region, optimally scheduling tiling and data movement to minimize energy, total data traffic, and latency. Fused-layer dataflow draws increasing interest for both dense and sparse workloads, with critical applications in reconfigurable dataflow architectures (RDAs), near-bank processing-in-memory (PIM), and memory-rich spatial DNN accelerators. The fused-layer paradigm exposes a rich design space—especially when heterogeneity in layer dimension, memory hierarchy, and data sparsity is considered—requiring sophisticated optimization models and heuristics supported by both analytical modeling and empirical simulation (Lacouture et al., 6 Nov 2025, Yang et al., 11 Nov 2025, Gilbert et al., 20 Sep 2024, Yang et al., 2022, Shi et al., 14 Jun 2024, Kao et al., 2022).

1. Core Definition, Motivation, and Baseline Comparison

Traditional DNN accelerator dataflow schedules process each layer independently: each layer reads its input feature map (IFM) and weights from DRAM, computes, and writes its output feature map (OFM) back to DRAM. This incurs three DRAM transfers per layer and results in significant external bandwidth and energy overhead due to spilling intermediate activations. In contrast, fused-layer dataflow fuses two or more consecutive layers, retaining intermediates in on-chip SRAM so that only the external inputs and final outputs traverse DRAM. For an $m$-group fusion containing $n_p$ layers per group, the DRAM traffic per group is reduced to a single IFM read at the group's head, one weight read per included layer, and a single OFM write at the group's tail (Yang et al., 2022).
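
As a rough illustration of this accounting, the difference can be computed directly; the tensor sizes below are assumed for illustration and are not taken from the cited papers:

```python
# Sketch: DRAM traffic for layer-by-layer vs. fused execution of a three-layer chain.
# Byte counts per tensor are illustrative assumptions, not values from the cited papers.

# (IFM bytes, weight bytes, OFM bytes) for each consecutive layer.
layers = [
    (512 * 1024, 64 * 1024, 512 * 1024),
    (512 * 1024, 128 * 1024, 256 * 1024),
    (256 * 1024, 128 * 1024, 256 * 1024),
]

# Layer-by-layer: every layer reads its IFM and weights from DRAM and writes its OFM back.
layerwise = sum(ifm + w + ofm for ifm, w, ofm in layers)

# Fused group: one IFM read at the head, one weight read per layer, one OFM write at
# the tail; intermediate activations stay in on-chip SRAM.
fused = layers[0][0] + sum(w for _, w, _ in layers) + layers[-1][2]

print(f"layer-by-layer DRAM traffic: {layerwise / 1024:.0f} KiB")
print(f"fused-group DRAM traffic:    {fused / 1024:.0f} KiB")
```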

Key motivations include:

  • Eliminating redundant off-chip reads and writes of intermediate activations, which dominate DRAM traffic in layer-by-layer execution.
  • Reducing the energy and latency associated with external memory bandwidth, a primary bottleneck in memory-rich spatial accelerators and near-bank PIM.
  • Exposing scheduling freedom (tiling, retention versus recomputation, and parallelization) that can be matched to on-chip buffer capacity and other hardware constraints.

2. Formal Models and Optimization Objectives

Fused-layer dataflow design and scheduling is naturally cast as a discrete optimization problem, involving both partitioning and schedule selection. The generalized model as seen in FuseFlow and LoopTree is:

$$\min_{\{G_j\}} \sum_{j=1}^{p} C(G_j) \quad\text{where}\quad C(G_j) = \alpha\cdot \mathrm{Flops}(G_j) + \beta\cdot \mathrm{Bytes}(G_j)$$

$G_j$ is a group (possibly a fusion region), $\mathrm{Flops}(G_j)$ accounts for total arithmetic (including any recomputation due to not retaining some intermediates), and $\mathrm{Bytes}(G_j)$ for total DRAM traffic. The cost weights $\alpha$ and $\beta$ reflect the relative scarcity of compute and bandwidth (Lacouture et al., 6 Nov 2025). In CMDS, energy and latency are co-optimized, with cross-layer reshaping overheads explicitly modeled:

$$\min_{\{x,y\}} \sum_{l=1}^N \left[E^{\,\mathrm{comp}}_l(SU_l) + E^{\,\mathrm{mem}}_l(SU_l,y_l)\right] + \alpha\sum_{l=1}^N T_l(SU_l,y_l) + \sum_{l=1}^{N-1} C_{\mathrm{resh}}(l,l+1; SU_l, SU_{l+1}, y_l, y_{l+1})$$

(Shi et al., 14 Jun 2024). Off-chip bandwidth, buffer footprint, and computational cost are all explicit outputs of the mapping.
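
A minimal, non-authoritative sketch of the generalized grouping objective (not the actual FuseFlow or CMDS optimizer) is shown below; the cost estimators, layer values, and the α/β weights are placeholders, and the search simply enumerates contiguous partitions of a short layer chain:

```python
# Sketch: exhaustive search over contiguous layer groupings minimizing
# C(G) = alpha * Flops(G) + beta * Bytes(G). Cost estimators are placeholders.
from itertools import combinations

ALPHA, BETA = 1e-9, 1e-6  # assumed relative costs of a FLOP and a DRAM byte

def group_cost(group):
    """group: list of (flops, ifm_bytes, weight_bytes, ofm_bytes) tuples."""
    flops = sum(l[0] for l in group)
    # Fused group: head IFM + all weights + tail OFM cross DRAM.
    dram = group[0][1] + sum(l[2] for l in group) + group[-1][3]
    return ALPHA * flops + BETA * dram

def best_partition(layers):
    n = len(layers)
    best = (float("inf"), None)
    # Choose cut points between layers; each choice defines contiguous groups.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            groups = [layers[a:b] for a, b in zip(bounds, bounds[1:])]
            cost = sum(group_cost(g) for g in groups)
            if cost < best[0]:
                best = (cost, groups)
    return best

layers = [
    # (flops, ifm_bytes, weight_bytes, ofm_bytes) -- illustrative values only
    (2e8, 512e3, 64e3, 512e3),
    (4e8, 512e3, 128e3, 256e3),
    (2e8, 256e3, 128e3, 256e3),
]
cost, groups = best_partition(layers)
print(f"best cost = {cost:.3f}, group sizes = {[len(g) for g in groups]}")
```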

LoopTree introduces a generalized taxonomy and analytical model. For each buffer level bb:

$$C_b = \max_i \sum_{T\in \mathrm{Retain}(b)} \operatorname{SizeOf}(\mathrm{Tile}_b^T)$$

$$V_{\mathrm{off}} = \sum_{T} \#\mathrm{fetches}_T \times \operatorname{SizeOf}(\mathrm{Tile}^T) + \sum_{I} \#\mathrm{recomputes}_I \times \operatorname{SizeOf}(\text{needed inputs})$$

Latency is:

$$L_{\mathrm{seq}} = \max\left(L_{\mathrm{comp}}, \max_b L_{\mathrm{mem}}(b)\right),\qquad L_{\mathrm{comp}} = \sum_{\mathrm{tiles}} \frac{\mathrm{Ops}(\mathrm{tile})}{P_{\mathrm{peak}}}$$

(Gilbert et al., 20 Sep 2024). Performance is dominated by the buffer–compute–memory trade-off surfaces.
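
The expressions above can be rendered as a coarse analytical model; the sketch below assumes a single memory level and a simplified tile descriptor, and is far less detailed than LoopTree's actual model:

```python
# Sketch: coarse analytical estimates in the spirit of LoopTree's model.
# The Tile descriptor and the single memory level are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Tile:
    size_bytes: int   # footprint of one tile of this tensor
    fetches: int      # how many times the tile is (re)fetched from off-chip
    ops: int          # arithmetic operations associated with the tile

def buffer_capacity(retained_tiles):
    """C_b: sum of tile footprints retained at a buffer level (worst case)."""
    return sum(t.size_bytes for t in retained_tiles)

def offchip_volume(fetched_tiles, recompute_bytes=0):
    """V_off: fetched bytes plus any inputs re-read to support recomputation."""
    return sum(t.fetches * t.size_bytes for t in fetched_tiles) + recompute_bytes

def latency(tiles, peak_ops_per_s, mem_bw_bytes_per_s):
    """L_seq = max(compute-bound latency, memory-bound latency), one memory level."""
    l_comp = sum(t.ops for t in tiles) / peak_ops_per_s
    l_mem = offchip_volume(tiles) / mem_bw_bytes_per_s
    return max(l_comp, l_mem)
```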

3. Compilation, Scheduling, and Dataflow Graph Models

Code generation for fused-layer dataflow departs from single-layer, loop-nest-centric IRs in favor of graph-based and tokenized models:

  • In FuseFlow, the SAMML intermediate representation expresses computation as a DAG of stream operators connected by typed FIFOs carrying coordinate, reference, or value tokens; operator types include LevelScanner, Intersect, Repeater, ALU, Reducer, and data writers. Fused subgraphs allow sharing of computation nodes (e.g., merged Intersect and ALU nodes), with partial order graphs (POG) enforcing consistent iteration among index variables (Lacouture et al., 6 Nov 2025).
  • For dense CNNs/MLPs, mappings are represented as sequences of tiling, permutation, and fusion tokens parameterized by hardware constraints (buffer size, PE array geometry). DNNFuser encodes such sequences as inputs to a Transformer, supporting loop order, retention/recompute, and fusion boundaries, along with constraint-masked legality checks (Kao et al., 2022); a toy token encoding is sketched after this list.
  • CMDS and LoopTree further expand the mapping to include cross-layer spatial/temporal unrolling, memory bank layouts, and reshaping steps, all modeled explicitly in the mapping and schedule (Shi et al., 14 Jun 2024, Gilbert et al., 20 Sep 2024).
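
To make the token-sequence view concrete, the sketch below encodes a small fused mapping as tokens with a buffer-capacity legality mask; the token names and the `BUFFER_BYTES` budget are invented for illustration and do not reproduce DNNFuser's actual vocabulary:

```python
# Sketch: a toy "mapping as token sequence" encoding with a legality check.
# Token names and the buffer budget are illustrative, not DNNFuser's vocabulary.

BUFFER_BYTES = 256 * 1024  # assumed on-chip buffer budget

# One token per decision: tile sizes, loop order, retain-vs-recompute, fusion cut.
mapping = [
    ("TILE_H", 16), ("TILE_W", 16), ("TILE_C", 32),
    ("ORDER", "HWC"),
    ("RETAIN", "layer0_ofm"),        # keep layer-0 output on-chip for layer 1
    ("FUSE_BOUNDARY", "after_layer1"),
]

def footprint(tokens):
    """Crude footprint estimate: tile volume times retained tensors, 2 B/element assumed."""
    dims = {k: v for k, v in tokens if k.startswith("TILE_")}
    retained = sum(1 for k, _ in tokens if k == "RETAIN")
    tile_elems = 1
    for v in dims.values():
        tile_elems *= v
    return tile_elems * max(retained, 1) * 2

def is_legal(tokens):
    """Mask out mappings whose retained tiles exceed the buffer budget."""
    return footprint(tokens) <= BUFFER_BYTES

print(is_legal(mapping))
```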

Parallelization and scheduling flexibility are key: FuseFlow can map any legal POG ordering, supports explicit user schedules and fusion boundaries (Fuse regions), and can parallelize along specified dimensions by duplicating subgraphs and stream partitions (Lacouture et al., 6 Nov 2025). LoopTree distinguishes between data retention, recomputation, and per-tensor policies, enabling fine-grained buffer and recompute allocation (Gilbert et al., 20 Sep 2024).

4. Data Movement, Memory Hierarchy, and Resource Allocation

The principal efficiency gains of fused-layer dataflow arise from optimized data movement and buffer usage.

  • In near-bank DRAM-PIM, PIMfused avoids cross-bank shuffles by assigning spatial output tiles to banks and streaming fused multi-layer kernels on each tile, only invoking expensive cross-bank reorganization at fusion region boundaries (Yang et al., 11 Nov 2025).
  • LoopTree shows that tiling along selected tensor ranks allows intermediate output tiles to reside on-chip just long enough for immediate consumption, minimizing on-chip SRAM footprint; per-layer or per-tensor recomputation is used strategically when buffer capacity is insufficient (Gilbert et al., 20 Sep 2024).
  • CMDS exploits multi-bank memory layouts to efficiently reshape between producer and consumer layers, bypassing the need for large reshuffle buffers by using bank-level multiplexing for slice/access realignment, thus keeping additional area and energy overhead below 3% (Shi et al., 14 Jun 2024).

Retention/recompute points, tile shapes, and loop processing schedule are the main axes by which one achieves Pareto-optimal trade-offs. LoopTree quantifies the impact: tile schedule choices can modulate buffer demand by up to 10× for the same off-chip traffic, and permitting per-fmap (rather than uniform) recompute policies further reduces SRAM demand (Gilbert et al., 20 Sep 2024).
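
The following sketch illustrates how a row-strip tile schedule and a retain-versus-recompute choice shift buffer demand and compute for two fused 3×3 convolutions; the shapes follow the LoopTree example cited below (H = W = 128, C = M = 64), but the accounting itself is a deliberate simplification rather than LoopTree's exact model:

```python
# Sketch: tiling and recompute-vs-retain trade-offs for two fused 3x3 convolutions.
# Shapes mirror the LoopTree example (H = W = 128, C = M = 64); bytes/activation,
# the row-strip schedule, and the MAC accounting are simplifying assumptions.

H = W = 128
C = 64               # channels of the intermediate feature map
BYTES = 1            # assumed 1 byte per activation (e.g., int8)
K = 3                # 3x3 kernel, 'same' padding
P2 = 8               # layer-2 output rows produced per fused tile

rows_needed = P2 + K - 1                    # intermediate rows one tile consumes

full_fmap = H * W * C * BYTES               # layerwise: whole intermediate buffered
strip = rows_needed * W * C * BYTES         # fused: only a row strip buffered
overlap = (K - 1) * W * C * BYTES           # rows shared by consecutive tiles

# Retaining the overlap keeps `overlap` bytes live between tiles; recomputing it
# frees those bytes but re-runs layer 1 on K-1 rows for every tile after the first.
tiles = H // P2
macs_per_row = W * C * (K * K * C)          # layer-1 MACs per intermediate row (C_in = C)
recompute_macs = (tiles - 1) * (K - 1) * macs_per_row

print(f"layerwise intermediate buffer:  {full_fmap / 1024:.0f} KiB")
print(f"fused row-strip buffer:         {strip / 1024:.0f} KiB")
print(f"overlap retained between tiles: {overlap / 1024:.0f} KiB")
print(f"extra MACs if overlap is recomputed: {recompute_macs / 1e6:.1f} M")
```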

5. Performance Metrics, Quantitative Results, and Case Studies

Fused-layer dataflow yields substantial, quantifiable improvements across key hardware metrics:

  • On ResNet-18 (PIMfused, 4-bank), memory cycles are reduced to 30.6% of a GDDR6-AiM-like baseline, energy to 83.4%, and chip area to 76.5% (Yang et al., 11 Nov 2025).
  • For VGG-16 (pre-RTL HW evaluator), fusing at every pooling boundary achieves 55.6% memory bandwidth reduction, a 36.7% latency improvement, and a 49.2% drop in total energy over naive layer-by-layer (Yang et al., 2022).
  • FuseFlow reports speedups from 1.01× up to 3.9× for sparse models (including autoencoder, GCN, GraphSAGE), and a 2.7× latency improvement for GPT-3 with BigBird block-sparse attention (Lacouture et al., 6 Nov 2025).

Validation of the analytical models against simulators or hardware prototypes shows typical errors in the 1–4% range for latency, energy, and buffer estimates (Gilbert et al., 20 Sep 2024, Lacouture et al., 6 Nov 2025). The choice of tiling schedule, fusion granularity, and recompute policy is highly workload-dependent; there is no single universally optimal configuration.

Quantitative example from LoopTree (conv+conv, H=W=128, C=M=64):

| Schedule | SRAM required | Off-chip volume (MB) |
|---|---|---|
| Layerwise | 4.2 MB | 1.00 |
| [P₂] | 0.42 MB | 1.00 |
| [C₂] | 0.58 MB | 1.00 |
| [P₂, C₂] | 0.39 MB | 1.00 |

(Gilbert et al., 20 Sep 2024)

DNNFuser further demonstrates that a Transformer-based learned mapper can replicate fused-layer mapping quality at 90–95% of optimal, with $10^3$–$10^4\times$ speedups in mapping time relative to search-based baselines (Kao et al., 2022).

6. Implementation Trade-Offs and Hardware Constraints

Fused-layer dataflow effectiveness depends on several architectural and system-level constraints:

  • Wider fusion regions maximize potential traffic/energy savings, but require sufficient on-chip memory for activations and weights. There are diminishing returns: once DRAM accesses for weights dominate, further fusion is less impactful (Yang et al., 2022); see the traffic sketch after this list.
  • Control and scheduling complexity rises with fusion depth. FSMs must manage multi-layer tiling, overlapping prefetches, buffer ping-ponging, and synchronization across PEs/banks (Yang et al., 2022, Yang et al., 11 Nov 2025).
  • Cross-layer data layout compatibility is critical; hardware structures (multi-bank SRAMs, port multiplexing) may be necessary to allow efficient reshaping and ensure PE/NoC utilization (Shi et al., 14 Jun 2024).
  • Allowing recomputation as a buffer–compute trade-off can yield up to a 2× reduction in buffer requirements for a small increase in compute, especially in low-buffer regimes (Gilbert et al., 20 Sep 2024).
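
The diminishing-returns effect noted in the first bullet can be seen with a back-of-the-envelope traffic model; the activation and weight sizes below are invented for illustration:

```python
# Sketch: diminishing returns of deeper fusion. As more layers are fused, the
# activation traffic saved per added layer shrinks, while weight traffic (one
# DRAM read per layer regardless of fusion) starts to dominate.
# Tensor sizes are invented for illustration.

acts = [512, 512, 256, 256, 128, 128, 128]   # KiB: activations at layer boundaries
wts = [64, 128, 128, 256, 256, 512]          # KiB: weights of layers 1..6

def dram_traffic(fuse_depth):
    """Total DRAM KiB when the 6-layer chain is split into groups of `fuse_depth`."""
    traffic = 0
    for start in range(0, len(wts), fuse_depth):
        end = min(start + fuse_depth, len(wts))
        traffic += acts[start]                 # group-head IFM read
        traffic += sum(wts[start:end])         # one weight read per fused layer
        traffic += acts[end]                   # group-tail OFM write
    return traffic

for depth in (1, 2, 3, 6):
    print(f"fuse depth {depth}: {dram_traffic(depth)} KiB of DRAM traffic")
```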

PIMfused experiments show that a modest local buffer (LBUF, 128–256 B) per PIMcore captures most activation reuse; global buffer (GBUF) sizing is more critical for fused weight broadcast (Yang et al., 11 Nov 2025). CMDS achieves energy reductions up to 5.5× over naive per-layer memory-unaware scheduling, with negligible area impact from its bank-multiplexing reshuffle technique (Shi et al., 14 Jun 2024).

7. General Design Guidelines and Future Directions

Repeated across multiple works are robust design principles for realizing efficient fused-layer dataflow architectures:

  1. Express data movement and fusion control as explicit user-schedule primitives, not “fully automatic” heuristics (Lacouture et al., 6 Nov 2025, Gilbert et al., 20 Sep 2024).
  2. Maintain global partial-order graphs for index and data layout dependencies to steer fused iteration legality (Lacouture et al., 6 Nov 2025).
  3. Support flexible fusion boundaries, per-tensor retention/recompute, and independent tiling for workload-specific Pareto optimization (Gilbert et al., 20 Sep 2024, Kao et al., 2022).
  4. Thoroughly model on-chip buffer hierarchy, memory bandwidth/utilization, and cross-layer reshaping with respect to actual hardware structure (multi-bank, ported SRAM) (Shi et al., 14 Jun 2024, Yang et al., 11 Nov 2025).
  5. Prune suboptimal configurations early using analytical cost models of compute (Flops), memory traffic (Bytes), and latency (Lacouture et al., 6 Nov 2025, Gilbert et al., 20 Sep 2024).
  6. Validate mapping strategies using cycle-accurate simulators and, where feasible, real hardware (Lacouture et al., 6 Nov 2025).

Fused-layer dataflow has become a primary lever for scalable, energy-efficient DNN accelerator design, and remains central for sparse (Transformer/LLM) and dense (CNN) workloads alike. Its adoption continues to expand as model size, hardware heterogeneity, and memory system complexity increase.


References

  • FuseFlow: "FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow" (Lacouture et al., 6 Nov 2025)
  • PIMfused: "PIMfused: Near-Bank DRAM-PIM with Fused-layer Dataflow for CNN Data Transfer Optimization" (Yang et al., 11 Nov 2025)
  • LoopTree: "LoopTree: Exploring the Fused-layer Dataflow Accelerator Design Space" (Gilbert et al., 20 Sep 2024)
  • Pre-RTL DNN Evaluator: "Pre-RTL DNN Hardware Evaluator With Fused Layer Support" (Yang et al., 2022)
  • CMDS: "CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories" (Shi et al., 14 Jun 2024)
  • DNNFuser: "DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators" (Kao et al., 2022)