Cascaded-Chunk Feed Forward Network (CCFFN)
- The paper introduces CCFFN, a novel architecture that replaces a monolithic FFN with a cascade of chunkwise mini-FFNs to significantly reduce parameter count and computational cost.
- CCFFN partitions the channel dimension into chunks and employs cascaded residual connections, enabling lightweight, sequential feature refinement while preserving representational capacity.
- Empirical results on ImageNet-1K demonstrate that CCFFN lowers FLOPs, energy consumption, and memory usage, making it particularly suitable for deployment on mobile and edge devices.
A Cascaded-Chunk Feed Forward Network (CCFFN) is a feedforward subnetwork architecture introduced as a core component of the Cascaded-ViT (CViT) family of efficient vision transformers. CCFFN replaces the conventional monolithic feedforward network (FFN) block in transformers with a sequential cascade of smaller, chunkwise feedforward units. By partitioning the channel dimension, applying lightweight mini-FFNs to each chunk, and cascading information between chunks via residual connections, CCFFN achieves significant savings in parameter count, computational cost, GPU memory, and energy consumption, while maintaining comparable representational capacity and accuracy. This architectural innovation enables deployment of vision transformers on resource-constrained platforms such as mobile phones and drones, as demonstrated by empirical results on ImageNet-1K and metrics quantifying energy and computational efficiency (Sivakumar et al., 18 Nov 2025).
1. Architectural Foundation
The CCFFN fundamentally restructures the standard FFN found in transformers. In a vanilla FFN, the input $X \in \mathbb{R}^{N \times D}$ (with $N$ tokens and model dimension $D$) passes sequentially through two learned linear projections, sandwiching a ReLU non-linearity and utilizing a channel-wise expansion factor $r$ (i.e., a hidden width of $rD$). This results in a single, wide feedforward block with no explicit internal structure or intra-block dependencies.
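For reference, a minimal PyTorch sketch of this vanilla FFN structure (two projections around a ReLU); the class name and default expansion factor are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn

class VanillaFFN(nn.Module):
    """Standard transformer FFN: expand channels by r, apply ReLU, project back to D."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, expansion * dim)  # D -> r*D
        self.fc2 = nn.Linear(expansion * dim, dim)  # r*D -> D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, D); a single wide block with no internal chunk structure
        return self.fc2(torch.relu(self.fc1(x)))
```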
In contrast, CCFFN divides the channel axis of $X$ into $C$ equally sized chunks, each of width $D_c = D/C$. Each chunk is routed through a dedicated, smaller feedforward network. These mini-FFNs are iteratively cascaded: the output of each chunk influences the input to the next via a residual connection. The first chunk is processed directly; each subsequent chunk is processed after receiving the residual addition of the previous chunk's FFN output:
- For $i = 1$: $Z_1 = \mathrm{FFN}_1(X_1)$
- For $i = 2, \ldots, C$: $Z_i = \mathrm{FFN}_i(X_i + Z_{i-1})$
Finally, all outputs are concatenated along the channel dimension to reconstruct the output of the overall CCFFN.
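The following PyTorch sketch implements the chunk-split, cascaded-residual, concatenate pattern described above, assuming each mini-FFN has the same two-projection ReLU form as the vanilla FFN; the chunk count, expansion factor, and all names are illustrative:

```python
import torch
import torch.nn as nn

class CCFFN(nn.Module):
    """Cascaded-Chunk FFN sketch: split the channel axis into C chunks, apply a
    mini-FFN per chunk, and feed each mini-FFN output into the next chunk's
    input through a residual addition before re-concatenating."""
    def __init__(self, dim: int, num_chunks: int = 2, expansion: int = 4):
        super().__init__()
        assert dim % num_chunks == 0, "dim must be divisible by the chunk count"
        self.num_chunks = num_chunks
        d_c = dim // num_chunks  # chunk width D_c = D / C
        self.mini_ffns = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_c, expansion * d_c),
                nn.ReLU(),
                nn.Linear(expansion * d_c, d_c),
            )
            for _ in range(num_chunks)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, D)
        chunks = torch.chunk(x, self.num_chunks, dim=-1)  # C tensors of width D_c
        outputs, prev = [], None
        for x_i, ffn_i in zip(chunks, self.mini_ffns):
            inp = x_i if prev is None else x_i + prev     # cascaded residual input
            prev = ffn_i(inp)                             # Z_i
            outputs.append(prev)
        return torch.cat(outputs, dim=-1)                 # concatenate back to width D
```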
2. Mathematical Formulation
Let $X \in \mathbb{R}^{N \times D}$ be the CCFFN input and $C$ the number of chunks; define the chunk width $D_c = D/C$. The CCFFN proceeds through the following stages:
- Chunk Splitting: $X = [\,X_1, X_2, \ldots, X_C\,]$ with $X_i \in \mathbb{R}^{N \times D_c}$.
- Cascaded Inputs: $\tilde{X}_1 = X_1$ and $\tilde{X}_i = X_i + Z_{i-1}$ for $i = 2, \ldots, C$.
- Mini-FFN per Chunk: $Z_i = \mathrm{FFN}_i(\tilde{X}_i) = \mathrm{ReLU}\!\big(\tilde{X}_i W_1^{(i)}\big) W_2^{(i)}$ with $W_1^{(i)} \in \mathbb{R}^{D_c \times rD_c}$, $W_2^{(i)} \in \mathbb{R}^{rD_c \times D_c}$ (biases omitted).
- Recombination: $Y = [\,Z_1, Z_2, \ldots, Z_C\,]$, concatenated along the channel dimension.
This cascade enables later chunks to condition on intermediate transformed features from prior chunks, while exploiting the parameter and compute sharing enabled by chunking.
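A brief usage check of the CCFFN sketch from Section 1 (shapes and hyperparameters chosen arbitrarily for illustration), confirming that splitting, cascading, and recombination preserve the input shape:

```python
x = torch.randn(196, 256)                          # N = 196 tokens, model dimension D = 256
ccffn = CCFFN(dim=256, num_chunks=2, expansion=2)
y = ccffn(x)
print(y.shape)                                     # torch.Size([196, 256]): width restored by concatenation
```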
3. Computational Complexity and Efficiency
The efficiency gains of CCFFN are established by explicit parameter and FLOP analyses relative to the vanilla transformer FFN:
- Vanilla FFN:
  - Parameters: $2rD^2$ (biases omitted)
  - FLOPs per token: $\approx 4rD^2$ (counting multiplies and adds)
- CCFFN ($C$ chunks):
  - Parameters: $C \cdot 2rD_c^2 = 2rD^2/C$
  - FLOPs per token: $\approx 4rD^2/C$
Thus, both parameters and FLOPs are theoretically reduced by a factor of $1/C$. In CViT, $C = 2$ is typical, yielding a 50% reduction in per-block complexity. Empirically, the realized net parameter saving (compared to EfficientViT) is approximately 20%, due to the use of a reduced expansion ratio in baseline models. The reduction in FLOPs is around 15% at the network scale, verified on ImageNet-1K [(Sivakumar et al., 18 Nov 2025), Table 2]. Memory consumption—both live and reserved—also decreases due to the reduced width of intermediate representations.
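A back-of-the-envelope check of the $1/C$ scaling under the parameter formulas above (biases ignored; dimensions chosen for illustration, not drawn from the paper):

```python
def ffn_params(d: int, r: int) -> int:
    """Vanilla FFN: weight matrices of shape (D, r*D) and (r*D, D)."""
    return 2 * r * d * d

def ccffn_params(d: int, r: int, c: int) -> int:
    """C mini-FFNs, each acting on a chunk of width D_c = D / C."""
    d_c = d // c
    return c * 2 * r * d_c * d_c  # = 2*r*D^2 / C

d, r, c = 256, 4, 2
print(ffn_params(d, r))       # 524288
print(ccffn_params(d, r, c))  # 262144 -> exactly 1/C of the vanilla count; FLOPs scale the same way
```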
4. Empirical Performance and APF Metric
CCFFN's practical impact is supported by benchmark studies on ImageNet-1K. CViT models employing CCFFN consistently achieve strong top-1 accuracy while reducing computational and energy requirements:
- CViT-M vs EfficientViT-M2: Top-1: 69.9% vs 70.8%, FLOPs: 173M vs 201M, Params: 3.5M vs 4.2M, Energy: 568mJ vs 581mJ.
- CViT-XL vs EfficientViT-M5: Top-1: 75.5% vs 77.1%, FLOPs: 435M vs 522M, Energy: 653mJ vs 675mJ.
- Average latency (iPhone 15 Pro): 6–13% reduction over EfficientViT across model sizes.
To quantify compute efficiency relative to accuracy, the paper introduces the Accuracy-Per-FLOP (APF) metric, defined as the ratio of top-1 accuracy to inference FLOPs: $\mathrm{APF} = \text{Top-1 accuracy} / \text{FLOPs}$.
Across the CViT family, CCFFN delivers the highest or comparable APF within its model category, outperforming EfficientViT and EfficientFormerV2 in several cases (e.g., CViT-XL: 28.6 vs EfficientViT-M5: 28.4) [(Sivakumar et al., 18 Nov 2025), Table 6].
5. Suitability for Edge Deployment
CCFFN's architectural sparsity and low memory profile make it particularly suitable for edge and mobile settings. Empirically, both live and reserved GPU memory footprints are reduced due to the narrower intermediate activations in the chunked mini-FFNs. On-device benchmarks show that CViT equipped with CCFFN consumes on average 5% less energy per image (as reported on MacBook Pro M4 Pro), with end-to-end inference latencies 6–13% lower on mobile-class hardware relative to EfficientViT.
The cascaded residual connections between mini-FFNs allow hierarchical refinement of features, facilitating competitive representational capacity with substantially reduced parameter burden—a property advantageous for bandwidth- and energy-constrained inference.
6. Summary and Significance
The CCFFN module implements a strategy wherein each monolithic FFN is replaced by smaller cascade-connected FFNs. Key differentiators, grounded in the presented empirical evidence, are summarized in the table below:
| Property | Vanilla FFN | CCFFN ($C$ chunks) |
|---|---|---|
| Parameters/block | $2rD^2$ | $2rD^2/C$ |
| FLOPs/block (per token) | $\approx 4rD^2$ | $\approx 4rD^2/C$ |
| Empirical net param savings | — | ~20% (over EfficientViT) |
| Energy per image | baseline | 3–5% lower |
| Mobile latency | baseline | 6–13% reduction |
CCFFN thus delivers a quantifiable, systematic improvement in compute- and memory-efficiency, validated on large-scale benchmarks and edge devices. The design preserves or enhances accuracy-per-resource by structuring intra-block computation as a cascade of lightweight, interactively coupled transformations (Sivakumar et al., 18 Nov 2025).