
Cascaded-ViT: Lightweight Vision Transformers

Updated 25 November 2025
  • Cascaded-ViT is a family of lightweight, compute-efficient vision transformers employing cascaded feedforward and group attention modules to enhance feature representation and reduce computational cost.
  • Its Cascaded-Chunk Feed Forward Network (CCFFN) and Cascaded Group Attention (CGA) enable progressive refinement of features while cutting down FLOPs and energy, optimizing performance on edge devices.
  • Empirical results demonstrate that CViT variants achieve competitive ImageNet accuracies with lower parameters and energy consumption, making them ideal for deployment in resource-constrained environments.

Cascaded Vision Transformer (CViT) and related Cascaded-ViT variants constitute a family of lightweight, compute-efficient transformer models designed for high-throughput vision tasks on resource-constrained hardware. These architectures are characterized by innovations in hierarchical feedforward design, group attention mechanisms, and conditional inference routing, collectively improving parameter efficiency, computational footprint, and energy consumption while preserving accuracy. Two primary instantiations—the CCFFN/CGA-based CViT (Sivakumar et al., 18 Nov 2025) and the coarse-to-fine inference CF-ViT (Chen et al., 2022)—demonstrate the breadth of the cascaded paradigm in contemporary vision transformers.

1. Architectural Innovations in CViT

CViT, introduced in "CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer" (Sivakumar et al., 18 Nov 2025), builds on EfficientViT by replacing the conventional position-wise FFN with the Cascaded-Chunk Feed Forward Network (CCFFN) while retaining a cascaded group attention (CGA) module. The architecture is offered in four variants—S, M, L, XL—distinguished by depth per stage, embedding dimensions, head allocation, and overall model scale. Each stage comprises a fixed number of transformer blocks, with feature dimensions and representational capacity expanding progressively across stages.

A summary of the CViT variant specifications is as follows:

| Model   | Depth per Stage | Embedding Dims  | Attention Heads | Parameters (M) | FLOPs (M) |
|---------|-----------------|-----------------|-----------------|----------------|-----------|
| CViT-S  | [1, 2, 3]       | [64, 128, 192]  | [2, 3, 3]       | 1.9            | 67        |
| CViT-M  | [1, 2, 3]       | [128, 192, 224] | [4, 3, 2]       | 3.5            | 173       |
| CViT-L  | [1, 2, 3]       | [128, 256, 384] | [4, 4, 4]       | 7.0            | 249       |
| CViT-XL | [1, 3, 4]       | [192, 288, 384] | [3, 3, 4]       | 9.8            | 435       |

This parameterization allows for adaptive deployment scenarios, from mobile-class inference (CViT-S) to higher-accuracy, moderate-resource tasks (CViT-XL).
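For quick reference, the same specifications can be captured in a small configuration structure, as in the sketch below. This is purely illustrative; the field names (depths, embed_dims, num_heads, and so on) are assumptions for this summary, not identifiers from the authors' code.

```python
# Illustrative configuration for the CViT variants tabulated above.
# Field names are assumed for this sketch and not taken from any official release.
CVIT_VARIANTS = {
    "CViT-S":  dict(depths=[1, 2, 3], embed_dims=[64, 128, 192],  num_heads=[2, 3, 3], params_m=1.9, flops_m=67),
    "CViT-M":  dict(depths=[1, 2, 3], embed_dims=[128, 192, 224], num_heads=[4, 3, 2], params_m=3.5, flops_m=173),
    "CViT-L":  dict(depths=[1, 2, 3], embed_dims=[128, 256, 384], num_heads=[4, 4, 4], params_m=7.0, flops_m=249),
    "CViT-XL": dict(depths=[1, 3, 4], embed_dims=[192, 288, 384], num_heads=[3, 3, 4], params_m=9.8, flops_m=435),
}
```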

2. Cascaded-Chunk Feed Forward Network (CCFFN)

CCFFN implements a hierarchical, chunk-based refinement strategy for token representations within each transformer block. The canonical FFN is recast in a key-value formulation, $\mathrm{FFN}(X) = f(X K^T)\,V$, where $X \in \mathbb{R}^{T \times d}$ and $f(\cdot)$ denotes the ReLU activation. CCFFN partitions $X$ into $n$ contiguous channel chunks $\{X_i\}$:

  1. Chunk Splitting: $X_1, X_2, \ldots, X_n = \mathrm{Split}(X)$
  2. Cascaded Input: $X'_1 = X_1$; $X'_i = X_i + Y_{i-1}$ for $i > 1$
  3. Chunkwise Processing: $Y_i = \mathrm{FFN}_i(X'_i)$
  4. Output Aggregation: Concatenate $\{Y_1, \ldots, Y_n\}$ along the channel dimension

CCFFN halves the typical FFN expansion ratio (2× vs 4×) and splits computation into nn smaller, sequential FFNs. This yields parameter and FLOP reductions of approximately 20% and 15% relative to EfficientViT FFNs. The cascading connection enables progressive feature refinement, thereby enhancing depth and capacity without additional parameter cost (Sivakumar et al., 18 Nov 2025).
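A minimal PyTorch sketch of the chunk-splitting and cascading steps above follows. It assumes a 2× expansion per chunk and ReLU activation as described; the class and argument names are illustrative, and details such as normalization or the exact key-value parameterization may differ from the paper's implementation.

```python
import torch
import torch.nn as nn

class CascadedChunkFFN(nn.Module):
    """Sketch of a Cascaded-Chunk Feed Forward Network (CCFFN).

    The channel dimension is split into `n_chunks` pieces; each chunk has its
    own small FFN, and the output of chunk i-1 is added to the input of chunk i
    before processing (the cascade). Outputs are concatenated along channels.
    Names and hyperparameters are illustrative, not the authors' code.
    """
    def __init__(self, dim: int, n_chunks: int = 2, expansion: float = 2.0):
        super().__init__()
        assert dim % n_chunks == 0, "dim must be divisible by n_chunks"
        self.n_chunks = n_chunks
        chunk_dim = dim // n_chunks
        hidden = int(chunk_dim * expansion)
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(chunk_dim, hidden), nn.ReLU(), nn.Linear(hidden, chunk_dim))
            for _ in range(n_chunks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> split along the channel axis into equal chunks
        chunks = x.chunk(self.n_chunks, dim=-1)
        outputs, prev = [], None
        for chunk, ffn in zip(chunks, self.ffns):
            inp = chunk if prev is None else chunk + prev  # cascaded input
            prev = ffn(inp)                                # chunkwise FFN
            outputs.append(prev)
        return torch.cat(outputs, dim=-1)                  # aggregate along channels
```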

3. Cascaded Group Attention (CGA)

CGA is a cost-reduction strategy within the self-attention module, partitioning token features across $G$ groups along the channel axis. For each group $g$:

  • Compute queries, keys, and values: $Q^{(g)} = X^{(g)} W_Q^{(g)}$, $K^{(g)} = X^{(g)} W_K^{(g)}$, $V^{(g)} = X^{(g)} W_V^{(g)}$
  • Scaled Dot-Product Attention: $H^{(g)} = \mathrm{Softmax}\!\left(\dfrac{Q^{(g)} (K^{(g)})^T}{\sqrt{d_g}}\right) V^{(g)}$
  • Output Cascading: $O^{(g)} = \sum_{i=1}^{g} H^{(i)}$

Output representations from all groups are concatenated, ensuring that each group accumulates attention from its predecessors, which increases representational expressivity with minimal computational overhead. This grouped, cascaded aggregation generalizes vanilla multi-head self-attention while offering parameter and compute reductions relevant to edge deployment scenarios (Sivakumar et al., 18 Nov 2025).
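The sketch below reflects one reading of the equations above: per-group Q/K/V projections, scaled dot-product attention, and a cumulative sum of group outputs before concatenation. The final output projection and all names are assumptions rather than the reference implementation, and the code requires PyTorch 2.x for the fused attention call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedGroupAttention(nn.Module):
    """Simplified sketch of Cascaded Group Attention (CGA).

    Channels are split into `n_groups`; each group has its own Q/K/V
    projections, and group g's output accumulates the attention outputs of
    groups 1..g before concatenation. Illustrative only.
    """
    def __init__(self, dim: int, n_groups: int = 4):
        super().__init__()
        assert dim % n_groups == 0, "dim must be divisible by n_groups"
        self.n_groups = n_groups
        gd = dim // n_groups
        self.qkv = nn.ModuleList(nn.Linear(gd, 3 * gd) for _ in range(n_groups))
        self.proj = nn.Linear(dim, dim)  # output projection (an assumption here)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        groups = x.chunk(self.n_groups, dim=-1)
        outputs, running = [], 0.0
        for x_g, qkv in zip(groups, self.qkv):
            q, k, v = qkv(x_g).chunk(3, dim=-1)          # per-group Q, K, V
            h = F.scaled_dot_product_attention(q, k, v)  # softmax(QK^T / sqrt(d_g)) V
            running = running + h                        # cascade: sum over groups 1..g
            outputs.append(running)
        return self.proj(torch.cat(outputs, dim=-1))     # concatenate group outputs
```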

4. Compute Efficiency: The APF Metric and Empirical Results

The Accuracy-Per-FLOP (APF) metric quantifies model efficiency using:

$$\mathrm{APF} = \frac{\text{Top-1 Accuracy}\ (\%)}{\log_{10}(\text{FLOPs in MFLOP})}$$

The logarithmic denominator reflects diminishing returns in accuracy gains for additional computation.
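As a concrete illustration, the snippet below evaluates APF on the accuracy and FLOP figures reported later in this section; the helper name is arbitrary.

```python
import math

def apf(top1_acc_pct: float, flops_mflop: float) -> float:
    """Accuracy-Per-FLOP: Top-1 accuracy (%) divided by log10(FLOPs in MFLOP)."""
    return top1_acc_pct / math.log10(flops_mflop)

# Figures reported below (Top-1 %, MFLOPs):
for name, acc, flops in [("CViT-S", 62.0, 67), ("CViT-M", 69.9, 173),
                         ("CViT-L", 73.0, 249), ("CViT-XL", 75.5, 435)]:
    print(f"{name}: APF = {apf(acc, flops):.1f}")
# e.g. CViT-S: 62.0 / log10(67) ≈ 34
```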

On ImageNet-1K, CViT models consistently demonstrate reduced FLOPs, memory, and energy consumption relative to EfficientViT at similar accuracy levels. Notable empirical results include:

  • CViT-S: 62.0% Top-1, 67M FLOPs, 1.9M params, 471mJ/image
  • CViT-M: 69.9%, 173M, 3.5M, 568mJ/image
  • CViT-L: 73.0%, 249M, 7.0M, 588mJ/image
  • CViT-XL: 75.5%, 435M, 9.8M, 653mJ/image

Comparisons indicate energy savings of 3–5% and average parameter reductions of 5–17% relative to equivalent EfficientViT baselines, with negligible accuracy penalty. For instance, CViT-L achieves 1.3% lower accuracy than EfficientViT-M4 while consuming 17% fewer FLOPs and 15% less GPU memory (Sivakumar et al., 18 Nov 2025).

5. Alternate Paradigm: Coarse-to-Fine (Cascaded) Inference

The cascading concept also underpins the CF-ViT architecture (Chen et al., 2022), which employs a two-stage, confidence-gated inference pipeline:

  • Coarse Stage: The image is split into a small sequence of large patches. A transformer encoder estimates the class prediction.
  • Fine Stage: Triggered when the coarse-stage confidence $p^c_{\max}$ falls below a threshold $\eta$. Informative patches (selected via a global class-attention EMA) are subdivided, generating longer fine-grained token sequences for the same transformer encoder (weights shared). Coarse-stage features are re-injected through an MLP-based feature-reuse path.

The overall expected FLOPs is:

$$\mathbb{E}[\mathrm{FLOPs}] = \mathrm{FLOPs}_c + (1 - R)\,\mathrm{FLOPs}_f$$

with $R$ denoting the early-exit fraction, i.e., the proportion of images classified at the coarse stage.
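A minimal sketch of the confidence-gated routing and the expected-FLOPs bookkeeping is shown below, assuming precomputed coarse-stage class probabilities and a callable fine stage; patch selection and feature reuse are elided, and all names are placeholders rather than CF-ViT's actual API.

```python
from typing import Callable, Sequence

def expected_flops(flops_coarse: float, flops_fine: float, early_exit_rate: float) -> float:
    """E[FLOPs] = FLOPs_c + (1 - R) * FLOPs_f, with R the fraction of images exiting early."""
    return flops_coarse + (1.0 - early_exit_rate) * flops_fine

def two_stage_predict(coarse_probs: Sequence[float],
                      run_fine_stage: Callable[[], Sequence[float]],
                      eta: float) -> Sequence[float]:
    """Confidence-gated routing: exit at the coarse stage if the max class
    probability reaches the threshold eta; otherwise run the fine stage."""
    if max(coarse_probs) >= eta:
        return coarse_probs        # "easy" image: early exit
    return run_fine_stage()        # ambiguous image: re-encode at finer granularity

# Example: if half of the images exit early, the expected cost is FLOPs_c + 0.5 * FLOPs_f.
print(expected_flops(flops_coarse=1.0, flops_fine=2.0, early_exit_rate=0.5))  # -> 2.0
```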

Experiments on DeiT-S and LV-ViT-S backbones demonstrate that CF-ViT can reduce inference FLOPs by over 50% without accuracy loss, and can double throughput (e.g., from 2601 img/s to 4903 img/s) relative to standard non-cascaded ViTs. The model adaptively routes easy images for early exit while allocating additional computation to ambiguous instances (Chen et al., 2022).

6. Ablation Studies and Design Analysis

Comprehensive ablations on both CCFFN and CF-ViT validate the contribution of individual components:

  • CCFFN: Swapping with paired FFN weight-sharing (4× expansion) increases parameters (+0.3M) and energy (+7.6%), whereas CCFFN achieves a 1% accuracy gain while cutting compute costs by 15–20%. Optimal performance is reported for chunk count $n = 2$ and expansion ratios of 2.5–4×, with excessive chunking or removal of the cascade decreasing accuracy by up to 3% (Sivakumar et al., 18 Nov 2025).
  • CF-ViT: Using global class attention for patch selection outperforms using only the last attention layer or random regions. Feature reuse is essential; omitting it reduces accuracy by ~0.8%. The selection rate $\alpha$ modulates the trade-off between accuracy and FLOPs; the CE+KL objective consistently yields the highest fine-stage accuracy (Chen et al., 2022).

These results delineate the effectiveness of hierarchical and cascaded computation in minimizing resource requirements while maintaining or enhancing task accuracy.

7. Deployment and Practical Considerations

Both CViT and CF-ViT architectures are oriented toward real-time deployment on edge devices such as mobile phones, drones, and embedded systems. Key deployment metrics include:

  • Memory: CViT-L uses ~15% less GPU memory than EfficientViT-M4.
  • Latency: On iPhone 15 Pro, CViT-S/M/L/XL exhibit per-image latencies of 0.39/0.45/0.70/0.86 ms, outperforming EfficientViT counterparts.
  • Throughput: CViT variants achieve higher images/sec across GPU, CPU, and AI accelerators.
  • Energy: Measured on Apple M4 Pro, CViT reduces per-image energy consumption by 3–5% across scales.

A plausible implication is that cascaded architectural motifs—in both token-wise processing (CCFFN, CGA) and conditional inference (coarse-to-fine)—can substantially improve the feasibility of transformer models for battery-constrained and latency-sensitive environments, without significant trade-offs in classification accuracy (Sivakumar et al., 18 Nov 2025, Chen et al., 2022).
