C2f Module in Neural Networks
- The C2f module is a neural network paradigm that partitions processing into coarse and fine stages to progressively refine outputs.
- It employs multi-stage cascades and split-merge block designs to enhance gradient flow, feature representation, and overall computational efficiency.
- Empirical studies in object detection, image restoration, and segmentation demonstrate notable accuracy improvements with minimal computational overhead.
A C2f ("Coarse-to-Fine") module is a neural network architectural or algorithmic construct that systematically partitions computation into sequential stages of differing granularity or abstraction. The coarse stage carries out an initial, semantically broad or computationally simplified operation, while the fine stage incrementally refines, corrects, or localizes the output. The C2f paradigm is instantiated in diverse research domains—including CNN backbones, diffusion models, temporal reasoning, and vision-language tasks—with specific module architectures and optimization regimes tailored to the application. Prominent C2f modules deliver improvements in optimization, interpretability, computational efficiency, and accuracy by exploiting hierarchical information processing pathways.
1. Canonical C2f Module Architectures
The C2f architectural motif is realized across neural models predominantly in two forms: explicit two-stage (or multi-stage) cascades and multi-path or split-merge block designs.
- Split-and-Stacked-Conv C2f (YOLOv5s/YOLOv8, Fab-ME): The canonical C2f block, as in YOLOv8 and its variants, replaces bottleneck-centric "C3" modules. It comprises an initial convolution that expands (or maintains) the feature channel width, followed by multiple lightweight Conv+BN+SiLU sub-modules (the "f" branches) applied in series. Intermediate outputs (including the bypassed identity path) are concatenated and fused via a final convolution. This design increases the number of gradient-propagation paths, improving feature flow and representation reuse. For example, YOLOv5s-C2f creates several parallel gradient paths, compared to only two in C3 (He et al., 2023); a PyTorch sketch follows this list.
- C2f-VMamba (Fab-ME): The C2f-VMamba block replaces standard C2f with an integration of Visual State-Space (VSS) modules. After the initial convolution and channel split, one path proceeds through a stack of VSS (combining depthwise convolution and state-space 2D scanning for global context), and the outputs, alongside original and bypassed features, are concatenated before a final convolution (Wang et al., 4 Dec 2024).
- Coarse-to-Fine Calibration (DualCP): In the domain-incremental learning scenario, the C2f calibrator consists of sequential multilayer perceptrons. The initial "coarse" MLP aligns image embeddings to a set of coarse-grained semantic prototypes; its output, concatenated with the original embedding, is further refined by category- or group-specific "fine" MLPs before cross-level alignment via dot-regression losses (Wang et al., 23 Mar 2025).
- ACFM (Context-aware Cross-level Fusion, C2F-Net): For feature fusion, as in camouflaged object detection, the Coarse-to-Fine module (ACFM) fuses features from different depths via upsampling, summing, and a learnable attention gating informed by multi-scale channel attention (MSCA). The attention map adaptively weights deep versus shallow features, and a convolutional head finalizes the fusion (Sun et al., 2021, Chen et al., 2022); a second sketch follows this list.
- Diffusion Models (C2F-DFT): In diffusion architectures, a two-stage optimization is imposed: first, a Transformer-based network learns to denoise via self-attention (the "coarse" phase, focusing on global noise), then, after synthetic sampling using the coarse model, a "fine" phase applies supervised refinement at the final image-level with a direct reconstruction loss (Wang et al., 2023).
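To make the split-and-stacked-conv design concrete, the following is a minimal PyTorch sketch of a C2f-style block as described in the first bullet above. The layer widths, the number of stacked blocks `n`, and the use of plain Conv+BN+SiLU sub-modules in place of full bottlenecks are illustrative assumptions, not the reference YOLOv8 implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv + BatchNorm + SiLU: the lightweight sub-module used throughout C2f."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C2f(nn.Module):
    """Split-and-stacked-conv C2f: project, split, apply n serial conv blocks,
    keep every intermediate output, concatenate, and fuse with a 1x1 conv."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2                       # width of each split path
        self.proj = ConvBNSiLU(c_in, 2 * self.c, k=1)
        self.blocks = nn.ModuleList(
            ConvBNSiLU(self.c, self.c, k=3) for _ in range(n)
        )
        self.fuse = ConvBNSiLU((2 + n) * self.c, c_out, k=1)

    def forward(self, x):
        y = list(self.proj(x).chunk(2, dim=1))    # identity path + working path
        for block in self.blocks:
            y.append(block(y[-1]))                # each output is kept for concat
        return self.fuse(torch.cat(y, dim=1))     # (2 + n) gradient paths

x = torch.randn(1, 64, 32, 32)
print(C2f(64, 64, n=2)(x).shape)                  # torch.Size([1, 64, 32, 32])
```

Similarly, a hedged sketch of an ACFM-style coarse-to-fine fusion: the multi-scale channel attention (MSCA) is approximated here by a simple squeeze-and-excite-style gate, which is an assumption; the published module differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACFMSketch(nn.Module):
    """Attention-gated cross-level fusion (sketch): upsample the deep feature,
    sum with the shallow one, derive a channel-attention gate, and blend the
    two levels before a conv head. Assumes both levels carry c channels."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(c, c // 4, 1), nn.ReLU(),
                                  nn.Conv2d(c // 4, c, 1), nn.Sigmoid())
        self.head = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        a = self.gate(deep + shallow)             # adaptive deep-vs-shallow weighting
        return self.head(a * deep + (1 - a) * shallow)
```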
2. Mathematical Formulation and Module Workflow
C2f modules entail a staged sequence of mappings or transformations:
Generic C2f Block (CNN Example)
Given input $X \in \mathbb{R}^{C \times H \times W}$:
- Initial projection and split: $(y_0, y_1) = \mathrm{Split}(\mathrm{Conv}_{1\times1}(X))$
- $n$ stacked lightweight conv blocks: $y_{i+1} = f_i(y_i)$, $i = 1, \dots, n$
- Concatenation: $Z = \mathrm{Concat}(y_0, y_1, \dots, y_{n+1})$
- Output: $O = \mathrm{Conv}_{1\times1}(Z)$
This diversifies feature paths and supports improved backpropagation, as the total derivative $\partial O / \partial X$ sums over each concatenated branch.
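This branch-wise accumulation is easy to verify with autograd; the toy example below substitutes scalar scalings for the conv blocks (an assumption for brevity) and shows the input gradient summing one contribution per concatenated path.

```python
import torch

# Identity path plus two serial "branches" standing in for f_1, f_2.
x = torch.ones(4, requires_grad=True)
y1 = 2.0 * x                           # branch f_1
y2 = 3.0 * y1                          # branch f_2 (serial, so factor 6 w.r.t. x)
out = torch.cat([x, y1, y2]).sum()     # concatenation followed by a sum "fusion"
out.backward()
# dOut/dx = 1 (identity) + 2 (f_1) + 6 (f_2) = 9 on every element
print(x.grad)                          # tensor([9., 9., 9., 9.])
```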
C2F Calibration (MLP-based)
Given a frozen feature $z$ and coarse/fine prototypes $\{p^{c}_{k}\}$, $\{p^{f}_{j}\}$:
- Coarse alignment: $\hat{z}_c = \mathrm{MLP}_c(z)$
- Fine alignment: $\hat{z}_f = \mathrm{MLP}_f([\,z;\hat{z}_c\,])$
- Dual dot-regression loss: $\mathcal{L} = \ell_{\mathrm{dot}}(\hat{z}_c, p^{c}_{y}) + \lambda\,\ell_{\mathrm{dot}}(\hat{z}_f, p^{f}_{y})$, where $\ell_{\mathrm{dot}}$ denotes the dot-regression alignment loss to the target prototype and $\lambda$ balances the two levels
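A minimal sketch of this calibration workflow follows, assuming two-layer MLPs with hidden size equal to the input dimension (the configuration quoted in Section 4) and a cosine-based dot-regression surrogate; per-group routing of the fine MLPs and DualCP's exact loss are simplified away.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C2fCalibrator(nn.Module):
    """Coarse MLP aligns a frozen embedding; its output, concatenated with the
    original embedding, feeds a fine MLP (simplified: no per-group routing)."""
    def __init__(self, dim):
        super().__init__()
        # Two-layer MLPs, hidden size = input dim (per the Section 4 ablation).
        self.coarse = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))
        self.fine = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, z):
        z_c = self.coarse(z)
        z_f = self.fine(torch.cat([z, z_c], dim=-1))
        return z_c, z_f

def dot_regression_loss(z_hat, prototype):
    """One plausible dot-regression form (assumed): push the cosine similarity
    between the calibrated embedding and its target prototype toward 1."""
    return (F.cosine_similarity(z_hat, prototype, dim=-1) - 1.0).pow(2).mean()

# Hypothetical usage with random features and prototypes.
dim = 512
calib = C2fCalibrator(dim)
z = torch.randn(128, dim)                        # frozen backbone features
p_c, p_f = torch.randn(128, dim), torch.randn(128, dim)
z_c, z_f = calib(z)
loss = dot_regression_loss(z_c, p_c) + dot_regression_loss(z_f, p_f)
```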
Coarse-to-Fine Diffusion
- Stage 1 (coarse): noise-level training of the Transformer denoiser, e.g. $\mathcal{L}_{\mathrm{coarse}} = \mathbb{E}_{t,\epsilon}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_1 \big]$
- Stage 2 (fine): sampling $\hat{x}_0$ with the frozen coarse model, then image-level refinement, e.g. $\mathcal{L}_{\mathrm{fine}} = \lVert \hat{x}_0 - x_0 \rVert_1 + \lambda\,\big(1 - \mathrm{SSIM}(\hat{x}_0, x_0)\big)$
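The two stages can be sketched as separate objectives; the linear noise schedule, the L1 norms, and the SSIM stand-in below are assumptions for illustration, not the C2F-DFT training code.

```python
import torch

def coarse_loss(eps_model, x0, t, alphas_cumprod):
    """Stage 1 (coarse): standard noise-prediction objective on x_t (sketch)."""
    eps = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps       # forward diffusion
    return (eps_model(x_t, t) - eps).abs().mean()      # L1 on predicted noise

def fine_loss(x0_hat, x0, ssim_fn, lam=0.1):
    """Stage 2 (fine): image-level reconstruction on samples drawn with the
    frozen coarse model; `ssim_fn` and `lam` are placeholder assumptions."""
    return (x0_hat - x0).abs().mean() + lam * (1.0 - ssim_fn(x0_hat, x0))

# Hypothetical shape check with a dummy denoiser and a toy noise schedule.
eps_model = lambda x_t, t: torch.zeros_like(x_t)
acp = torch.linspace(0.99, 0.01, steps=1000)
x0 = torch.randn(2, 3, 64, 64)
t = torch.randint(0, 1000, (2,))
print(coarse_loss(eps_model, x0, t, acp))
```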
3. Applications and Empirical Evidence
The C2f approach pervades diverse domains:
- Object Detection (YOLOv5s-C2f): The C2f module, with increased branching and concatenated intermediate features, boosts mAP@0.5 from 77.8% (C3 baseline) to 79.8% with only a marginal increase in parameter count and inference time. Combined with SPPF-CSP (YOLOv5s-Straw), mAP exceeds 80%, with real-time latency preserved (He et al., 2023).
- Vision State-Space Fusion (Fab-ME): Replacing C2f modules with C2f-VMamba in the neck of YOLOv8s increases mAP@0.5 by up to 2.5% per block replaced, without perceptible loss in processing speed; the module maintains 60 fps inference on 640×640 images (Wang et al., 4 Dec 2024).
- Domain-Incremental Learning: The DualCP C2f calibrator aligns features to both coarse- and fine-level prototypes, preventing catastrophic forgetting and enabling tight semantic clustering in new domains; an optimally tuned loss-balancing hyperparameter between the coarse and fine terms yields maximal transfer accuracy (Wang et al., 23 Mar 2025).
- Camouflaged/Object Segmentation: Context-aware fusion modules implementing a coarse-to-fine attention blend (ACFM) outperform alternatives on COD10K, CAMO, and similar datasets, improving key segmentation metrics by more than 2% over baselines (Sun et al., 2021, Chen et al., 2022).
- Diffusion Image Restoration: The C2F-DFT model, via hierarchical training and sampling, surpasses previous diffusion models on image deraining/deblurring/denoising; the fine stage's image-level reconstruction + SSIM loss corrects for noise misestimation in the coarse phase (Wang et al., 2023).
4. Optimization Strategies and Training Protocols
C2f networks typically require staged or composite optimization:
- CNN C2f Blocks: Trained end-to-end by standard SGD or Adam-type optimizers, with cross-entropy for detection/classification.
- Calibrator Modules: The coarse and fine MLPs in DualCP's C2f calibrator are optimized using dot-product regression losses, with SGD, batch size 128, weight decay, and a cosine LR schedule (see the configuration sketch after this list). Empirical ablation reveals that two-layer MLPs with a hidden size matching the input dimensionality are optimal (Wang et al., 23 Mar 2025).
- Diffusion/Generative Models: Coarse (noise) and fine (image/feature) refinement stages are optimized sequentially; a frozen coarse model seeds fine-stage gradients (Wang et al., 2023).
- C2f in Summarization (C2F-FAR): Unsupervised coarse segmentation based on sentence embeddings is followed by fine-level re-ranking, with only segmentation thresholds and selection sizes tuned (Lu et al., 2023).
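The calibrator recipe quoted above maps directly onto standard PyTorch components; the sketch below wires them together. The learning rate, momentum, epoch count, and the dot-regression form are assumptions not stated in the source.

```python
import torch
import torch.nn.functional as F

# SGD with weight decay and a cosine LR schedule, batch size 128, matching the
# quoted protocol; lr/momentum/T_max are illustrative assumptions.
dim = 512
calibrator = torch.nn.Sequential(
    torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))
optimizer = torch.optim.SGD(calibrator.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    z = torch.randn(128, dim)                              # dummy frozen features
    p = F.normalize(torch.randn(128, dim), dim=-1)         # dummy prototypes
    z_hat = F.normalize(calibrator(z), dim=-1)
    loss = ((z_hat * p).sum(-1) - 1.0).pow(2).mean()       # dot-regression form (assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```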
5. Computational Complexity and Efficiency
The following table summarizes the parameter, speed, and accuracy profiles reported for representative modules:
| Module | Parameters | Inference Time | mAP@0.5 |
|---|---|---|---|
| YOLOv5s-C3 | baseline | baseline | 77.8% |
| YOLOv5s-C2f | +4.8% per block | +1.9 ms/img | +2.0% |
| YOLOv8s | 11.10 M | 21 ms/img | 57.4% |
| Fab-ME (C2f-VMamba) | ~11.0 M | 18 ms/img | 59.0% |
The C2f approach improves accuracy and robustness at negligible computational overhead. Empirically, multiple datasets report mAP gains on the order of 2 points for roughly a 5% parameter and runtime increase per block (He et al., 2023, Wang et al., 4 Dec 2024).
6. Extensions and Domain-specific Adaptations
C2f modules demonstrate versatility:
- Trajectory Prediction: C2F-TP employs a coarse spatial-temporal interaction model to sample from a multimodal trajectory distribution, followed by a conditional denoising module, resulting in improved uncertainty reduction and predictive accuracy (Wang et al., 17 Dec 2024).
- Space Grounding (VLMs): C2F-Space leverages a VLM-based coarse mask (from grid-based and propose-validate prompting) refined by superpixel-level residual learning, delivering a 78% success rate in spatial instruction following—over 20% higher than baselines (Oh et al., 19 Nov 2025).
- Amodal Segmentation: C2F-Seg models the object mask in progressively more detailed representations, combining a vector-quantized latent code coarse step with a convolutional fine-stage, showing gains in both occluded and full-region IoU on KINS/COCO-A (Gao et al., 2023).
7. Theoretical and Empirical Rationale
C2f modules provide several interconnected benefits:
- Gradient Path Diversity: Split-and-merge structures in C2f (YOLO/CSP variants) enhance gradient flow and alleviate vanishing-gradient phenomena in deep stacks.
- Representation Hierarchy: Explicitly separating coarse and fine feature extraction supports robust generalization and transfer, capturing both global context (coarse) and localized detail (fine).
- Data- or Task-Driven Refinement: Domains with high multi-modality or uncertainty (image denoising, trajectory forecasting) benefit from a staged pipeline where an initial guess is methodically refined conditioned on learned context.
- Empirical Robustness: Across object detection, COD, and generative modeling, C2f modules consistently outperform single-stage or non-attention alternatives, both qualitatively (boundary sharpness, region consistency) and quantitatively.
In summary, the C2f module is a general, empirically validated architectural paradigm for hierarchical, staged, or multi-level learning within neural models across vision, language, and time-series domains (He et al., 2023, Wang et al., 4 Dec 2024, Wang et al., 23 Mar 2025, Sun et al., 2021, Chen et al., 2022, Wang et al., 2023, Oh et al., 19 Nov 2025, Gao et al., 2023, Wang et al., 17 Dec 2024, Lu et al., 2023).