ConvNeXt-tiny RGB Branch Overview
- RGB Branch based on ConvNeXt-tiny is a convolutional backbone that extracts hierarchical global features using depthwise convolutions, GELU activations, and channel-wise normalization.
- It adapts to varied tasks such as deepfake detection, manipulation localization, image classification, 3D object detection, and image dehazing with task-specific preprocessing and fusion strategies.
- Its flexible design allows for modifications in input processing, network architecture, and loss functions, ensuring robust performance while managing computational costs.
An RGB branch based on ConvNeXt-tiny represents a state-of-the-art convolutional backbone for extracting global visual features from RGB images. This architectural component appears across diverse tasks such as deepfake detection, manipulation localization, image classification, 3D object detection, and image dehazing. It leverages ConvNeXt’s hierarchical spatial feature extraction, depthwise separable convolutions, GELU activations, and channel-wise LayerNorm. Variations in input resolution, branching, attention fusion, loss supervision, and training regimes adapt this core design to the specific requirements of each application. The following sections synthesize the technical characteristics, integration paradigms, and empirical justifications for the RGB branch with ConvNeXt-tiny backbone.
1. Core Architecture and Computational Blocks
All reviewed works employ the ConvNeXt-tiny framework, which decomposes the image into progressively abstract spatial features through a patch embedding stem and four successive stages of inverted-bottleneck blocks. The standard configuration comprises 3, 3, 9, 3 residual blocks in stages 1–4, with channel widths 96, 192, 384, 768. Each block combines:
- Depthwise 7×7 (or 3×3) convolution (groups = input channels).
- Channel-wise LayerNorm, applied after the depthwise convolution.
- Pointwise 1×1 convolution for channel expansion (expansion ratio 4), followed by GELU, then a 1×1 projection back to the base width.
- Residual addition.
The pseudocode for a canonical ConvNeXt block is:

```
x₁ = DepthwiseConv7×7(groups=C)(x)
x₂ = LayerNorm(x₁)
x₃ = Conv1×1(C→4C)(x₂)
x₄ = GELU(x₃)
x₅ = Conv1×1(4C→C)(x₄)
y  = x + x₅
```
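A minimal PyTorch sketch of this block, assuming the channels-first input and channels-last normalization used by the reference ConvNeXt implementation (module and variable names here are illustrative, not taken from the cited works):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> channel LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 project, plus residual."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)                   # channel-wise norm in channels-last layout
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # 1x1 conv expressed as Linear over channels
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                         # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                         # back to (B, C, H, W)
        return shortcut + x

# Example: one stage-4 block at width 768 on a 7x7 feature map
y = ConvNeXtBlock(768)(torch.randn(1, 768, 7, 7))         # -> torch.Size([1, 768, 7, 7])
```

In practice, the full ConvNeXt-tiny backbone with ImageNet-pretrained weights is usually instantiated directly (e.g., via torchvision.models.convnext_tiny) rather than rebuilt block by block.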
2. Input Processing and Data Augmentation
The RGB branch’s input protocol typically includes (a minimal pipeline sketch follows this list):
- Precise spatial preprocessing: Face alignment and center cropping to 224×224 (ForensicFlow (Romani, 18 Nov 2025)), resizing to 256×256 (Noise and Edge dual-branch (Dagar et al., 2 Sep 2024)), or, for object detection, 480×480 (MonoNext (Pederiva et al., 2023)).
- Channel-wise normalization using ImageNet mean and standard deviation statistics.
- Training augmentations including random horizontal flipping (p=0.5), brightness and contrast jitter, minor geometric transforms (rotation ±5°, resized crops 0.9–1.0), and task-specific MixUp (EVCC) (Hasan et al., 24 Nov 2025).
- Input tensor shape post-augmentation is either B×3×224×224 or B×3×H×W depending on application.
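A minimal torchvision-style sketch of such an input pipeline for the 224×224 setting. The jitter strengths and crop scale here are illustrative assumptions rather than values reported by the cited works, and MixUp, where used (EVCC), is applied at the batch level rather than per image:

```python
from torchvision import transforms

# ImageNet channel statistics used for normalization
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # minor resized crops
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # brightness/contrast jitter (assumed strength)
    transforms.RandomRotation(degrees=5),                  # ±5° rotation
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```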
3. Output, Embedding, and Decoder Stages
After the final ConvNeXt stage, the feature map is post-processed to yield task-specific embeddings:
| Downstream Task | Output After ConvNeXt | Projection/Pooling | Final Shape |
|---|---|---|---|
| Video Deepfake Detection | 7×7×768 | GAP → Linear (768→512) | ℝ⁵¹² |
| Manipulation Localization | 8×8×768 | Upsample+FeatureEnh. Module | 8×8×768 → pixel mask |
| Image Classification | 14×14×512 | Flatten → Linear (512→384) | ℝ³⁸⁴ |
| 3D Object Detection (BEV) | 15×15×128* | Head-specific convolutions | 15×15×(task channels) |
| Non-homogeneous Dehazing | Variable† | U-Net style upsampling | H×W×3 |
* After four ConvNeXt-style blocks post-MobileNetV2 backbone; † after upsampling via three PixelShuffle+attention blocks.
Global average pooling is standard for video and classification tasks; spatial upsampling/decoding branches dominate in pixel-level localization or regression settings (Romani, 18 Nov 2025, Dagar et al., 2 Sep 2024, Hasan et al., 24 Nov 2025, Pederiva et al., 2023, Zhou et al., 2023).
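For the pooled-embedding case (e.g., the 7×7×768 → ℝ⁵¹² row above), a minimal sketch of the projection head might look as follows; the module name and layer arrangement are illustrative:

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Global average pooling over the final 7x7x768 ConvNeXt map, then a linear projection to 512-d."""
    def __init__(self, in_dim: int = 768, out_dim: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)     # (B, C, 7, 7) -> (B, C, 1, 1)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feat).flatten(1)     # (B, 768)
        return self.proj(pooled)                # (B, 512) per-frame / per-image embedding

emb = EmbeddingHead()(torch.randn(4, 768, 7, 7))   # -> torch.Size([4, 512])
```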
4. Fusion Strategies and Attention Mechanisms
Fusion paradigms adapt the RGB branch output for multi-branch architectures:
- Temporal Attention: ForensicFlow generates per-frame embeddings fₖ, then aggregates each K-frame segment via attention-driven pooling, v = Σₖ αₖ·fₖ, with the weights αₖ obtained from an MLP followed by a softmax over the K frames (Romani, 18 Nov 2025); a minimal pooling sketch follows this list.
- Multi-Modal/Branch Fusion: Adaptive attention weighting for late fusion, as in ForensicFlow and EVCC: z_fused = Σᵢ wᵢ·zᵢ over the branch embeddings zᵢ, with the weights wᵢ produced by a learnable softmax or, in EVCC, by a fusion router with confidence modulation (Romani, 18 Nov 2025, Hasan et al., 24 Nov 2025).
- Cross-Attention: EVCC applies gated, bidirectional cross-attention between ConvNeXt and Transformer tokens after token pruning, refining semantic alignment (Hasan et al., 24 Nov 2025).
- Spatial Fusion: Manipulation localization and dehazing architectures sum or blend branch outputs per-pixel, optionally after applying attention or learned blending maps (Dagar et al., 2 Sep 2024, Zhou et al., 2023).
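A minimal sketch of the MLP-plus-softmax attention pooling described above, assuming 512-dimensional per-frame embeddings; the hidden width and module names are illustrative:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Aggregate K per-frame embeddings into one clip embedding: v = sum_k alpha_k * f_k,
    with alpha = softmax(MLP(f)) computed over the frame axis."""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # frames: (B, K, D)
        alpha = torch.softmax(self.score(frames), dim=1)       # (B, K, 1), weights sum to 1 over K
        return (alpha * frames).sum(dim=1)                     # (B, D) segment embedding

clip = AttentionPool()(torch.randn(2, 16, 512))                # 16-frame segment -> (2, 512)
```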
5. Loss Functions and Optimization
Each network configures loss and training schedules to suit the detection, localization, or generative target:
- Global/Binary Tasks (e.g., Deepfake Detection): Focal Loss with tunable class-balance weight α and focusing parameter γ for severe class imbalance (Romani, 18 Nov 2025).
- Segmentation/Localization: Joint supervision with BCE, Focal, and Edge Dice losses targeting both per-pixel masks and boundary localization, combined as a weighted sum L_total = λ₁·L_BCE + λ₂·L_Focal + λ₃·L_EdgeDice, where the edge term explicitly sharpens structure awareness (Dagar et al., 2 Sep 2024).
- Regression/Image Restoration: Per-branch L1 and perceptual (VGG) loss, MS-SSIM for structure, and adversarial GAN losses for photo quality (in dehazing) (Zhou et al., 2023).
- Detection: Sum of confidence, class, box (IoU and angle) losses, with explicit loss weights for each aspect as in 3D object detection (Pederiva et al., 2023).
- Common optimization settings include AdamW/Adam optimizers, base learning rates of 10⁻⁴ or 2×10⁻⁵, learning rate decay (per unfreeze step or on a schedule), and progressive unfreezing for transfer learning (Romani, 18 Nov 2025, Hasan et al., 24 Nov 2025); a minimal loss-and-optimizer sketch follows this list.
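A minimal sketch of a binary focal loss and an AdamW setup with a lower learning rate on the pretrained backbone than on the new head, in the spirit of the settings above. The α and γ values are placeholder defaults, not values confirmed for any specific system, and the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss for a global real/fake head. targets: float 0/1 tensor with the same shape as logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weighting
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def build_optimizer(backbone, head):
    # Lower rate for the pretrained ConvNeXt backbone, higher rate for the freshly initialized head,
    # mirroring the progressive-unfreezing / transfer-learning regime described above.
    return torch.optim.AdamW(
        [{"params": backbone.parameters(), "lr": 2e-5},
         {"params": head.parameters(), "lr": 1e-4}],
        weight_decay=0.05,
    )
```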
6. Architectural Adaptations and Contextual Significance
Task-dependent modifications include:
- Downsampling Variations: Input size and patch/stem stride redefine downsampled feature map dimensions (e.g., 64×64@96 for 256×256 input (Dagar et al., 2 Sep 2024); 7×7@768 for 224×224 (Romani, 18 Nov 2025)).
- Stage Truncation and Lightweighting: For efficiency, some branches omit final stages (e.g., Dehazing omits Stage 4 (Zhou et al., 2023)) or shrink width/depth factors (EVCC) (Hasan et al., 24 Nov 2025).
- Branch-Specific Heads: Detection/Restoration tasks append task-tuned decoders over the ConvNeXt output, while classification or video understanding ultimately reduce to global embeddings (Romani, 18 Nov 2025, Pederiva et al., 2023).
- Token Pruning and Late Routing: In hybrid networks such as EVCC, adaptive token pruning halves the length of the ConvNeXt output sequence and fuses it with other modalities via dynamic gates, reducing FLOPs by 25–35% while maintaining accuracy (Hasan et al., 24 Nov 2025); a schematic pruning sketch follows this list.
- Attention and Feature Enhancement: Modules such as Feature Enhancement (F_enh), per-pixel blending weights, and multi-scale channel attention refine spatial detail prior to fusion or decoding (Dagar et al., 2 Sep 2024, Zhou et al., 2023).
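A schematic sketch of score-based token pruning that halves the ConvNeXt token sequence before cross-attention fusion. The norm-based scoring rule and function name are simple stand-ins, since the actual EVCC selection criterion is not specified here:

```python
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top keep_ratio fraction of tokens by L2 norm (a placeholder for a learned
    importance score). tokens: (B, N, D) flattened ConvNeXt feature map."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = tokens.norm(dim=-1)                  # (B, N) token-importance proxy
    idx = scores.topk(k, dim=1).indices           # indices of the k most salient tokens per sample
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))  # (B, k, D)

pruned = prune_tokens(torch.randn(2, 49, 768))    # 7x7 token grid -> 24 retained tokens
```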
7. Summary Table: RGB Branch (ConvNeXt-tiny) Design Across Representative Systems
| System (arXiv) | Input Size | Depths (Blocks) | Widths | Output Embedding | Key Fusion/Decoder | Main Losses |
|---|---|---|---|---|---|---|
| ForensicFlow (Romani, 18 Nov 2025) | 224×224 | [3,3,9,3] | [96,192,384,768] | 512-dim (per frame) | Attention pooling, tri-modal fusion | Focal (α,γ) |
| Noise+Edge (Dagar et al., 2 Sep 2024) | 256×256 | [3,3,9,3] | [96,192,384,768] | 8×8×768 | FE module, pixel-wise fusion | BCE, Focal, Edge |
| EVCC (Hasan et al., 24 Nov 2025) | 224×224 | [2,2,6,2] | [64,128,256,512] | 384-dim | Pruning, cross-attention, router | Cross-entropy |
| MonoNext (Pederiva et al., 2023) | 480×480 | 4 ConvNeXt blocks* | [512,256,256,128] | 15×15×128 (BEV grid) | 5 heads: class/box/yaw/score | L2, IoU, angle |
| Dehazing (Zhou et al., 2023) | H×W | [3,3,9] | [96,192,384] | Variable spatial maps | U-Net upsampling, per-pixel blend | L1, perceptual, MS-SSIM |
* After MobileNetV2 backbone.
References
- ForensicFlow (Romani, 18 Nov 2025)
- Noise and Edge dual-branch (Dagar et al., 2 Sep 2024)
- EVCC (Hasan et al., 24 Nov 2025)
- MonoNext (Pederiva et al., 2023)
- Breaking Through the Haze (Zhou et al., 2023)
The ConvNeXt-tiny-based RGB branch provides a robust, extensible foundation for integrating global spatial-semantic information into vision models. Through carefully balanced architectural hyperparameters and task-adaptive fusion strategies, its variants demonstrate strong empirical performance across detection, localization, recognition, and restoration tasks.