ConvNeXt-tiny RGB Branch Overview
- RGB Branch based on ConvNeXt-tiny is a convolutional backbone that extracts hierarchical global features using depthwise convolutions, GELU activations, and channel-wise normalization.
- It adapts to varied tasks such as deepfake detection, manipulation localization, image classification, 3D object detection, and image dehazing with task-specific preprocessing and fusion strategies.
- Its flexible design allows for modifications in input processing, network architecture, and loss functions, ensuring robust performance while managing computational costs.
An RGB branch based on ConvNeXt-tiny represents a state-of-the-art convolutional backbone for extracting global visual features from RGB images. This architectural component appears across diverse tasks such as deepfake detection, manipulation localization, image classification, 3D object detection, and image dehazing. It leverages ConvNeXt’s hierarchical spatial feature extraction, depthwise separable convolutions, GELU activations, and channel-wise LayerNorm. Variations in input resolution, branching, attention fusion, loss supervision, and training regimes adapt this core design to the specific requirements of each application. The following sections synthesize the technical characteristics, integration paradigms, and empirical justifications for the RGB branch with ConvNeXt-tiny backbone.
1. Core Architecture and Computational Blocks
All reviewed works employ the ConvNeXt-tiny framework, which decomposes the image into progressively abstract spatial features through a patch embedding stem and four successive stages of inverted-bottleneck blocks. The standard configuration comprises 3, 3, 9, 3 residual blocks in stages 1–4, with channel widths 96, 192, 384, 768. Each block combines:
- Depthwise 7×7 (or 3×3) convolution (groups = input channels).
- Channel-wise LayerNorm, applied after the depthwise convolution.
- Pointwise 1×1 convolution for channel expansion (expansion ratio 4), followed by GELU, then a 1×1 projection back to the base width.
- Residual addition.
The pseudocode for a canonical ConvNeXt block is:

```
x₁ = DepthwiseConv7×7(groups=C)(x)
x₂ = LayerNorm(x₁)
x₃ = Conv1×1(C→4C)(x₂)
x₄ = GELU(x₃)
x₅ = Conv1×1(4C→C)(x₄)
y  = x + x₅
```
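A minimal PyTorch sketch of this block, assuming the channels-first input and channels-last normalization used by the reference ConvNeXt implementation (module and variable names here are illustrative, not taken from the cited works):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> channel LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 project, plus residual."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)                   # channel-wise norm in channels-last layout
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # 1x1 conv expressed as Linear over channels
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                         # (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                         # back to (B, C, H, W)
        return shortcut + x

# Example: one stage-4 block at width 768 on a 7x7 feature map
y = ConvNeXtBlock(768)(torch.randn(1, 768, 7, 7))         # -> torch.Size([1, 768, 7, 7])
```

In practice, the full ConvNeXt-tiny backbone with ImageNet-pretrained weights is usually instantiated directly (e.g., via torchvision.models.convnext_tiny) rather than rebuilt block by block.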
2. Input Processing and Data Augmentation
The RGB branch’s input protocol typically includes (a minimal pipeline sketch follows this list):
- Precise spatial preprocessing: Face alignment and center cropping to 224×224 (ForensicFlow (Romani, 18 Nov 2025)), resizing to 256×256 (Noise and Edge dual-branch (Dagar et al., 2 Sep 2024)), or, for object detection, 480×480 (MonoNext (Pederiva et al., 2023)).
- Channel-wise normalization using ImageNet mean and standard deviation statistics.
- Training augmentations including random horizontal flipping (p=0.5), brightness and contrast jitter, minor geometric transforms (rotation ±5°, resized crops 0.9–1.0), and task-specific MixUp (EVCC) (Hasan et al., 24 Nov 2025).
- Input tensor shape post-augmentation is either B×3×224×224 or B×3×H×W depending on application.
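A minimal torchvision-style sketch of such an input pipeline for the 224×224 setting. The jitter strengths and crop scale here are illustrative assumptions rather than values reported by the cited works, and MixUp, where used (EVCC), is applied at the batch level rather than per image:

```python
from torchvision import transforms

# ImageNet channel statistics used for normalization
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # minor resized crops
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # brightness/contrast jitter (assumed strength)
    transforms.RandomRotation(degrees=5),                  # ±5° rotation
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```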
3. Output, Embedding, and Decoder Stages
After the final ConvNeXt stage, the feature map is post-processed to yield task-specific embeddings:
| Downstream Task | Output After ConvNeXt | Projection/Pooling | Final Shape |
|---|---|---|---|
| Video Deepfake Detection | 7×7×768 | GAP → Linear (768→512) | ℝ⁵¹² |
| Manipulation Localization | 8×8×768 | Upsample+FeatureEnh. Module | 8×8×768 → pixel mask |
| Image Classification | 14×14×512 | Flatten → Linear (512→384) | ℝ³⁸⁴ |
| 3D Object Detection (BEV) | 15×15×128* | Head-specific convolutions | 15×15×(task channels) |
| Non-homogeneous Dehazing | Variable† | U-Net style upsampling | H×W×3 |
* After four ConvNeXt-style blocks post-MobileNetV2 backbone; † after upsampling via three PixelShuffle+attention blocks.
Global average pooling is standard for video and classification tasks; spatial upsampling/decoding branches dominate in pixel-level localization or regression settings (Romani, 18 Nov 2025, Dagar et al., 2 Sep 2024, Hasan et al., 24 Nov 2025, Pederiva et al., 2023, Zhou et al., 2023).
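For the pooled-embedding case (e.g., the 7×7×768 → ℝ⁵¹² row above), a minimal sketch of the projection head might look as follows; the module name and layer arrangement are illustrative:

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Global average pooling over the final 7x7x768 ConvNeXt map, then a linear projection to 512-d."""
    def __init__(self, in_dim: int = 768, out_dim: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)     # (B, C, 7, 7) -> (B, C, 1, 1)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        pooled = self.pool(feat).flatten(1)     # (B, 768)
        return self.proj(pooled)                # (B, 512) per-frame / per-image embedding

emb = EmbeddingHead()(torch.randn(4, 768, 7, 7))   # -> torch.Size([4, 512])
```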
4. Fusion Strategies and Attention Mechanisms
Fusion paradigms adapt the RGB branch output for multi-branch architectures:
- Temporal Attention: ForensicFlow generates per-frame embeddings fₖ, then aggregates each K-frame segment via attention-driven pooling, v = Σₖ αₖ·fₖ, with the weights αₖ obtained from an MLP followed by a softmax over the K frames (Romani, 18 Nov 2025); a minimal pooling sketch follows this list.
- Multi-Modal/Branch Fusion: Adaptive attention weighting for late fusion, as in ForensicFlow and EVCC: z_fused = Σᵢ wᵢ·zᵢ over the branch embeddings zᵢ, with the weights wᵢ produced by a learnable softmax or, in EVCC, by a fusion router with confidence modulation (Romani, 18 Nov 2025, Hasan et al., 24 Nov 2025).
- Cross-Attention: EVCC applies gated, bidirectional cross-attention between ConvNeXt and Transformer tokens after token pruning, refining semantic alignment (Hasan et al., 24 Nov 2025).
- Spatial Fusion: Manipulation localization and dehazing architectures sum or blend branch outputs per-pixel, optionally after applying attention or learned blending maps (Dagar et al., 2 Sep 2024, Zhou et al., 2023).
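A minimal sketch of the MLP-plus-softmax attention pooling described above, assuming 512-dimensional per-frame embeddings; the hidden width and module names are illustrative:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Aggregate K per-frame embeddings into one clip embedding: v = sum_k alpha_k * f_k,
    with alpha = softmax(MLP(f)) computed over the frame axis."""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # frames: (B, K, D)
        alpha = torch.softmax(self.score(frames), dim=1)       # (B, K, 1), weights sum to 1 over K
        return (alpha * frames).sum(dim=1)                     # (B, D) segment embedding

clip = AttentionPool()(torch.randn(2, 16, 512))                # 16-frame segment -> (2, 512)
```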
5. Loss Functions and Optimization
Each network configures loss and training schedules to suit the detection, localization, or generative target:
- Global/Binary Tasks (e.g., Deepfake Detection): Focal Loss with tunable class-balance weight α and focusing parameter γ for severe class imbalance (Romani, 18 Nov 2025).
- Segmentation/Localization: Joint supervision with BCE, Focal, and Edge Dice losses targeting both per-pixel masks and boundary localization, combined as a weighted sum L_total = λ₁·L_BCE + λ₂·L_Focal + λ₃·L_EdgeDice, where the edge term explicitly sharpens structure awareness (Dagar et al., 2 Sep 2024).
- Regression/Image Restoration: Per-branch L1 and perceptual (VGG) loss, MS-SSIM for structure, and adversarial GAN losses for photo quality (in dehazing) (Zhou et al., 2023).
- Detection: Sum of confidence, class, box (IoU and angle) losses, with explicit loss weights for each aspect as in 3D object detection (Pederiva et al., 2023).
- Common optimization settings include AdamW/Adam optimizers, base learning rates of 10⁻⁴ or 2×10⁻⁵, learning rate decay (per unfreeze step or on a schedule), and progressive unfreezing for transfer learning (Romani, 18 Nov 2025, Hasan et al., 24 Nov 2025); a minimal loss-and-optimizer sketch follows this list.
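A minimal sketch of a binary focal loss and an AdamW setup with a lower learning rate on the pretrained backbone than on the new head, in the spirit of the settings above. The α and γ values are placeholder defaults, not values confirmed for any specific system, and the helper names are illustrative:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss for a global real/fake head. targets: float 0/1 tensor with the same shape as logits."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                    # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weighting
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def build_optimizer(backbone, head):
    # Lower rate for the pretrained ConvNeXt backbone, higher rate for the freshly initialized head,
    # mirroring the progressive-unfreezing / transfer-learning regime described above.
    return torch.optim.AdamW(
        [{"params": backbone.parameters(), "lr": 2e-5},
         {"params": head.parameters(), "lr": 1e-4}],
        weight_decay=0.05,
    )
```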
6. Architectural Adaptations and Contextual Significance
Task-dependent modifications include:
- Downsampling Variations: Input size and patch/stem stride redefine downsampled feature map dimensions (e.g., 64×64@96 for 256×256 input (Dagar et al., 2 Sep 2024); 7×7@768 for 224×224 (Romani, 18 Nov 2025)).
- Stage Truncation and Lightweighting: For efficiency, some branches omit final stages (e.g., Dehazing omits Stage 4 (Zhou et al., 2023)) or shrink width/depth factors (EVCC) (Hasan et al., 24 Nov 2025).
- Branch-Specific Heads: Detection/Restoration tasks append task-tuned decoders over the ConvNeXt output, while classification or video understanding ultimately reduce to global embeddings (Romani, 18 Nov 2025, Pederiva et al., 2023).
- Token Pruning and Late Routing: In hybrid networks such as EVCC, adaptive token pruning halves the length of the ConvNeXt output sequence and fuses it with other modalities via dynamic gates, reducing FLOPs by 25–35% while maintaining accuracy (Hasan et al., 24 Nov 2025); a schematic pruning sketch follows this list.
- Attention and Feature Enhancement: Modules such as Feature Enhancement (F_enh), per-pixel blending weights, and multi-scale channel attention refine spatial detail prior to fusion or decoding (Dagar et al., 2 Sep 2024, Zhou et al., 2023).
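A schematic sketch of score-based token pruning that halves the ConvNeXt token sequence before cross-attention fusion. The norm-based scoring rule and function name are simple stand-ins, since the actual EVCC selection criterion is not specified here:

```python
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top keep_ratio fraction of tokens by L2 norm (a placeholder for a learned
    importance score). tokens: (B, N, D) flattened ConvNeXt feature map."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = tokens.norm(dim=-1)                  # (B, N) token-importance proxy
    idx = scores.topk(k, dim=1).indices           # indices of the k most salient tokens per sample
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))  # (B, k, D)

pruned = prune_tokens(torch.randn(2, 49, 768))    # 7x7 token grid -> 24 retained tokens
```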
7. Summary Table: RGB Branch (ConvNeXt-tiny) Design Across Representative Systems
| System (arXiv) | Input Size | Depths (Blocks) | Widths | Output Embedding | Key Fusion/Decoder | Main Losses |
|---|---|---|---|---|---|---|
| ForensicFlow (Romani, 18 Nov 2025) | 224×224 | [3,3,9,3] | [96,192,384,768] | 512-dim (per frame) | Attention pooling, tri-modal fusion | Focal (α,γ) |
| Noise+Edge (Dagar et al., 2 Sep 2024) | 256×256 | [3,3,9,3] | [96,192,384,768] | 8×8×768 | FE module, pixel-wise fusion | BCE, Focal, Edge |
| EVCC (Hasan et al., 24 Nov 2025) | 224×224 | [2,2,6,2] | [64,128,256,512] | 384-dim | Pruning, cross-attention, router | Cross-entropy |
| MonoNext (Pederiva et al., 2023) | 480×480 | 4 ConvNeXt blocks* | [512,256,256,128] | 15×15×128 (BEV grid) | 5 heads: class/box/yaw/score | L2, IoU, angle |
| Dehazing (Zhou et al., 2023) | H×W | [3,3,9] | [96,192,384] | Variable spatial maps | U-Net upsampling, per-pixel blend | L1, perceptual, MS-SSIM |
* After MobileNetV2 backbone.
References
- ForensicFlow (Romani, 18 Nov 2025)
- Noise and Edge dual-branch (Dagar et al., 2 Sep 2024)
- EVCC (Hasan et al., 24 Nov 2025)
- MonoNext (Pederiva et al., 2023)
- Breaking Through the Haze (Zhou et al., 2023)
The ConvNeXt-tiny-based RGB branch provides a robust, extensible foundation for integrating global spatial-semantic information into vision models. Through carefully balanced architectural hyperparameters and task-adaptive fusion strategies, its variants demonstrate strong empirical performance across detection, localization, recognition, and restoration tasks.