DMG-YOLO: Dual-branch & Multi-scale Detector

Updated 28 November 2025
  • DMG-YOLO denotes two distinct frameworks that use motion-guided and dual-branch architectures to detect tiny, low-contrast objects.
  • It integrates custom modules such as MFEM, BFM, and GLAFPN, combining global and local feature cues with adaptive multi-scale fusion.
  • Empirical evaluations show significant AP improvements and parameter efficiency over standard YOLO variants in drone and remote sensing applications.

DMG-YOLO (“Dual-branch and Multi-scale Guided YOLO”) refers to a series of lightweight, real-time object detectors with custom backbone and feature-fusion designs tailored for dense small-object detection. Notably, the term “DMG-YOLO” denotes two distinct frameworks in the literature: (1) the Difference-Map-Guided YOLO (also called YOLOMG) for drone-to-drone detection using pixel-level motion fusion (Guo et al., 10 Mar 2025), and (2) a lightweight detector for remote sensing images employing dual-branch and multi-scale modules (Wang et al., 21 Nov 2025). Both architectures are characterized by modules that target the discovery of small, low-contrast, or highly occluded objects and introduce architectural innovations that extend the standard YOLO family. The following sections review the main variants, module designs, training strategies, and empirical results from these works.

1. Architectural Innovations

1.1 Difference-Map-Guided YOLO (YOLOMG)

YOLOMG (Guo et al., 10 Mar 2025) integrates a three-stage pipeline (a high-level sketch follows the list):

  • Motion Feature Enhancement Module (MFEM): Generates a motion difference map ($M_t$) by aligning three consecutive grayscale frames through estimated homographies, followed by pixelwise differencing and noise filtering.
  • Bimodal Fusion Module (BFM): Extracts parallel representations from RGB and motion maps via shallow CNNs, fuses them adaptively using a learned attention weight $\alpha$ and CBAM (Convolutional Block Attention Module).
  • Enhanced YOLOv5 Backbone & Neck: A lightweight, pruned CSP-based backbone feeds into a modified Feature Pyramid Network (FPN) with an additional small-object head that preserves high spatial resolution. Four detection heads (P2–P5) process multi-scale features.
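
The list above corresponds to a simple data flow: three grayscale frames yield a motion map, which is fused with the RGB frame before entering the detector. A minimal structural sketch in PyTorch is shown below; the class name and the submodules passed into it are illustrative placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a YOLOMG-style pipeline; names are placeholders,
# not the authors' code. Submodules are supplied as plain nn.Module objects.
class YOLOMGPipeline(nn.Module):
    def __init__(self, motion_module, fusion_module, backbone_neck, heads):
        super().__init__()
        self.motion_module = motion_module   # MFEM: grayscale frames -> motion map M_t
        self.fusion_module = fusion_module   # BFM: (RGB features, M_t) -> fused features
        self.backbone_neck = backbone_neck   # pruned CSP backbone + modified FPN neck
        self.heads = heads                   # four detection heads (P2-P5)

    def forward(self, rgb_t, gray_prev, gray_t, gray_next):
        # 1) Pixel-level motion guidance from three aligned grayscale frames.
        motion_map = self.motion_module(gray_prev, gray_t, gray_next)  # (B, 1, H, W)
        # 2) Adaptive fusion of appearance and motion cues.
        fused = self.fusion_module(rgb_t, motion_map)
        # 3) Multi-scale features feed the P2-P5 detection heads.
        pyramid = self.backbone_neck(fused)
        return [head(level) for head, level in zip(self.heads, pyramid)]
```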

1.2 Lightweight DMG-YOLO for Remote Sensing

The DMG-YOLO of (Wang et al., 21 Nov 2025) is structured into:

  • Backbone with Dual-branch Feature Extraction (DFE): Channel-split input passes through parallel branches, a local path built from depthwise convolutions and a global path based on a gated CGLU-ViT block, whose outputs are concatenated and fused by a 1×1 convolution (detailed in Section 2.3).
  • Multi-scale Feature Fusion (MFF): Dilated convolutions at rates 1, 3, 5 enable a large effective receptive field with fine spatial detail, replacing conventional spatial-pooling modules.
  • Global and Local Aggregate Feature Pyramid Network (GLAFPN): Introduces Global-Local Feature Fusion (GLFF) modules at each pyramid scale, combining local and global spatial attention (GLSA) to strengthen small-object cues.

2. Core Module Designs

2.1 Motion Difference Map in YOLOMG

For pixel-level motion guidance, three temporally adjacent frames $I_{t-k}, I_t, I_{t+k}$ are spatially registered via homography transformations $H$ and $H'$, estimated from LK keypoint tracks and RANSAC:

  • Compensated frames: $\hat{I}_{t-k} = H I_{t-k}$, $\hat{I}_{t+k} = H' I_{t+k}$
  • Difference map: $E_t = \frac{|I_t - \hat{I}_{t-k}| + |I_t - \hat{I}_{t+k}|}{2} = M_t$

Morphological open/close operations remove noise. The resulting $M_t \in \mathbb{R}^{1 \times H \times W}$ is passed to the fusion module.
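
A minimal sketch of this registration-and-differencing step using OpenCV is shown below, assuming grayscale uint8 frames; the corner detector, RANSAC threshold, and morphological kernel size are illustrative choices rather than the paper's exact settings.

```python
import cv2
import numpy as np

def motion_difference_map(prev_gray, cur_gray, next_gray, kernel_size=3):
    """Align neighbouring frames to the current frame and return a filtered difference map.

    Sketch of the MFEM idea: homographies estimated from tracked keypoints
    (LK optical flow + RANSAC) compensate camera motion before pixelwise differencing.
    """
    def warp_to_current(src):
        # Track sparse corners from the source frame into the current frame.
        pts_src = cv2.goodFeaturesToTrack(src, 500, 0.01, 7)
        pts_cur, status, _ = cv2.calcOpticalFlowPyrLK(src, cur_gray, pts_src, None)
        good_src = pts_src[status.ravel() == 1]
        good_cur = pts_cur[status.ravel() == 1]
        # Homography that maps the source frame onto the current frame.
        H, _ = cv2.findHomography(good_src, good_cur, cv2.RANSAC, 3.0)
        h, w = cur_gray.shape
        return cv2.warpPerspective(src, H, (w, h))

    aligned_prev = warp_to_current(prev_gray)
    aligned_next = warp_to_current(next_gray)

    # E_t = ( |I_t - I_hat_{t-k}| + |I_t - I_hat_{t+k}| ) / 2
    diff = (cv2.absdiff(cur_gray, aligned_prev).astype(np.float32)
            + cv2.absdiff(cur_gray, aligned_next).astype(np.float32)) / 2.0
    diff = diff.astype(np.uint8)

    # Morphological open/close to suppress registration noise.
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    diff = cv2.morphologyEx(diff, cv2.MORPH_OPEN, kernel)
    diff = cv2.morphologyEx(diff, cv2.MORPH_CLOSE, kernel)
    return diff  # H x W map; add a channel axis to obtain M_t in R^{1 x H x W}
```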

2.2 Bimodal Fusion with Adaptive Weighting and CBAM

Feature extraction yields $F_{RGB}$ and $F_M$; fusion proceeds as follows (a PyTorch sketch follows the list):

  • Adaptive weights: $W = \mathrm{Conv}_{1 \times 1}([F_{RGB}; F_M])$, $\alpha = \sigma(W)$
  • Blending: $F_{mix} = \alpha \odot F_{RGB} + (1-\alpha) \odot F_M$
  • Channel and spatial attention: CBAM refines $F_{mix} \rightarrow F_{fused}$ with:
    • Channel attention: $M_c(F_{mix}) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F_{mix})) + \mathrm{MLP}(\mathrm{MaxPool}(F_{mix})))$, giving channel-refined features $F' = M_c(F_{mix}) \odot F_{mix}$
    • Spatial attention: $M_s(F') = \sigma(\mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}_c(F'), \mathrm{MaxPool}_c(F')]))$
    • Output: $F_{fused} = M_s(F') \odot F'$
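
A compact PyTorch sketch of this adaptive blending followed by CBAM-style attention is given below, assuming both feature maps share the same channel count; the reduction ratio and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    """Sketch of BFM-style fusion: learned blend of RGB/motion features + CBAM."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Adaptive weight alpha from the concatenated features.
        self.alpha_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)
        # CBAM channel attention (shared MLP over avg- and max-pooled vectors).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # CBAM spatial attention (7x7 conv over channel-wise avg/max maps).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_rgb, f_motion):
        alpha = torch.sigmoid(self.alpha_conv(torch.cat([f_rgb, f_motion], dim=1)))
        f_mix = alpha * f_rgb + (1 - alpha) * f_motion

        # Channel attention: M_c = sigma(MLP(AvgPool) + MLP(MaxPool)).
        avg = torch.mean(f_mix, dim=(2, 3), keepdim=True)
        mx = torch.amax(f_mix, dim=(2, 3), keepdim=True)
        f_chan = torch.sigmoid(self.mlp(avg) + self.mlp(mx)) * f_mix

        # Spatial attention: M_s = sigma(Conv7x7([AvgPool_c, MaxPool_c])).
        avg_c = torch.mean(f_chan, dim=1, keepdim=True)
        max_c = torch.amax(f_chan, dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial(torch.cat([avg_c, max_c], dim=1)))
        return m_s * f_chan

# Example: fuse 64-channel RGB and motion feature maps.
# fused = BimodalFusion(64)(torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80))
```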

2.3 DFE and MFF Modules in Remote Sensing DMG-YOLO

  • DFE module (sketched in code after this list): channel split into local and global paths, with:
    • Local path: $F_{loc\_out} = F_{local} + \mathrm{DWConv}_2(\mathrm{DWConv}_1(F_{local}))$
    • Global path via CGLU-ViT: $U = W_u X$, $V = W_v X$, $V' = \mathrm{DWConv}_{3 \times 3}(\mathrm{ReLU}(V))$, $\mathrm{Gate} = \sigma(V')$, $\mathrm{Output} = U \odot \mathrm{Gate} + U$
    • Concatenation followed by a 1×1 convolution fuses the two branches.
  • MFF module: For input $F_{in}$, apply:
    • $F' = \mathrm{Conv}_{1 \times 1}(F_{in})$
    • Dilated stack: $D_1 = \mathrm{Conv}_{3 \times 3}^{d=1}(F')$, $D_3 = \mathrm{Conv}_{3 \times 3}^{d=3}(D_1)$, $D_5 = \mathrm{Conv}_{3 \times 3}^{d=5}(D_3)$
    • $F_{res} = F' + D_5$; output $F_{out} = \mathrm{Conv}_{1 \times 1}(F_{res})$
  • GLAFPN: For each feature $F_i$, the channels are split and processed via two GLSA modules per branch, then recombined:
    • Global spatial attention: $Att_G(F_1) = \mathrm{Softmax}((\mathrm{Conv}_{1 \times 1}(F_1))^T)$, $GSA(F_1) = \mathrm{MLP}(Att_G(F_1) \cdot F_1) + F_1$
    • Local spatial attention analogously forms $LSA(F_2)$
    • Final aggregation via 1×1 convolution.
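
The following PyTorch sketch illustrates the DFE channel split and the MFF dilated stack described above; the split ratio, channel widths, and gating projections are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DFEBlock(nn.Module):
    """Sketch of dual-branch feature extraction: local DWConv path + gated global path."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2  # channel split between the two branches (assumed 50/50)
        # Local branch: residual stack of depthwise convolutions.
        self.dw1 = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.dw2 = nn.Conv2d(c, c, 3, padding=1, groups=c)
        # Global branch: CGLU-style gate (U = W_u X, V = W_v X, Gate = sigma(DWConv(ReLU(V)))).
        self.wu = nn.Conv2d(c, c, 1)
        self.wv = nn.Conv2d(c, c, 1)
        self.gate_dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        # Fuse the concatenated branches with a 1x1 convolution.
        self.fuse = nn.Conv2d(2 * c, channels, 1)

    def forward(self, x):
        f_local, f_global = torch.chunk(x, 2, dim=1)
        # F_loc_out = F_local + DWConv2(DWConv1(F_local))
        loc = f_local + self.dw2(self.dw1(f_local))
        # Gated global path: Output = U * Gate + U
        u, v = self.wu(f_global), self.wv(f_global)
        gate = torch.sigmoid(self.gate_dw(torch.relu(v)))
        glob = u * gate + u
        return self.fuse(torch.cat([loc, glob], dim=1))

class MFFBlock(nn.Module):
    """Sketch of multi-scale feature fusion: stacked dilated 3x3 convolutions (rates 1/3/5)."""
    def __init__(self, in_channels, mid_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, mid_channels, 1)
        self.d1 = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, dilation=1)
        self.d3 = nn.Conv2d(mid_channels, mid_channels, 3, padding=3, dilation=3)
        self.d5 = nn.Conv2d(mid_channels, mid_channels, 3, padding=5, dilation=5)
        self.project = nn.Conv2d(mid_channels, in_channels, 1)

    def forward(self, x):
        f = self.reduce(x)                 # F' = Conv1x1(F_in)
        d = self.d5(self.d3(self.d1(f)))   # stacked dilations widen the receptive field
        return self.project(f + d)         # residual connection, then 1x1 projection

# Example shapes: DFEBlock(64) and MFFBlock(64, 32) applied to a (1, 64, 80, 80) tensor.
```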

3. Training Protocols and Implementation

3.1 Dataset and Input

  • YOLOMG: Trained and evaluated on ARD100 (100 videos, 202,467 frames, average object area below 0.01% of the frame) and NPS-Drones (50 videos, 70,250 frames), with Drone-vs-Bird and low-light sequences used for additional analysis (Guo et al., 10 Mar 2025).
  • Remote Sensing DMG-YOLO: VisDrone2019 and NWPU VHR-10 with an 8:1:1 train/val/test split; input resolution 640×640 (Wang et al., 21 Nov 2025).

3.2 Augmentation and Optimization

  • YOLOMG: Standard YOLOv5 augmentations (mosaic, mixup, HSV jitter, flip), Adam optimizer (lr = 0.01, momentum 0.937), 100 epochs, batch size 8, MS-COCO pretraining.
  • Remote Sensing DMG-YOLO: Standard YOLOv8 augmentations, SGD (lr = 0.001), 200 epochs trained to convergence, PyTorch 2.0, CUDA 12.0, RTX 4090 (a hedged configuration sketch follows).
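
For orientation only, the snippet below shows how comparable baseline settings could be reproduced with the stock Ultralytics YOLOv8 trainer; the dataset YAML path and batch size are placeholders, and the DMG-YOLO modules themselves are not part of the standard model zoo, so this reproduces the training recipe, not the architecture.

```python
# Hedged sketch: baseline YOLOv8n training with settings comparable to those
# reported for the remote-sensing DMG-YOLO (SGD, lr 0.001, 200 epochs, 640x640).
# "visdrone.yaml" is a placeholder dataset config, not a file shipped with the paper.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # stock baseline; DMG-YOLO modules would need custom registration
model.train(
    data="visdrone.yaml",        # placeholder path to a VisDrone2019 dataset config
    imgsz=640,
    epochs=200,
    batch=16,                    # assumption: batch size is not stated in this summary
    optimizer="SGD",
    lr0=0.001,
    device=0,                    # e.g. a single RTX 4090
)
```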

4. Empirical Performance and Ablations

4.1 Quantitative Results

  • YOLOMG:
    • ARD100 ($640 \times 640$): YOLOv5 baseline AP$_{50}$ = 0.53 vs. YOLOMG-640 AP$_{50}$ = 0.78 (+25 pp); precision 0.83, recall 0.71, 133 FPS. YOLOMG-1280: AP$_{50}$ = 0.85, precision 0.90, recall 0.74, 35 FPS.
    • NPS-Drones ($1280 \times 1280$): YOLOv5s AP$_{50}$ = 0.93, YOLOMG-1280 AP$_{50}$ = 0.95, roughly on par with TransVisDrone.
    • Substantial AP gain for extremely small, low-contrast targets in complex scenes.
  • Remote Sensing DMG-YOLO:
    • VisDrone2019: DMG-YOLO reaches mAP$_{50}$ = 38.8% with 2.1 M parameters and 12.4 G FLOPs, versus 34.2% for YOLOv11n and 39.6% for YOLOv11s (which uses 9.4 M parameters) (Wang et al., 21 Nov 2025).
    • NWPU VHR-10: DMG-YOLO mAP$_{50}$ = 92.4%, YOLOv11n = 83.8%, YOLOv11s = 91.7%.
    • Ablation (VisDrone2019): baseline 33.6%, +DFE 34.4%, +MFF 35.0%, full model with GLAFPN 38.8%; DFE improved mAP while also reducing the parameter count.
| Model     | Params (M) | FLOPs (G) | VisDrone mAP₅₀ (%) |
|-----------|------------|-----------|--------------------|
| YOLOv8n   | 3.0        | 8.2       | 33.6               |
| YOLOv11n  | 2.6        | 8.6       | 34.2               |
| DMG-YOLO  | 2.1        | 12.4      | 38.8               |
| YOLOv11s  | 9.4        | 30.0+     | 39.6               |
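
Parameter and FLOP figures of the kind tabulated above are commonly measured with a profiler such as thop. The sketch below uses a small stand-in network, since no packaged DMG-YOLO weights accompany this summary; note that thop reports multiply-accumulates, which papers convert to FLOPs under varying conventions.

```python
# Generic sketch of how Params (M) / FLOPs (G) table entries are typically measured.
import torch
import torch.nn as nn
from thop import profile

model = nn.Sequential(               # stand-in network; replace with the detector under test
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
)
dummy = torch.randn(1, 3, 640, 640)  # 640x640 input, matching the evaluation resolution
macs, params = profile(model, inputs=(dummy,))
print(f"Params: {params / 1e6:.2f} M, MACs: {macs / 1e9:.2f} G")
```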

4.2 Qualitative and Failure Case Analysis

  • YOLOMG: Successfully detects tiny, low-contrast drones in cluttered urban, low-light, and drone-vs-bird scenes. The main failure modes are hovering or slow-moving targets (weak motion cues) and false positives on birds and cars with similar appearance and motion (Guo et al., 10 Mar 2025).
  • Remote Sensing DMG-YOLO: Shows improved detection of small and densely packed objects compared to YOLOv8n (Wang et al., 21 Nov 2025).

5. Comparative Analysis and Significance

Both DMG-YOLO frameworks introduce mechanisms, motion-cue fusion in one case and global-local feature aggregation in the other, designed specifically for small-object detection, an enduring challenge for single-stage detectors. In YOLOMG, integrating a motion-guided difference map produced class-leading improvements for drone surveillance in unconstrained conditions. The remote sensing DMG-YOLO instead leverages transformer-based context and multi-scale fusion (without motion cues) to raise mAP for small targets at minimal parameter cost.

A plausible implication is that both motion priors (temporal, pixel-level guidance) and multi-branch global-local aggregation can be synergistically combined for further improvement in complex environments characterized by real-time constraints and small-object prevalence.

6. Limitations and Open Challenges

  • YOLOMG: Susceptible to failure when motion cues are weak (e.g., hovering or slow-moving objects) or when background motion mimics the appearance and motion of true targets; false positives increase on non-drone moving objects in cluttered scenes.
  • Remote Sensing DMG-YOLO: The introduction of transformer-style and multi-branch modules increases FLOPs (from 8.2 G for YOLOv8n to 12.4 G for DMG-YOLO); inference speed (FPS) and model size in MB are not reported (Wang et al., 21 Nov 2025).
  • Precision and recall metrics are not universally provided, limiting head-to-head comparison beyond mAP$_{50}$.

7. Future Directions

Emerging directions include: (1) joint use of external priors (e.g., radar/thermal/motion) and learned visual features, (2) low-rank or quantized variants for resource-constrained platforms, (3) targeted robustness to camouflaged or ultra-tiny objects, and (4) more diverse multi-modal fusion architectures, combining the strengths of both motion-guided and transformer-fused pathways as exemplified in the two DMG-YOLO variants. Further benchmarking on real-world, diverse, and adversarial datasets—in both drone and remote-sensing domains—remains necessary to fully map the strengths and systemic limitations of these architectures.


For further details, see “YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion” (Guo et al., 10 Mar 2025) and “A lightweight detector for real-time detection of remote sensing images” (Wang et al., 21 Nov 2025).
