DMG-YOLO: Dual-branch & Multi-scale Detector

Updated 28 November 2025

DMG-YOLO comprises two distinct frameworks that use motion-guided and dual-branch architectures to detect tiny, low-contrast objects.
It integrates custom modules such as MFEM, BFM, and GLAFPN, combining global and local feature cues with adaptive multi-scale fusion.
Empirical evaluations show significant AP improvements and parameter efficiency over standard YOLO variants in drone and remote sensing applications.

DMG-YOLO (“Dual-branch and Multi-scale Guided YOLO”) refers to a series of lightweight, real-time object detectors with custom backbone and feature-fusion designs tailored for dense small-object detection. Notably, the term “DMG-YOLO” denotes two distinct frameworks in the literature: (1) the Difference-Map-Guided YOLO (also called YOLOMG) for drone-to-drone detection using pixel-level motion fusion (Guo et al., 10 Mar 2025), and (2) a lightweight detector for remote sensing images employing dual-branch and multi-scale modules (Wang et al., 21 Nov 2025). Both architectures are characterized by modules that target the discovery of small, low-contrast, or highly occluded objects and introduce architectural innovations that extend the standard YOLO family. The following sections review the main variants, module designs, training strategies, and empirical results from these works.

1. Architectural Innovations

1.1 Difference-Map-Guided YOLO (YOLOMG)

YOLOMG (Guo et al., 10 Mar 2025) integrates a three-stage pipeline:

Motion Feature Enhancement Module (MFEM): Generates a motion difference map ( $M_t$ ) by aligning three consecutive grayscale frames through estimated homographies, followed by pixelwise differencing and noise filtering.
Bimodal Fusion Module (BFM): Extracts parallel representations from RGB and motion maps via shallow CNNs, fuses them adaptively using a learned attention weight $\alpha$ and CBAM (Convolutional Block Attention Module).
Enhanced YOLOv5 Backbone & Neck: A lightweight, pruned CSP-based backbone feeds into a modified Feature Pyramid Network (FPN) with an additional small-object head that preserves high spatial resolution. Four detection heads (P2–P5) process multi-scale features.

1.2 Lightweight DMG-YOLO for Remote Sensing

The DMG-YOLO of (Wang et al., 21 Nov 2025) is structured into:

Backbone with Dual-branch Feature Extraction (DFE): Channel-split input passes through parallel branches:
- Local branch: two depthwise separable convolutions with residual connections.
- Global branch: a lightweight vision transformer with convolutional gated linear units (CGLU-ViT) for global context.
- Channel concatenation and 1×1 convolution recombine features.
Multi-scale Feature Fusion (MFF): Dilated convolutions at rates 1, 3, 5 enable a large effective receptive field with fine spatial detail, replacing conventional spatial-pooling modules.
Global and Local Aggregate Feature Pyramid Network (GLAFPN): Introduces Global-Local Feature Fusion (GLFF) modules at each pyramid scale, combining local and global spatial attention (GLSA) to strengthen small-object cues.

2. Core Module Designs

2.1 Motion Difference Map in YOLOMG

For pixel-level motion guidance, three temporally adjacent frames $I_{t-k}, I_t, I_{t+k}$ are spatially registered via homography transformations $H$ and $H'$ , estimated using LK keypoints and RANSAC:

Compensated frames: $\hat{I}_{t-k} = H I_{t-k}$ , $\hat{I}_{t+k} = H' I_{t+k}$
Difference map: $E_t = \frac{ |I_t - \hat{I}_{t-k}| + |I_t - \hat{I}_{t+k}| }{2 } = M_t$

Morphological open/close operations remove noise. The resulting $M_t \in \mathbb{R}^{1 \times H \times W}$ is processed in the fusion module.

2.2 Bimodal Fusion with Adaptive Weighting and CBAM

Feature extraction yields $F_{RGB}$ and $F_{M}$ ; fusion proceeds via:

Adaptive weights: $W = \mathrm{Conv}_{1 \times 1}( [F_{RGB}; F_M] )$ , $\alpha = \sigma(W)$
Blending: $F_{mix} = \alpha \odot F_{RGB} + (1-\alpha) \odot F_M$
Channel and spatial attention: CBAM refines $F_{mix} \rightarrow F_{fused}$ $F_{mi x} \to F_{f u se d}$ with:
- Channel attention: $M_c(F_{mix}) = \sigma( \mathrm{MLP}( \mathrm{AvgPool}(F_{mix}) ) + \mathrm{MLP}( \mathrm{MaxPool}(F_{mix}) ) )$
- Spatial attention: $M_s(F') = \sigma( \mathrm{Conv}_{7 \times 7}( [ \mathrm{AvgPool}_c(F'), \mathrm{MaxPool}_c(F') ] ) )$
- Output: $F_{fused} = M_s(F') \odot F'$

2.3 DFE and MFF Modules in Remote Sensing DMG-YOLO

DFE module: Channel split, with:
- $F_{loc\_out} = F_{local} + \mathrm{DWConv}_2( \mathrm{DWConv}_1(F_{local}) )$
- Global path via CGLU-ViT: $U = W_u X$ , $V = W_v X$ , $V' = \mathrm{DWConv}_{3 \times 3}( \mathrm{ReLU}(V) )$ , $\mathrm{Gate} = \sigma(V')$ , $\mathrm{Output} = U \odot \mathrm{Gate} + U$
- Concat and 1×1 conv fuses features.
MFF module: For $F_{in}$ $F_{in}$ , apply:
- $F' = \mathrm{Conv}_{1 \times 1}(F_{in})$
- Dilated stack: $D_1 = \mathrm{Conv}_{3 \times 3}^{d=1}(F')$ , $D_3 = \mathrm{Conv}_{3 \times 3}^{d=3}(D_1)$ , $D_5 = \mathrm{Conv}_{3 \times 3}^{d=5}(D_3)$
- $F_{res} = F' + D_5$ ; output $F_{out} = \mathrm{Conv}_{1 \times 1}(F_{res})$
GLAFPN: For each feature $F_i$ $F_{i}$ , channels split and processed via two GLSA modules per branch, then recombined:
- Global spatial attention: $Att_G(F_1) = \mathrm{Softmax}((\mathrm{Conv}_{1 \times 1}(F_1))^T)$ , $GSA(F_1) = \mathrm{MLP}(Att_G(F_1) \cdot F_1) + F_1$
- Local spatial attention analogously forms $LSA(F_2)$
- Final aggregation via 1×1 convolution.

3. Training Protocols and Implementation

3.1 Dataset and Input

YOLOMG: Trained and evaluated on ARD100 (“100 videos, 202,467 frames, average object $< 0.01\%$ frame area”), NPS-Drones (50 videos, 70,250 frames), also Drone-vs-Bird and low-light sequences for analysis (Guo et al., 10 Mar 2025).
Remote Sensing DMG-YOLO: VisDrone2019 and NWPU VHR-10, with “8:1:1” train/val/test; resolution 640×640 (Wang et al., 21 Nov 2025).

3.2 Augmentation and Optimization

YOLOMG: Standard YOLOv5 augments (mosaic, mixup, HSV jitter, flip), Adam ( $lr=0.01$ , momentum $0.937$), 100 epochs, batch size 8, MS-COCO pretraining.
Remote Sensing: Standard YOLOv8 augmentations, SGD ( $lr=0.001$ ), 200 epochs, convergence observed, PyTorch 2.0, CUDA 12.0, RTX 4090.

4. Empirical Performance and Ablations

4.1 Quantitative Results

YOLOMG:
- ARD100 ( $640 \times 640$ ): YOLOv5 baseline AP $_{50}$ = 0.53, YOLOMG-640 AP $_{50}$ = 0.78 ( $+$ 25 pp); precision 0.83, recall 0.71, FPS 133. YOLOMG-1280 AP $_{50}$ = 0.85, P 0.90, R 0.74, FPS 35.
- NPS-Drones ( $1280 \times 1280$ ): YOLOv5s AP $_{50}$ = 0.93, YOLOMG-1280 AP $_{50}$ = 0.95. Parity with TransVisDrone 34.
- Substantial AP gain for extremely small, low-contrast targets in complex scenes.
Remote Sensing DMG-YOLO:
- VisDrone2019: DMG-YOLO mAP $_{50}$ = 38.8%, YOLOv11n = 34.2%, YOLOv11s = 39.6% (with 9.4 M vs 2.1 M params). Model FLOPs = 12.4G (Wang et al., 21 Nov 2025).
- NWPU VHR-10: DMG-YOLO mAP $_{50}$ = 92.4%, YOLOv11n = 83.8%, YOLOv11s = 91.7%.
- Ablation (VisDrone2019): Baseline 33.6%, +DFE 34.4%, +MFF 35.0%, full +GLAFPN 38.8%; DFE enabled both mAP and parameter reduction.

Model	Params (M)	FLOPs (G)	VisDrone mAP₅₀ (%)
YOLOv8n	3.0	8.2	33.6
YOLOv11n	2.6	8.6	34.2
DMG-YOLO	2.1	12.4	38.8
YOLOv11s	9.4	30.0+	39.6

4.2 Qualitative and Failure Case Analysis

YOLOMG: Successfully detects tiny and low-contrast drones under clutter, including urban, low-light, and drone-vs-bird domains. Major failure points—hovering/slow targets (weak motion cues), and false positives on birds/cars with similar appearance and motion (Guo et al., 10 Mar 2025).
Remote Sensing DMG-YOLO: Shows improved detection of small and densely packed objects compared to YOLOv8n (Wang et al., 21 Nov 2025).

5. Comparative Analysis and Significance

Both DMG-YOLO frameworks introduce mechanisms—to fuse motion cues or aggregate global-local features—specially designed to address small-object detection, an enduring challenge for single-stage detectors. In YOLOMG, integration of a motion-guided difference map produced class-leading improvements for drone surveillance in unconstrained conditions. The remote sensing DMG-YOLO leverages transformer-based context and multi-scale fusion (without motion cues) to elevate mAP for small targets at minimal parameter cost.

A plausible implication is that both motion priors (temporal, pixel-level guidance) and multi-branch global-local aggregation can be synergistically combined for further improvement in complex environments characterized by real-time constraints and small-object prevalence.

6. Limitations and Open Challenges

YOLOMG: Susceptible to failure when motion cues are weak (e.g., hovering objects or slow movement), or where background motion mimics the appearance/motion of true targets. FP increase on non-drone moving objects in cluttered scenes.
Remote Sensing DMG-YOLO: The introduction of transformers and multi-branch modules increases FLOPs (from 8.2 G for YOLOv8n to 12.4 G for DMG-YOLO-n); model speed (FPS) and storage size in MB are not reported (Wang et al., 21 Nov 2025).
Precision and recall metrics are not universally provided, limiting head-to-head comparison beyond mAP $_{50}$ .

7. Future Directions

Emerging directions include: (1) joint use of external priors (e.g., radar/thermal/motion) and learned visual features, (2) low-rank or quantized variants for resource-constrained platforms, (3) targeted robustness to camouflaged or ultra-tiny objects, and (4) more diverse multi-modal fusion architectures, combining the strengths of both motion-guided and transformer-fused pathways as exemplified in the two DMG-YOLO variants. Further benchmarking on real-world, diverse, and adversarial datasets—in both drone and remote-sensing domains—remains necessary to fully map the strengths and systemic limitations of these architectures.

For further details, see “YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion” (Guo et al., 10 Mar 2025) and “A lightweight detector for real-time detection of remote sensing images” (Wang et al., 21 Nov 2025).

PDF Markdown Chat (Pro)

References (2)

YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion (2025)

A lightweight detector for real-time detection of remote sensing images (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to DMG-YOLO.