ScaleBridge-Det: Scale-Aware Detection Framework
- ScaleBridge-Det is a detection framework for remote sensing imagery that employs scale-adaptive expert fusion and density-aware query allocation to balance tiny and general object detection.
- It integrates a Routing-Enhanced Mixture Attention module that fuses outputs from scale-specialist experts using category-aware routing to enhance multi-scale feature representation.
- The framework incorporates a Density-Guided Dynamic Query system that dynamically allocates queries based on scene density, achieving state-of-the-art performance in high-density scenarios.
ScaleBridge-Det is a large-scale detection framework specifically designed to address the challenge of balanced tiny and general object detection in remote sensing imagery. Developed as the first detection system of its scale and scope with explicit mechanisms for scale-adaptive expert fusion and density-aware query allocation, ScaleBridge-Det achieves state-of-the-art results in extremely high-density scenarios and under severe scale variation (Zhao et al., 1 Dec 2025). Its architecture introduces a Routing-Enhanced Mixture Attention (REM) module for multi-scale expert fusion and a Density-Guided Dynamic Query (DGQ) system for dynamic query allocation, thereby resolving the fundamental scale gap present in conventional object detectors.
1. Architectural Overview
ScaleBridge-Det follows an encoder–decoder paradigm, augmented beyond canonical DETR-style detectors by two pivotal components: (1) a Mixture-of-Experts (MoE) backbone with routing-and-fusion via REM, and (2) a DGQ mechanism that allocates object queries dynamically based on estimated scene density. The processing flow is as follows:
- An input image is passed through a base backbone (typically a shallow CNN), producing an initial feature map $F_0$.
- REM leverages a collection of frozen experts (ResNet, ViT, Swin) categorized by scale specialization, and adaptively routes, weights, and fuses their outputs into multi-scale feature pyramids $\{F_l\}$.
- DGQ predicts a fine-grained pixel-wise density map, informing both the number and placement of DETR-style object queries.
- These queries are processed by the decoder to yield class logits and boxes.
The ensemble is trained end-to-end, though only REM’s routing/gating logic, the DGQ density predictor, and the decoder are learned during fine-tuning; the expert backbones remain frozen in the later training stages (Zhao et al., 1 Dec 2025).
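As a minimal sketch of this processing flow (the module internals, shapes, and thresholds below are illustrative stand-ins, not the authors’ implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def base_backbone(image):
    # Shallow-CNN stand-in: downsample to a coarse feature map.
    return image[::4, ::4]

def rem_fuse(feat):
    # Stand-in for REM: pretend two frozen experts produce features
    # that are gated and blended (the real module uses routing + MHA).
    expert_a, expert_b = feat * 1.0, feat * 0.5
    gate = 0.6
    return gate * expert_a + (1 - gate) * expert_b

def dgq_allocate(feat, q_min=100, q_max=900):
    # Stand-in density head: mean activation magnitude as a density
    # proxy, mapped to a query budget clamped to [q_min, q_max].
    density = float(np.abs(feat).mean())
    return int(np.clip(q_min + 1000 * density, q_min, q_max))

image = rng.standard_normal((256, 256))
feat = base_backbone(image)       # base feature map
fused = rem_fuse(feat)            # REM-fused representation
n_queries = dgq_allocate(fused)   # density-adaptive query count
```

The decoder step is omitted; the point is the ordering of the stages and that the query budget is derived from the fused features rather than fixed a priori.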
2. Routing-Enhanced Mixture Attention (REM)
Motivation
Remote sensing scenes contain a wide range of object scales, from minute vehicles barely spanning a few pixels to expansive ground structures. Single-backbone architectures invariably favor a dominant scale regime, biasing feature representations and degrading performance on either tiny or large objects. REM circumvents this by:
- Maintaining specialized experts (e.g., ResNet for local/tiny details, ViT/Swin for large/global context) pre-trained on suitable datasets.
- Implementing multi-level routing to ensure that at least one expert from each scale-specialist category always influences the fused representation, by enforcing a top-1 selection per category.
Technical Formulation
Let $F_0$ denote the base feature map. Routing weights are produced by an MLP over globally pooled features,
$$r = \mathrm{Softmax}\bigl(\mathrm{MLP}(\mathrm{GAP}(F_0))\bigr) \in \mathbb{R}^{N},$$
with a top-1 selection within each category, $e_c^{*} = \arg\max_{i \in \mathcal{E}_c} r_i$, where $N$ is the expert count and $c$ indexes the expert categories (“tiny-specialist”, “general-specialist”, “mixed”).
Each selected expert $e$ generates a set of FPN-style features $\{F_l^{e}\}$, normalized as $\tilde{F}_l^{e} = \gamma_e \, F_l^{e} / \lVert F_l^{e} \rVert_2$ (where $\gamma_e$ is a learned scale). Level-wise gates $g_l$ (a Softmax over GAP-concatenated pyramids) balance relative contributions, $F_l = \sum_{e} g_l^{e} \, \tilde{F}_l^{e}$; a multi-head attention (MHA) block then integrates the level-wise fused features.
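The category-wise top-1 routing described above can be sketched as follows (a minimal NumPy illustration; the expert scores and category labels are hypothetical):

```python
import numpy as np

def category_top1_routing(logits, categories):
    """Select the highest-scoring expert within each category.

    logits: (N,) routing scores for N experts.
    categories: length-N list with a category label per expert.
    Returns a 0/1 mask picking exactly one expert per category,
    plus softmax weights renormalized over the selected experts.
    """
    logits = np.asarray(logits, dtype=float)
    mask = np.zeros_like(logits)
    for cat in set(categories):
        idx = [i for i, c in enumerate(categories) if c == cat]
        best = idx[int(np.argmax(logits[idx]))]
        mask[best] = 1.0
    # Softmax over selected experts only; unselected get zero weight.
    sel = np.where(mask > 0, logits, -np.inf)
    w = np.exp(sel - sel[mask > 0].max())
    w = np.where(mask > 0, w, 0.0)
    return mask, w / w.sum()

mask, weights = category_top1_routing(
    [2.0, 0.5, 1.0, 3.0, -1.0],
    ["tiny", "tiny", "general", "general", "mixed"],
)
```

Because every category contributes its best-scoring expert, the fused representation can never collapse onto a single scale regime, which is the property the bullet list above emphasizes.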
Scale Bias Mitigation
The structure guarantees:
- Each scale-specialist expert participates, as enforced by the category-aware routing.
- Learnable normalization and dynamic level gates prevent domination by a single expert.
- Routing and fusion are softmax-based, yielding complementary multi-scale representations.
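The level-wise softmax gating that prevents single-expert domination can be illustrated as follows (expert names and gate logits are placeholder values):

```python
import numpy as np

def level_gated_fusion(pyramids, gate_logits):
    """Fuse per-expert FPN levels with one softmax gate per level.

    pyramids: dict expert_name -> list of L feature maps (matching
              shapes per level).
    gate_logits: (L, E) logits, softmaxed per pyramid level.
    Returns the L fused feature maps and the gate weights.
    """
    experts = list(pyramids)
    gates = np.exp(gate_logits - gate_logits.max(axis=1, keepdims=True))
    gates = gates / gates.sum(axis=1, keepdims=True)
    fused = []
    for l in range(gate_logits.shape[0]):
        fused.append(sum(gates[l, e] * pyramids[name][l]
                         for e, name in enumerate(experts)))
    return fused, gates

pyr = {
    "resnet": [np.ones((8, 8)), np.ones((4, 4))],
    "swin":   [np.zeros((8, 8)), 2 * np.ones((4, 4))],
}
fused, gates = level_gated_fusion(pyr, np.zeros((2, 2)))
```

With equal logits each expert receives weight 0.5 at every level; learned logits would instead shift the balance per level, so a tiny-specialist can dominate high-resolution levels while a global expert dominates coarse ones.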
3. Density-Guided Dynamic Query (DGQ)
Rationale and Mechanism
Uniform and static query allocation in DETR yields capacity imbalance: sparse scenes are over-queried, while dense-tiny regimes run out of queries and neglect many objects. DGQ directly estimates a density map $d(x, y)$ from the highest-resolution backbone feature map, using a small convolutional head. Supervision combines an L2 (MSE) term and a weighted BCE term against a ground-truth density map built by scattering Gaussian kernels (σ = 1.5 px) at annotated object centers.
Total density is obtained by summing the predicted map, $\hat{D} = \sum_{x,y} d(x, y)$.
Queries are allocated via a piecewise rule that maps $\hat{D}$ to a query budget, clamped between minimum and maximum counts. Initial query centers are then sampled in proportion to the predicted density, tightly focusing queries on object-rich regions and adapting capacity to scene complexity.
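A minimal sketch of this mechanism, assuming illustrative query-budget thresholds (the paper’s exact piecewise rule is not reproduced here):

```python
import numpy as np

def gaussian_density_map(centers, shape, sigma=1.5):
    """Ground-truth density: one Gaussian kernel (sigma in pixels)
    scattered at each annotated object center."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape, dtype=float)
    for cy, cx in centers:
        dmap += np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma**2))
    return dmap

def allocate_queries(dmap, q_min=10, q_max=900, per_object=3, sigma=1.5):
    """Map total density to a clamped query budget (thresholds and
    the per-object multiplier are placeholder values)."""
    total = dmap.sum() / (2 * np.pi * sigma**2)  # approx. object count
    return int(np.clip(per_object * total, q_min, q_max))

def sample_query_centers(dmap, n, rng):
    """Sample initial query centers in proportion to density."""
    p = dmap.ravel() / dmap.sum()
    idx = rng.choice(dmap.size, size=n, p=p)
    return np.stack(np.unravel_index(idx, dmap.shape), axis=1)

rng = np.random.default_rng(0)
centers = [(10, 10), (20, 40), (40, 40)]
dmap = gaussian_density_map(centers, (64, 64))
n_q = allocate_queries(dmap)
pts = sample_query_centers(dmap, n_q, rng)
```

Since each normalized Gaussian integrates to roughly $2\pi\sigma^2$, dividing the map sum by that constant recovers an approximate object count, which the piecewise rule then converts into a query budget.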
4. Loss Functions and Training Protocol
The total loss is composed of:
- Standard DETR set-based detection loss:
  - $\mathcal{L}_{\mathrm{cls}}$: cross-entropy or focal loss over matched queries.
  - $\mathcal{L}_{\mathrm{box}}$: L1 loss plus GIoU loss.
- Density-prediction loss $\mathcal{L}_{\mathrm{den}}$: MSE plus a small weighted-BCE term.
- No explicit routing loss: the Softmax with top-1 categorical routing is sufficient for diversity.
Total loss: $\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{box}} \mathcal{L}_{\mathrm{box}} + \lambda_{\mathrm{den}} \mathcal{L}_{\mathrm{den}}$. Expert backbones are rigidly frozen after progressive pre-training, with only fusion, routing, DGQ, and decoder weights trained in the final phase. Standard AdamW optimization and learning-rate warm-up strategies are employed; the density loss is introduced after the initial DETR loss stabilizes (Zhao et al., 1 Dec 2025).
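The loss composition can be sketched as follows; the weighting coefficients and BCE weight are illustrative placeholders, not values from the paper:

```python
import numpy as np

def density_loss(pred, gt, bce_weight=0.1):
    """DGQ supervision: MSE plus a small weighted-BCE term on the
    (clipped) density maps; bce_weight is an assumed value."""
    mse = np.mean((pred - gt) ** 2)
    p = np.clip(pred, 1e-6, 1 - 1e-6)
    t = np.clip(gt, 0.0, 1.0)
    bce = -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
    return mse + bce_weight * bce

def total_loss(l_cls, l_box, l_den, lam_box=5.0, lam_den=1.0):
    """Weighted sum of the detection and density terms."""
    return l_cls + lam_box * l_box + lam_den * l_den

# Example: a perfect density prediction still incurs the BCE floor.
pred = gt = np.full((4, 4), 0.5)
l_den = density_loss(pred, gt)
l_total = total_loss(1.0, 0.2, l_den)
```

Note that the BCE term is minimized but nonzero even when `pred == gt`, so in practice it acts as a small shaping term on top of the MSE objective.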
5. Empirical Results and Comparative Performance
ScaleBridge-Det demonstrates state-of-the-art performance on both benchmark and cross-domain remote sensing datasets:
AI-TOD-v2
- 2B params: AP = 33.6 %, AP50 = 71.8 %, APvt = 14.2 %, APt = 30.4 %, APs = 35.1 %, APm = 46.8 %
- 3B params: AP = 35.7 %, AP50 = 72.1 %, APvt = 16.2 %, APt = 34.5 %, APs = 39.5 %, APm = 51.4 %
- Outperforms all contemporary general and remote-sensing baselines (CoDETR, D3Q, LAE-DINO, DINOv3-sat).
DTOD
- AP50 = 29.8 % (AP50S = 24.0 %, AP50M = 26.0 %, AP50L = 42.0 %); the best prior method (SCDNet) reaches 24.2 %.
Cross-domain (VisDrone)
- mAP = 30.7 %, mAP50 = 58.8 % (vs LAE-DINO 29.5 %/57.5 %, CoDETR-Res50 11.8 %/24.6 %).
Ablation studies show that REM and DGQ are synergistic. REM alone adds +2.5 AP, DGQ alone +0.9 AP; together, the improvement is greater than the sum, especially for tiny objects (APvt).
6. Insights, Robustness, and Limitations
Progressive expert pre-training is critical to retain backbone specialization and avoid degenerate routing in large models. The category-wise top-1 selection in REM is indispensable for consistent improvement in both tiny and large object detection regimes. DGQ mitigates out-of-memory failures in extremely dense scenes and reallocates computational focus adaptively. Empirically, even smaller variants (800M params, fewer experts) outperform single-backbone baselines, indicating parameter efficiency under this design.
ScaleBridge-Det demonstrates greater cross-domain robustness than existing methods: on VisDrone (without fine-tuning), FAR/FP rates are lower and both tiny/large object recalls are improved relative to competing frameworks.
The principal limitations involve resource demands: large (2–3B parameter) configurations require substantial GPU memory and training time, precluding deployment on resource-constrained platforms. Possible future research directions include lighter-weight expert pruning or quantization and the extension of density prediction to multi-scale pyramids. Further work is anticipated in adapting the approach to other modalities and in exploring joint open-vocabulary/multi-modal pre-training for remote-sensing detection (Zhao et al., 1 Dec 2025).
7. Practical Applications and Future Directions
ScaleBridge-Det is suited for operational scenarios in overhead imagery analysis, UAV-based surveillance, and high-density remote sensing applications where objects of radically differing scales must be detected in the same image. Quantitative benchmarks indicate that ScaleBridge-Det effectively bridges the scale and density gap that has previously stymied the application of DETR-based and transformer-based models in remote sensing. Anticipated directions include adaptation for lightweight/UAV deployment, multi-modal imagery (e.g., SAR, hyperspectral), and further scaling of the expert pool to enhance open-domain robustness and vocabulary expansion (Zhao et al., 1 Dec 2025).