Density-Guided Dynamic Query (DGQ)

Updated 8 December 2025

DGQ is a mechanism that uses learned density maps to dynamically modulate query generation and positioning, optimizing detection for tiny and crowded objects.
It integrates density estimation into transformer-based detectors via weighted sampling or top-K selection, ensuring efficient resource allocation based on scene complexity.
Empirical studies show that DGQ enhances AP performance in remote sensing and crowd detection tasks by addressing the limitations of fixed-query strategies.

Density-Guided Dynamic Query (DGQ) comprises a class of mechanisms that utilize learned or estimated object density maps to dynamically modulate query generation and positioning for transformer-based object detectors, with the aim of improving instance detection in scale-diverse, crowded, or density-imbalanced imagery. This approach addresses the inefficiencies of fixed-query detectors, which are often suboptimal for scenes with high object density variation, and enhances performance particularly for tiny or crowded objects. DGQ methods have been integrated into major detection architectures for tiny object detection in remote sensing, dynamic query selection in tiny object datasets, and 2D/3D crowd detection, as exemplified by ScaleBridge-Det (Zhao et al., 1 Dec 2025), DQ-DETR (Huang et al., 4 Apr 2024), and CrowdQuery (Dähling et al., 10 Sep 2025).

1. Motivation and Conceptual Foundations

The core motivation for DGQ stems from the mismatch between global, fixed-budget query protocols of traditional DETR-style detectors and the variable, often highly non-uniform spatial distribution of objects in real-world or remote sensing imagery. In high-density settings, a static number of queries leads to "query starvation" in crowded regions, degrading recall for small or overlapping targets. Conversely, for sparse scenes, the fixed query budget results in computational waste and potential memory overflow during matching steps. DGQ introduces adaptivity by:

Predicting a spatial object density map per image/scene.
Estimating the global object count or density distribution.
Dynamically adjusting both the number and locations of detection queries conditioned on these density signals.

This guidance allows the model to allocate queries more efficiently to high-density regions (tiny or crowded objects), avoids over-provisioning on sparse images, and enables better resource utilization and detection performance for varied object scale and scene complexity (Zhao et al., 1 Dec 2025, Huang et al., 4 Apr 2024, Dähling et al., 10 Sep 2025).

2. Mathematical Foundations and Losses

DGQ modules differ in the precise construction of density maps and their integration, but several common elements are notable:

2.1 Density Map Definition

Point-based (ScaleBridge-Det): For object centers $\{c_i\}$ , the ground-truth density is:

$D^{gt}(p) = \sum_i \exp\left(-\frac{\|p - c_i\|^2}{2\sigma^2}\right)$

where $p$ is a pixel location and $\sigma = 1.5$. The map is then rescaled so that it sums to the total instance count (Zhao et al., 1 Dec 2025).
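The point-based construction can be illustrated with a minimal NumPy sketch. The function name `gaussian_density_map`, the `(x, y)` center convention, and the grid size are assumptions made for illustration; only the Gaussian form, $\sigma = 1.5$, and the rescaling to the instance count come from the description above.

```python
import numpy as np

def gaussian_density_map(centers, height, width, sigma=1.5):
    """Sum of isotropic Gaussians at object centers, rescaled to sum to the object count.

    centers: iterable of (x, y) pixel coordinates (hypothetical convention).
    """
    ys, xs = np.mgrid[0:height, 0:width]  # row and column index grids, each (H, W)
    density = np.zeros((height, width), dtype=np.float64)
    for cx, cy in centers:
        sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
        density += np.exp(-sq_dist / (2.0 * sigma ** 2))
    total = density.sum()
    if total > 0:
        # Rescale so the map integrates to the number of instances.
        density *= len(centers) / total
    return density

centers = [(10.0, 12.0), (11.0, 12.0), (40.0, 30.0)]
density = gaussian_density_map(centers, height=64, width=64)
```

Because the map sums to the instance count, summing a predicted map of the same form gives a direct estimate of how many objects the scene contains.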

Bounding-box–windowed (CrowdQuery):

$f_i(x, y) = \begin{cases} \exp\left(-\frac{1}{2}\left[\left(\frac{x - \mu_{x,i}}{\sigma_{x,i}}\right)^2 + \left(\frac{y - \mu_{y,i}}{\sigma_{y,i}}\right)^2\right]\right), & \text{if } (x,y) \in B_i \\ 0, & \text{otherwise} \end{cases}$

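with $\sigma_{x,i} = d \cdot w_i$, $\sigma_{y,i} = d \cdot h_i$, and $d=1/3$, where $w_i$ and $h_i$ are the width and height of bounding box $B_i$ and the Gaussian is centered at the box center $\mu_i = (\mu_{x,i}, \mu_{y,i})$ (Dähling et al., 10 Sep 2025).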
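A minimal sketch of the box-windowed variant follows. The per-object term $f_i$ is taken from the formula above; summing the $f_i$ into a single map, the `(x_min, y_min, x_max, y_max)` box format, and the function name `box_windowed_density` are assumptions, since the combination rule is not stated here.

```python
import numpy as np

def box_windowed_density(boxes, height, width, d=1.0 / 3.0):
    """Anisotropic Gaussian per box, truncated to zero outside the box.

    boxes: iterable of (x_min, y_min, x_max, y_max) in pixels (assumed format).
    The per-box maps are summed, which is an assumption of this sketch.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        w, h = x1 - x0, y1 - y0
        if w <= 0 or h <= 0:
            continue  # skip degenerate boxes to avoid dividing by zero
        mu_x, mu_y = x0 + w / 2.0, y0 + h / 2.0
        sigma_x, sigma_y = d * w, d * h
        gauss = np.exp(-0.5 * (((xs - mu_x) / sigma_x) ** 2
                               + ((ys - mu_y) / sigma_y) ** 2))
        inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
        density += np.where(inside, gauss, 0.0)
    return density

dm = box_windowed_density([(10, 10, 30, 20)], height=32, width=48)
```

With $d = 1/3$, a box corner lies 1.5 standard deviations from the center along each axis, so the Gaussian has decayed substantially, though not to zero, at the point where the window cuts it off.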

Implicit feature-based (DQ-DETR): No explicit ground-truth density formula is given. The “density map” is a convolutional feature map trained through a counting classification loss (Huang et al., 4 Apr 2024).

2.2 Density Prediction Losses

ScaleBridge-Det: Supervision combines pixelwise MSE ( $L_2$ ) with binary cross-entropy over object center pixels:

$\mathcal{L}_{\text{density}} = \|D^{pred} - D^{gt}\|_2^2 + \lambda_{cls} \cdot \mathcal{L}_{\textrm{BCE}}(D^{pred}, M^{gt})$

where $M^{gt}$ is a binary mask for object centers, $\lambda_{cls}=1$ (Zhao et al., 1 Dec 2025).
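The combined loss can be written out as a short NumPy sketch. Several choices here are assumptions rather than details from the source: the squared-error term is summed over pixels as in the formula (a mean would differ only by a constant factor), $D^{pred}$ is clipped and used directly as a probability in the BCE term, and the BCE is averaged over pixels.

```python
import numpy as np

def density_loss(d_pred, d_gt, m_gt, lambda_cls=1.0, eps=1e-7):
    """Squared L2 error on the density map plus weighted BCE on the center mask.

    d_pred, d_gt: (H, W) predicted and ground-truth density maps.
    m_gt: (H, W) binary mask with 1 at object-center pixels.
    """
    sq_l2 = np.sum((d_pred - d_gt) ** 2)
    # Assumption: treat the clipped prediction as a per-pixel probability.
    p = np.clip(d_pred, eps, 1.0 - eps)
    bce = -np.mean(m_gt * np.log(p) + (1.0 - m_gt) * np.log(1.0 - p))
    return sq_l2 + lambda_cls * bce

d_gt = np.zeros((4, 4))
d_gt[1, 2] = 1.0
m_gt = d_gt.copy()
loss_uniform = density_loss(np.full((4, 4), 0.5), d_gt, m_gt)
```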

DQ-DETR: A cross-entropy loss is used to predict the count class (e.g., $N\leq10$ , $10<N\leq100$ , etc.) (Huang et al., 4 Apr 2024).
CrowdQuery: The density prediction branch is supervised via the density map defined above, with discretization into bins and fusion into query features (Dähling et al., 10 Sep 2025).
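As an illustration of count classification, the sketch below maps an instance count to a bin index and evaluates a softmax cross-entropy on hypothetical logits. The bin edges (10, 100, 500) follow the four count bins tabulated for DQ-DETR in Section 3; the function names and the plain-list logits are assumptions.

```python
import math

# Upper bin edges; counts above the last edge fall in the final bin.
COUNT_BIN_EDGES = (10, 100, 500)

def count_to_bin(n):
    """Return the index of the count bin containing n."""
    for idx, edge in enumerate(COUNT_BIN_EDGES):
        if n <= edge:
            return idx
    return len(COUNT_BIN_EDGES)

def count_cross_entropy(logits, n_true):
    """Softmax cross-entropy between bin logits and the true count's bin."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[count_to_bin(n_true)]

loss = count_cross_entropy([0.0, 0.0, 0.0, 0.0], n_true=42)
```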

3. Dynamic Query Count and Spatial Allocation

Dynamic adjustment of query number and placement is a central component of DGQ:

ScaleBridge-Det: The total predicted density is summed and rounded:

$N = \mathrm{round}\left(\sum_p D^{pred}(p)\right)$

Query count $Q$ is assigned by piecewise tiers:

| $N$ (estimated count) | $Q$ (query count) |
|-----------------------|------------------|
| $N \leq 10$ | 900 |
| $10 < N \leq 100$ | 1200 |
| $100 < N \leq 500$ | 1500 |
| $N > 500$ | 2000 |

Query positions are initialized by weighted sampling: $P(p\ \textrm{selected}) \propto D^{pred}(p)$ (Zhao et al., 1 Dec 2025).
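The ScaleBridge-Det steps described here (sum the density, pick a tier, sample positions) can be sketched as follows. The tier values come from the table; sampling with replacement, clipping negative densities to zero, and the function names are assumptions of this sketch, not details confirmed by the source.

```python
import numpy as np

# (upper count bound, query budget) pairs; counts above 500 get MAX_QUERIES.
QUERY_TIERS = ((10, 900), (100, 1200), (500, 1500))
MAX_QUERIES = 2000

def query_budget(d_pred):
    """Return (estimated count N, query count Q) for a predicted density map."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    n = int(round(float(d_pred.sum())))
    for upper, q in QUERY_TIERS:
        if n <= upper:
            return n, q
    return n, MAX_QUERIES

def sample_query_positions(d_pred, num_queries, rng):
    """Draw (x, y) positions with probability proportional to predicted density."""
    weights = np.clip(d_pred, 0.0, None).ravel()
    total = weights.sum()
    if total > 0:
        probs = weights / total
    else:
        probs = np.full(weights.size, 1.0 / weights.size)  # uniform fallback
    # Assumption: sample with replacement, since Q may exceed the nonzero pixels.
    flat_idx = rng.choice(weights.size, size=num_queries, replace=True, p=probs)
    rows, cols = np.unravel_index(flat_idx, d_pred.shape)
    return np.stack([cols, rows], axis=1)

d_pred = np.zeros((32, 32))
d_pred[5, 7] = 3.0
d_pred[20, 10] = 9.0
n_est, q = query_budget(d_pred)
positions = sample_query_positions(d_pred, q, np.random.default_rng(0))
```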

DQ-DETR: The count bin is predicted (via categorical softmax). The selected bin determines the budget $K$ :

| Bin / $N$ | $K$ (queries) |
|-----------------------|------------------|
| $N \leq 10$ | 300 |
| $10 < N \leq 100$ | 500 |
| $100 < N \leq 500$ | 900 |
| $N > 500$ | 1500 |

Query positions are obtained by selecting top- $K$ highest-scoring features from an enhanced feature map (post attention/fusion with density features) (Huang et al., 4 Apr 2024).
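A minimal sketch of the DQ-DETR selection step, assuming a per-location score map is already available: the predicted bin sets the budget $K$ from the table, and the $K$ highest-scoring locations become query positions. How the scores are produced from the enhanced features is outside this sketch, and the names `BIN_TO_BUDGET` and `select_topk_queries` are hypothetical.

```python
import numpy as np

# Query budget K for each of the four count bins, in bin order.
BIN_TO_BUDGET = (300, 500, 900, 1500)

def select_topk_queries(scores, bin_logits):
    """Pick the top-K locations of a score map, with K set by the predicted bin.

    scores: (H, W) per-location scores (assumed given).
    bin_logits: length-4 count-bin logits; argmax of logits equals argmax of softmax.
    Returns (K, 2) array of (x, y) positions and their scores, best first.
    """
    k = BIN_TO_BUDGET[int(np.argmax(bin_logits))]
    flat = scores.ravel()
    k = min(k, flat.size)  # cannot select more locations than exist
    top = np.argpartition(flat, -k)[-k:]      # unordered indices of the K largest
    top = top[np.argsort(flat[top])[::-1]]    # order them by descending score
    rows, cols = np.unravel_index(top, scores.shape)
    return np.stack([cols, rows], axis=1), flat[top]

scores = np.arange(1600, dtype=np.float64).reshape(40, 40)
pos, vals = select_topk_queries(scores, bin_logits=[2.0, 0.1, 0.0, -1.0])
```

Unlike weighted sampling, top-$K$ selection is deterministic and never returns the same location twice.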

CrowdQuery: All queries receive dynamic density guidance via cross-attention with a fused “density embedding” feature map, which adaptively encodes local density at each decoder layer. The query count $N_{q}$ is not adjusted per image; instead, query focus is modulated by the multi-head density cross-attention mechanism (Dähling et al., 10 Sep 2025).
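The idea of density-conditioned queries with a fixed query count can be illustrated with a heavily simplified sketch: predicted densities are discretized into bins, each bin is mapped to an embedding vector, and queries attend to those embeddings with a residual connection. This is a single-head version without learned projections, positional encodings, or normalization; the bin edges, embedding table, and function name are all hypothetical and do not reproduce the CrowdQuery implementation.

```python
import numpy as np

def density_cross_attention(queries, d_pred, bin_edges, embed_table):
    """Residual single-head cross-attention from queries to density embeddings.

    queries: (Nq, C) query features.
    d_pred: (H, W) predicted density map.
    bin_edges: ascending array of B-1 thresholds (hypothetical values).
    embed_table: (B, C) one embedding per density bin (would be learned).
    """
    bins = np.digitize(d_pred.ravel(), bin_edges)   # (H*W,), values in 0..B-1
    dens_embed = embed_table[bins]                   # (H*W, C) keys and values
    logits = queries @ dens_embed.T / np.sqrt(queries.shape[1])
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return queries + attn @ dens_embed               # residual update, shape (Nq, C)

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 8))
d_pred = rng.random((4, 6))
edges = np.array([0.25, 0.5, 0.75])
out = density_cross_attention(queries, d_pred, edges, rng.normal(size=(4, 8)))
```

The output has the same shape as the input queries, which reflects the point made above: the number of queries stays fixed and only their content is conditioned on density.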

4. Architectural Integration of DGQ Modules

DGQ is modularly inserted into modern object-detection transformer architectures:

ScaleBridge-Det: DGQ is placed after the scale-adaptive Routing-Enhanced Mixture Attention (REM), leveraging the highest-resolution REM feature for density prediction and dynamic query initialization. The outputs form the input to the main DETR decoder. No changes are made to subsequent decoder layers, ensuring full compatibility (Zhao et al., 1 Dec 2025).
DQ-DETR: The categorical counting module operates on encoder features, outputs a density feature map, and guides both feature enhancement (CBAM-style attention) and dynamic query selection before the decoder (Huang et al., 4 Apr 2024).
CrowdQuery: The DGQ module ingests the fused multi-scale backbone feature map, predicts pixelwise density, converts it into embedded density codes, and injects these into the transformer decoder via residual cross-attention at every decoder stage. This allows iterative refinement of density-conditioned query representations (Dähling et al., 10 Sep 2025).

5. Implementation Details and Hyperparameters

Key implementation characteristics of DGQ variants include:

| Model | Density Ground Truth | Query Budget Range | Query Position Allocation | Loss / Regularization |
|-------|----------------------|--------------------|---------------------------|-----------------------|
| ScaleBridge-Det | Gaussian at centers ($\sigma=1.5$) | 900–2000 | Weighted sampling from $D^{pred}$ | $L_2$ + BCE, $\lambda_{cls}=1$ |
| DQ-DETR | Implicit (conv feature) | 300–1500 | Top-K by FFN on enhanced features | Count classification |
| CrowdQuery | Box-windowed Gaussian | Fixed $N_q$ | Cross-attention with $F_{dens}$ | Bin/embedding, LayerNorm |

Additional details include the density head structure (e.g., three 3×3 convolutions followed by one 1×1 convolution), input resolution (e.g., 1024×1024), optimization strategy (e.g., AdamW), and conventional DETR post-processing (e.g., thresholding at $\tau_{conf}=0.5$) (Zhao et al., 1 Dec 2025, Huang et al., 4 Apr 2024, Dähling et al., 10 Sep 2025).

6. Quantitative Impact and Ablation

The reported results attribute the following gains to DGQ:

ScaleBridge-Det on AI-TOD-v2: Adding DGQ to the CoDETR baseline yields AP $32.6\%$ (vs. $31.7\%$) and raises very-tiny AP ($\mathrm{AP}_{vt}$) from $13.9\%$ to $15.7\%$. Combining DGQ with REM yields AP $36.7\%$ ($\mathrm{AP}_{vt}=15.8\%$) (Zhao et al., 1 Dec 2025).
ScaleBridge-Det on DTOD (dense): With DGQ, the model achieves $\mathrm{AP}_{50}=29.8\%$, versus $24.2\%$ for the previous state of the art (SCDNet), with balanced performance across size subgroups.
DQ-DETR on AI-TOD-v2: DQ-DETR achieves overall mAP $30.2\%$ , outperforming DINO-DETR baseline ( $25.9\%$ ). Per-density-class improvements over the fixed-900-query baseline range from $+3.6\,\mathrm{AP}$ (low density) to $+1.2\,\mathrm{AP}$ (ultra-high density) (Huang et al., 4 Apr 2024).
CrowdQuery (STCrowd): CQ2D increases 2D AP from $89.6\%$ to $91.4\%$ ( $+1.8$ ), and CQ3D boosts 3D mAP from $48.5$ to $52.7$ ( $+4.2$ ) (Dähling et al., 10 Sep 2025).

Ablation studies indicate that DGQ contributes gains on its own and yields further improvements when combined with scale-adaptive or counting modules.

7. Extensions, Generality, and Limitations

DGQ mechanisms are general-purpose plugins for transformer-based detectors (e.g., DETR, DINO, Deformable DETR). The only architectural requirement is the existence of spatial query-like mechanisms and a cross-attention decoder stage. Because DGQ relies only on a density map definition, it can transfer to video, 3D, or multimodal regimes as long as a dense annotation (e.g., bounding boxes, instance centers) is available (Dähling et al., 10 Sep 2025).

Typical limitations include increased parameter count (e.g., +12.9% for CQ3D), modest inference runtime penalties, and sensitivity to density hyperparameters ( $\sigma$ , $d$ , bin choices). The quality of the density prediction head can govern overall system performance, and proper regularization is necessary to avoid spurious density predictions. Prospective directions include density-aware panoptic segmentation, multi-camera fusion, and 3D extension for LiDAR/point cloud scenes, contingent on appropriate density annotation schemes (Dähling et al., 10 Sep 2025).

References

"Bridging the Scale Gap: Balanced Tiny and General Object Detection in Remote Sensing Imagery" (Zhao et al., 1 Dec 2025)
"DQ-DETR: DETR with Dynamic Query for Tiny Object Detection" (Huang et al., 4 Apr 2024)
"CrowdQuery: Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes" (Dähling et al., 10 Sep 2025)