
DiVLAD: Global Descriptor Aggregation for SfM

Updated 24 January 2026
  • DiVLAD is a global descriptor aggregation method that fuses ViT self-attention cues with patch-level features to encode geometric matchability in SfM.
  • The approach employs a learnable gating mechanism to adaptively combine per-head attention maps and visual features, enhancing candidate image retrieval.
  • DiVLAD integrates into the SupScene pipeline using overlap-aware, supervised contrastive learning to optimize geometric alignment and retrieval performance.

DiVLAD is a global descriptor aggregation method designed for retrieval tasks in large-scale, unconstrained Structure-from-Motion (SfM) scenarios. Introduced within the SupScene pipeline, DiVLAD leverages the final-layer multi-head self-attention maps of Vision Transformers (ViTs), specifically DINOv2-B, fusing these cues with patch-level visual features through a cluster-adaptive, learnable gating mechanism. This architecture enables the construction of global image descriptors that are more discriminative with respect to geometric overlap, outperforming prior methods such as NetVLAD in candidate image retrieval for downstream SfM pipelines (Shi et al., 17 Jan 2026).

1. Background: Global Descriptor Aggregation in SfM

Traditional SfM on unordered photo collections requires an efficient image retrieval step to select geometrically matchable image pairs, dramatically reducing the quadratic complexity of matching all image pairs. Conventional global descriptors, such as NetVLAD and GeM, are typically trained under semantic-similarity paradigms, which do not capture the geometric overlap relevant for SfM. DiVLAD advances this by focusing on the fusion of self-attention and visual features to directly encode geometric matchability into the global descriptor, leveraging supervised contrastive learning over subgraphs reflecting real geometric relationships.
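The role of retrieval in this setting can be illustrated with a minimal sketch (assumed names and toy data, not from the paper): given L2-normalized global descriptors, keeping only each image's top-k nearest neighbors replaces the O(N²) exhaustive pair list with a much smaller candidate set for geometric matching.

```python
import numpy as np

def candidate_pairs(descriptors, k=25):
    """Select top-k retrieval candidates per image from global
    descriptors, instead of matching all O(N^2) image pairs."""
    # L2-normalize so the dot product is cosine similarity
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = d @ d.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    pairs = set()
    for i in range(len(sim)):
        for j in np.argsort(-sim[i])[:k]:
            pairs.add((min(i, int(j)), max(i, int(j))))  # undirected pair
    return sorted(pairs)

# toy example: 100 random 64-D descriptors, top-5 neighbors each
rng = np.random.default_rng(0)
pairs = candidate_pairs(rng.standard_normal((100, 64)), k=5)
print(len(pairs), "candidate pairs instead of", 100 * 99 // 2)
```

With k fixed, the candidate count grows linearly in the number of images rather than quadratically, which is what makes retrieval quality (and thus the descriptor) the bottleneck for downstream SfM.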

2. DiVLAD Architecture and Mechanism

DiVLAD is built upon a DINOv2-B ViT backbone, with only the final transformer block fine-tuned for retrieval. It extends the NetVLAD paradigm by incorporating per-head self-attention maps from the final block, which are interpreted as semantically salient cues localized to the image patches. The construction of the DiVLAD descriptor proceeds as follows:

  • Feature and Attention Extraction: The input image is resized to 322×322 pixels and processed by DINOv2-B. The last transformer block yields a patch-embedding map $\mathbf{F}\in\mathbb{R}^{C\times H\times W}$ (reshaped to tokens $\{\mathbf{x}_n\}_{n=1}^{N}$), and per-head [CLS]-to-patch attention maps $\mathbf{A}_{h,n}$ for $N_h=8$ heads.
  • Cluster Assignment: A lightweight convolutional head computes NetVLAD-style soft cluster-assignment scores $a_{k,n}$ for $K=64$ clusters.
  • Learnable Gating: For every attention head $h$, the attention map is normalized into per-token quality scores $w_{h,n}$ by applying a sigmoid to its z-score and raising the result to the power $\gamma_g=0.5$. A per-head confidence $s_h$ is computed as the mean token quality, modulated by the entropy of the attention distribution. Cluster-head gating weights $g_{k,h}$ are then computed as $\mathrm{softplus}(\mathbf{G}_{k,h})\, s_h$, normalized over heads.
  • VLAD Aggregation: The gated VLAD residual for cluster $k$ is given by

$$\mathbf{v}_k = \sum_{n=1}^{N} a_{k,n} \left( \sum_{h=1}^{N_h} g_{k,h}\, w_{h,n} \right) (\mathbf{x}_n - \mathbf{c}_k),$$

where $\mathbf{c}_k$ denotes the $k$th cluster centroid. The final descriptor is the concatenation and normalization of all $\{\mathbf{v}_k\}_{k=1}^{K}$.

The gating mechanism enables adaptive fusion of semantic and visual cues per cluster, allowing DiVLAD to highlight co-visible regions and suppress less informative content.
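The aggregation steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the exact form of the sigmoid/z-score normalization and of the entropy modulation of per-head confidence are not fully specified in the text, so the versions below are assumptions, marked in comments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

def divlad_aggregate(X, A, centroids, assign, G, gamma=0.5):
    """Sketch of the gated VLAD aggregation.

    X         : (N, C)   patch embeddings x_n
    A         : (Nh, N)  per-head [CLS]-to-patch attention (nonnegative)
    centroids : (K, C)   cluster centroids c_k
    assign    : (K, N)   soft cluster assignments a_{k,n}
    G         : (K, Nh)  learnable gating logits G_{k,h}
    """
    Nh, N = A.shape
    # per-token quality w_{h,n}: sigmoid of z-scored attention, raised to
    # gamma (assumed reading of "sigmoid of its z-score raised to gamma")
    z = (A - A.mean(1, keepdims=True)) / (A.std(1, keepdims=True) + 1e-6)
    w = sigmoid(z) ** gamma                              # (Nh, N)
    # per-head confidence s_h: mean quality damped by normalized attention
    # entropy (the precise modulation is an assumption)
    p = A / (A.sum(1, keepdims=True) + 1e-12)
    ent = -(p * np.log(p + 1e-12)).sum(1) / np.log(N)    # in [0, 1]
    s = w.mean(1) * (1.0 - ent)                          # (Nh,)
    # cluster-head gates g_{k,h} = softplus(G_{k,h}) * s_h, head-normalized
    g = softplus(G) * s                                  # (K, Nh)
    g = g / (g.sum(1, keepdims=True) + 1e-12)
    # gated VLAD residuals v_k = sum_n a_{k,n} (sum_h g_{k,h} w_{h,n}) (x_n - c_k)
    token_gate = g @ w                                   # (K, N)
    resid = X[None, :, :] - centroids[:, None, :]        # (K, N, C)
    V = ((assign * token_gate)[:, :, None] * resid).sum(1)  # (K, C)
    # intra-normalize per cluster, then L2-normalize the flat descriptor
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

The output is a unit-norm vector of length $K \times C$, matching the concatenate-and-normalize step described above.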

3. Integration in SupScene: Ground-Truth Overlap Supervision

Within SupScene, DiVLAD is trained using a soft, subgraph-based supervised contrastive loss. Training batches are sampled as subgraphs from scene overlap graphs, where ground-truth geometric overlap ratios $w_{ij}$ are computed based on co-visible 3D points. The loss function incorporates these overlap ratios as soft, nonbinary weights on pairwise similarities among descriptors in a batch, promoting fine-grained alignment of descriptor similarity with geometric overlap.
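A minimal sketch of such an overlap-weighted contrastive objective is shown below. It follows the standard SupCon template with overlap ratios as soft positive weights; the paper's exact formulation (temperature, weighting, normalization) may differ, so treat this as an illustrative assumption.

```python
import numpy as np

def soft_supcon_loss(D, W, tau=0.1):
    """Soft, overlap-weighted supervised contrastive loss (sketch).

    D : (B, d)  L2-normalized descriptors for one batch
    W : (B, B)  ground-truth overlap ratios w_ij in [0, 1], zero diagonal
    """
    sim = (D @ D.T) / tau                       # temperature-scaled cosine sims
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    # log-probability of j given anchor i over all non-self candidates
    log_p = sim - np.log(np.exp(sim).sum(1, keepdims=True))
    np.fill_diagonal(log_p, 0.0)                # avoid 0 * (-inf) below
    # overlap ratios act as soft, non-binary positive weights per anchor
    weights = W / (W.sum(1, keepdims=True) + 1e-12)
    return float(-(weights * log_p).sum(1).mean())
```

Because the weights are proportional to $w_{ij}$ rather than binary, pairs with larger geometric overlap pull descriptors together more strongly, which is the fine-grained alignment described above.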

Sampling strategies ensure sufficient positive pairs and neighborhood structure per batch: BFS-style anchor-expansion samples neighbors with $w_{ij}\ge\tau_{iou}$, and balanced sampling maintains the target positive-pair ratio $\rho$. The result is a pipeline that directly optimizes for retrieval performance under the geometric criteria relevant for SfM.
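The BFS-style anchor expansion can be sketched as below. This is an assumed reconstruction: the graph representation, batch size, and the balanced-sampling step for the ratio $\rho$ are not detailed in the text (the latter is omitted here).

```python
import random
from collections import deque

def sample_subgraph(overlap, anchor, batch_size=16, tau_iou=0.1, seed=0):
    """BFS-style anchor expansion over a scene overlap graph (sketch).

    overlap : dict mapping image id -> {neighbor id: overlap ratio w_ij}
    anchor  : image id to start the expansion from
    """
    rng = random.Random(seed)
    batch, seen = [anchor], {anchor}
    queue = deque([anchor])
    while queue and len(batch) < batch_size:
        i = queue.popleft()
        # only neighbors whose overlap clears the threshold are positives
        nbrs = [j for j, w in overlap.get(i, {}).items()
                if w >= tau_iou and j not in seen]
        rng.shuffle(nbrs)  # randomize expansion order
        for j in nbrs:
            if len(batch) >= batch_size:
                break
            seen.add(j)
            batch.append(j)
            queue.append(j)
    return batch
```

Expanding breadth-first keeps each batch a connected neighborhood of the overlap graph, so the contrastive loss sees many related views rather than isolated pairs.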

4. Experimental Evaluation and Comparative Performance

On the GL3D benchmark, DiVLAD demonstrates state-of-the-art retrieval performance with minimal parameter overhead. The following table summarizes key retrieval results (recall@25/100):

| Method | Backbone | Dim | Recall@25 | Recall@100 |
|---|---|---|---|---|
| SiaMAC | CNN | 512 | 59.4 | 88.6 |
| NetVLAD | CNN | 16,384 | 58.8 | 87.8 |
| MIRorR | CNN | 256 | 61.1 | 90.3 |
| IMvGCN | CNN | 2,048 | 70.8 | 70.8 |
| DINOv2 AVG | ViT | 768 | 71.5 | 96.6 |
| DINOv2 DiVLAD | ViT | 24,576 | 73.0 | 97.2 |

Ablation studies confirm the contributions of both the soft overlap-weighted loss and the learnable gating mechanism: replacing the soft SupCon loss with its standard variant reduces Recall@25 from 73.0 to 72.3, and removing the gate reduces mAP@25 from 79.8 to 79.6. Downstream, using DiVLAD retrieval in SfM registration with COLMAP yields the highest number of registered images on 1DSfM scenes, outperforming contemporary alternatives (Shi et al., 17 Jan 2026).

5. Computational Cost and Model Capacity

The DiVLAD architecture introduces a minimal additional parameter footprint over the DINOv2 backbone: approximately 0.3M parameters in the conv-head for cluster assignment and 512 scalars for the cluster-head gating matrix. The per-image inference time is effectively unchanged compared to NetVLAD with DINOv2, as attention maps are natively produced during ViT forward passes and the gating module adds negligible computational overhead. The total descriptor dimensionality is $K \times C = 64 \times C$, with $K=64$ clusters and $C$ patch-embedding channels.

6. Impact, Limitations, and Prospects

DiVLAD, as part of the SupScene pipeline, sets new standards for retrieval performance in unconstrained SfM, enabling more efficient and accurate candidate pair selection with direct geometric supervision. The pipeline demonstrates transferability of its overlap-aware training strategy to other aggregation methods, although DiVLAD attains the top performance. Limitations include dependence on ViT's patch-level representations and the need for explicit geometric overlap computation during training.

Potential future developments encompass richer subgraph sampling schemas (e.g., with variable neighborhood size), integration with graph neural network refinements for descriptor pooling, and extension to video-based or real-time retrieval scenarios in SfM (Shi et al., 17 Jan 2026).
