DiVLAD: Global Descriptor Aggregation for SfM
- DiVLAD is a global descriptor aggregation method that fuses ViT self-attention cues with patch-level features to encode geometric matchability in SfM.
- The approach employs a learnable gating mechanism to adaptively combine per-head attention maps and visual features, enhancing candidate image retrieval.
- DiVLAD integrates into the SupScene pipeline using overlap-aware, supervised contrastive learning to optimize geometric alignment and retrieval performance.
DiVLAD is a global descriptor aggregation method designed for retrieval tasks in large-scale, unconstrained Structure-from-Motion (SfM) scenarios. Introduced within the SupScene pipeline, DiVLAD leverages the final-layer multi-head self-attention maps of Vision Transformers (ViTs), specifically DINOv2-B, fusing these cues with patch-level visual features through a cluster-adaptive, learnable gating mechanism. This architecture enables the construction of global image descriptors that are more discriminative with respect to geometric overlap, outperforming prior methods such as NetVLAD in candidate image retrieval for downstream SfM pipelines (Shi et al., 17 Jan 2026).
1. Background: Global Descriptor Aggregation in SfM
Traditional SfM on unordered photo collections requires an efficient image retrieval step to select geometrically matchable image pairs, dramatically reducing the quadratic complexity of matching all image pairs. Conventional global descriptors, such as NetVLAD and GeM, are typically trained under semantic-similarity paradigms, which do not capture the geometric overlap relevant for SfM. DiVLAD advances this by focusing on the fusion of self-attention and visual features to directly encode geometric matchability into the global descriptor, leveraging supervised contrastive learning over subgraphs reflecting real geometric relationships.
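The complexity argument above can be made concrete with a quick pair-count comparison; the image count and retrieval depth below are hypothetical illustration values, not figures from the paper.

```python
# Candidate-pair counts: exhaustive matching vs. retrieval-based matching.
# n_images and top_k are illustrative values, not from the paper.

def exhaustive_pairs(n_images: int) -> int:
    """All unordered image pairs: O(n^2) growth."""
    return n_images * (n_images - 1) // 2

def retrieval_pairs(n_images: int, top_k: int) -> int:
    """Upper bound when each image is matched only against its top-k retrieved candidates."""
    return n_images * top_k

n, k = 10_000, 100
print(exhaustive_pairs(n))    # 49,995,000 pairs
print(retrieval_pairs(n, k))  # at most 1,000,000 pairs
```

For a 10,000-image collection, retrieval with a top-100 shortlist cuts candidate pairs by roughly a factor of fifty, which is why descriptor quality directly gates SfM throughput.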
2. DiVLAD Architecture and Mechanism
DiVLAD is built upon a DINOv2-B ViT backbone, with only the final transformer block fine-tuned for retrieval. It extends the NetVLAD paradigm by incorporating per-head self-attention maps from the final block, which are interpreted as semantically salient cues localized to the image patches. The construction of the DiVLAD descriptor proceeds as follows:
- Feature and Attention Extraction: The input image is resized to 322×322 pixels and processed by DINOv2-B. The last transformer block yields patch embeddings $x_i \in \mathbb{R}^{D}$ (reshaped to a $23 \times 23 \times D$ grid with $D = 768$), and per-head [CLS]-to-patch attention maps for heads $h = 1, \dots, H$ ($H = 12$ for DINOv2-B).
- Cluster Assignment: A lightweight convolutional head computes NetVLAD-style soft cluster-assignment scores $\alpha_k(i)$ for $K = 32$ clusters.
- Learnable Gating: For every attention head $h$, the attention map is normalized into per-token quality scores $q^h_i = \sigma(z^h_i)^{\gamma}$, the sigmoid of its z-score raised to a power $\gamma$. A per-head confidence $c_h$ is computed as the mean token quality modulated by the entropy of the head's attention distribution. Cluster-head gating weights $g_{k,h}$ are then obtained from learnable logits scaled by $c_h$ and normalized over heads.
- VLAD Aggregation: The gated VLAD residual for cluster $k$ is given by

$$V_k = \sum_{i} \alpha_k(i)\,\Big(\sum_{h} g_{k,h}\, q^h_i\Big)\,(x_i - c_k),$$

where $c_k$ denotes the $k$-th cluster centroid. The final descriptor is the L2-normalized concatenation of all $V_k$.
The gating mechanism enables adaptive fusion of semantic and visual cues per cluster, allowing DiVLAD to highlight co-visible regions and suppress less informative content.
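The aggregation steps above can be sketched in NumPy. Tensor shapes follow the paper's stated setting (529 patch tokens from a 322×322 input, $D = 768$, $H = 12$, $K = 32$); the exact confidence and gating formulas here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Sketch of DiVLAD-style gated VLAD aggregation on random data.
# Gating/confidence formulas are plausible assumptions, not the paper's code.

rng = np.random.default_rng(0)
N, D, H, K = 529, 768, 12, 32           # 23x23 patch tokens; DINOv2-B dims; 32 clusters
gamma = 2.0                              # exponent on the sigmoid quality score (assumed)

x = rng.normal(size=(N, D))              # patch embeddings from the last block
attn = rng.random(size=(H, N))           # per-head [CLS]-to-patch attention
attn /= attn.sum(axis=1, keepdims=True)
assign = rng.random(size=(N, K))         # soft cluster assignments (conv head output)
assign /= assign.sum(axis=1, keepdims=True)
centroids = rng.normal(size=(K, D))      # learnable cluster centroids
logits = rng.normal(size=(K, H))         # learnable cluster-head gating logits

# Per-token quality: sigmoid of the z-scored attention, raised to gamma.
z = (attn - attn.mean(axis=1, keepdims=True)) / (attn.std(axis=1, keepdims=True) + 1e-8)
quality = (1.0 / (1.0 + np.exp(-z))) ** gamma            # (H, N)

# Per-head confidence: mean quality modulated by attention entropy (assumed form).
entropy = -(attn * np.log(attn + 1e-12)).sum(axis=1)     # (H,)
conf = quality.mean(axis=1) * (1.0 - entropy / np.log(N))

# Cluster-head gates: confidence-scaled logits, normalized over heads.
g = np.exp(logits) * conf[None, :]
g /= g.sum(axis=1, keepdims=True)                        # (K, H)

# Gated VLAD residuals, concatenated and L2-normalized.
token_gate = g @ quality                                 # (K, N) per-token gate per cluster
V = np.stack([((assign[:, k] * token_gate[k])[:, None]
               * (x - centroids[k])).sum(axis=0) for k in range(K)])
desc = V.flatten()
desc /= np.linalg.norm(desc)
print(desc.shape)                                        # (24576,)
```

The final shape matches the 24,576-dimensional descriptor reported in the paper ($K \times D = 32 \times 768$).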
3. Integration in SupScene: Ground-Truth Overlap Supervision
Within SupScene, DiVLAD is trained using a soft, subgraph-based supervised contrastive loss. Training batches are sampled as subgraphs from scene overlap graphs, where ground-truth geometric overlap ratios are computed based on co-visible 3D points. The loss function incorporates these overlap ratios as soft, nonbinary weights on pairwise similarities among descriptors in a batch, promoting fine-grained alignment of descriptor similarity with geometric overlap.
Sampling strategies ensure sufficient positive pairs and neighborhood structure per batch: BFS-style anchor expansion samples each anchor's neighbors from the overlap graph, and balanced sampling maintains a target positive-pair ratio within each batch. The result is a pipeline that directly optimizes retrieval performance under the geometric criteria relevant for SfM.
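The overlap-weighted contrastive objective can be sketched as follows; this is a minimal illustration of the idea (overlap ratios reweighting pairwise log-probabilities), not the authors' exact Soft SupConLoss formulation, and the temperature value is an assumption.

```python
import numpy as np

def soft_supcon_loss(desc: np.ndarray, overlap: np.ndarray, temperature: float = 0.07) -> float:
    """Soft supervised contrastive loss with overlap ratios as pair weights.

    desc:    (B, d) L2-normalized global descriptors in a batch.
    overlap: (B, B) ground-truth geometric overlap ratios in [0, 1], zero diagonal.
    A sketch of the soft-weighting idea, not the paper's exact formulation.
    """
    B = desc.shape[0]
    sim = desc @ desc.T / temperature
    mask = ~np.eye(B, dtype=bool)
    # Log-softmax of each anchor's similarities over all other samples.
    logits = np.where(mask, sim, -np.inf)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Overlap ratios act as soft, nonbinary positive weights.
    w = overlap * mask
    row_w = w.sum(axis=1)
    valid = row_w > 0
    per_anchor = -(w * np.where(mask, log_prob, 0.0)).sum(axis=1)[valid] / row_w[valid]
    return float(per_anchor.mean())

# Usage on random descriptors and a symmetric random overlap matrix.
rng = np.random.default_rng(1)
d = rng.normal(size=(8, 16))
d /= np.linalg.norm(d, axis=1, keepdims=True)
ov = rng.random((8, 8))
ov = (ov + ov.T) / 2.0
np.fill_diagonal(ov, 0.0)
loss = soft_supcon_loss(d, ov)
print(loss)
```

High-overlap pairs contribute larger weights, so gradient pressure concentrates on aligning descriptor similarity with geometric overlap rather than with a binary positive/negative split.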
4. Experimental Evaluation and Comparative Performance
On the GL3D benchmark, DiVLAD demonstrates state-of-the-art retrieval performance with minimal parameter overhead. The following table summarizes key retrieval results (recall@25/100):
| Method | Backbone | Dim | Recall@25 | Recall@100 |
|---|---|---|---|---|
| SiaMAC | CNN | 512 | 59.4 | 88.6 |
| NetVLAD | CNN | 16,384 | 58.8 | 87.8 |
| MIRorR | CNN | 256 | 61.1 | 90.3 |
| IMvGCN | CNN | 2,048 | 70.8 | 70.8 |
| DINOv2 AVG | ViT | 768 | 71.5 | 96.6 |
| DINOv2 DiVLAD | ViT | 24,576 | 73.0 | 97.2 |
Ablation studies confirm the contributions of both the soft overlap-weighted loss and the learnable gating mechanism: Replacing Soft SupConLoss with the standard variant reduces Recall@25 from 73.0 to 72.3; removing the gate reduces mAP@25 from 79.8 to 79.6. Downstream, using DiVLAD retrieval in SfM registration with COLMAP yields the highest number of registered images on 1DSfM scenes, outperforming contemporary alternatives (Shi et al., 17 Jan 2026).
5. Computational Cost and Model Capacity
The DiVLAD architecture introduces a minimal additional parameter footprint over the DINOv2 backbone: approximately 0.3M parameters in the conv-head for cluster assignment and 512 scalars for the cluster-head gating matrix. The per-image inference time is effectively unchanged compared to NetVLAD with DINOv2, as attention maps are natively produced during ViT forward passes and the gating module adds negligible computational overhead. The total descriptor dimensionality is $K \times D = 32 \times 768 = 24{,}576$, with $K = 32$ clusters and $D = 768$ patch-embedding channels.
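The dimensionality accounting above is a one-line check; the memory estimate below assumes float32 storage, which the paper does not specify.

```python
# Descriptor-size accounting for DiVLAD's stated setting.
K, D = 32, 768              # clusters, patch-embedding channels
dim = K * D                 # concatenated VLAD residuals
bytes_fp32 = dim * 4        # storage per descriptor, assuming float32

print(dim)                  # 24576, matching the reported descriptor dimensionality
print(bytes_fp32 / 1024)    # 96.0 KiB per image at float32
```

At 96 KiB per image, a 100k-image collection needs roughly 9.6 GB of raw descriptor storage, which is why dimensionality matters for large-scale retrieval indices.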
6. Impact, Limitations, and Prospects
DiVLAD, as part of the SupScene pipeline, sets new standards for retrieval performance in unconstrained SfM, enabling more efficient and accurate candidate pair selection with direct geometric supervision. The pipeline demonstrates transferability of its overlap-aware training strategy to other aggregation methods, although DiVLAD attains the top performance. Limitations include dependence on ViT's patch-level representations and the need for explicit geometric overlap computation during training.
Potential future developments encompass richer subgraph sampling schemas (e.g., with variable neighborhood size), integration with graph neural network refinements for descriptor pooling, and extension to video-based or real-time retrieval scenarios in SfM (Shi et al., 17 Jan 2026).