DINOv2 & SALAD: High-Performance Vision Pooling
- DINOv2 is a self-supervised vision transformer that generates robust patch-level and global embeddings for diverse visual tasks.
- SALAD employs ε-regularized optimal transport with Sinkhorn iterations to achieve sharper, semantically meaningful feature clustering.
- The combined approach significantly boosts performance in visual place recognition, SLAM, medical imaging, and anomaly detection.
DINOv2 features with SALAD aggregation combine advanced vision transformer descriptors with principled optimal transport pooling for high-performance visual representation. DINOv2, a self-supervised Vision Transformer (ViT) model, produces both patch-level and global embeddings with robust cross-domain semantics, while SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) replaces classical NetVLAD-style aggregation with entropy-regularized optimal transport assignments. This pairing has demonstrated strong empirical results for image retrieval, visual place recognition, logical anomaly detection, and volumetric medical classification.
1. Architecture and Feature Extraction in DINOv2
DINOv2 employs ViT architectures that process image inputs by splitting them into non-overlapping patches ($14 \times 14$ pixels for ViT-B/14), each patch linearly mapped to a $D$-dimensional embedding (e.g., $D = 768$ for ViT-B/14, $D = 384$ for ViT-S/14). The sequence of patch tokens is fed through a stack of transformer blocks (12 for ViT-B), which can be fine-tuned by freezing most blocks to control overfitting, as in visual place recognition where only the last few blocks are updated (Izquierdo et al., 2023).
A global class token is prepended to capture image-level context. The backbone outputs the class token plus $n$ patch tokens, where the token count $n$ scales with input resolution. These token vectors serve as high-dimensional local descriptors for downstream aggregation. Fine-tuning DINOv2 on relevant targets, such as the GSV-Cities dataset for VPR, is performed with the Multi-Similarity loss, which effectively separates positive (same place) from negative (different place) instances in embedding space. DINOv2's self-distillation and patch-level objectives prevent feature collapse and bias, ensuring diverse and expressive representations (Oquab et al., 2023).
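A minimal sketch of extracting these tokens with the publicly released DINOv2 torch.hub entry point is shown below; the output dictionary keys follow the facebookresearch/dinov2 repository, and the input tensors are placeholders.

```python
# Sketch: extracting patch and class tokens from DINOv2 ViT-B/14.
# Entry point and output keys ("x_norm_patchtokens", "x_norm_clstoken")
# follow the public facebookresearch/dinov2 repository.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

# Input resolution must be a multiple of the 14-pixel patch size.
images = torch.randn(2, 3, 224, 224)           # placeholder batch of 2 RGB images
with torch.no_grad():
    feats = model.forward_features(images)

patch_tokens = feats["x_norm_patchtokens"]     # (2, 256, 768): 16x16 patch grid, 768-dim tokens
cls_token = feats["x_norm_clstoken"]           # (2, 768): global class token
```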
2. SALAD: Sinkhorn Optimal Transport Aggregation
SALAD reframes NetVLAD's soft assignment of descriptors to clusters as an $\varepsilon$-regularized optimal transport problem. Given $n$ local features (patch tokens) and $m$ learnable cluster centers with an additional "dustbin" cluster, the optimal assignment $P$ minimizes the transport cost $\langle C, P \rangle$ minus the $\varepsilon$-weighted entropy $\varepsilon H(P)$. Feature-to-cluster (row) and cluster-to-feature (column) marginals enforce mass conservation, with the dustbin column selectively absorbing outlier features (Izquierdo et al., 2023, Gonzalez et al., 7 Nov 2025).
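Written out under this notation (with $\mathbf{1}$ the all-ones vector and $\mu$, $\nu$ the feature and cluster marginals; the exact marginal weighting is an assumption of this sketch), the entropic OT problem reads:

$$
\min_{P \ge 0} \; \langle C, P \rangle - \varepsilon H(P),
\qquad H(P) = -\sum_{i,k} P_{ik}\,(\log P_{ik} - 1),
\qquad \text{s.t. } P\,\mathbf{1}_{m+1} = \mu,\;\; P^{\top}\mathbf{1}_{n} = \nu,
$$

where the $(m{+}1)$-th column of $P$ is the dustbin.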
The cost matrix is computed via a small MLP mapping each token to affinity scores $S \in \mathbb{R}^{n \times m}$, with a learnable scalar score for the dustbin column. Sinkhorn iterations alternately renormalize the rows and columns of the kernel matrix so that the feature and cluster marginals are satisfied, producing a soft assignment matrix. After convergence, the non-informative dustbin column is dropped, yielding assignments over the $m$ clusters.
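A hedged sketch of log-domain Sinkhorn with a dustbin column is given below; the uniform marginals, shapes, and single learnable dustbin scalar mirror the description above but are assumptions, not the exact parameterization of the released SALAD code.

```python
# Sketch: Sinkhorn assignment with a dustbin column (log domain for stability).
# Marginals are set uniform here as a simplifying assumption; epsilon is taken
# to be folded into the input scores.
import math
import torch

def sinkhorn_with_dustbin(scores, dustbin_score, n_iters=10):
    """scores: (B, n, m) token-to-cluster affinities; dustbin_score: scalar tensor.
    Returns (B, n, m) soft assignments with the dustbin column dropped."""
    B, n, m = scores.shape
    bins = dustbin_score.expand(B, n, 1)                  # learnable dustbin column
    log_k = torch.cat([scores, bins], dim=-1)             # (B, n, m+1) log-kernel

    # Uniform log-marginals (assumption): each token carries mass 1/n,
    # each cluster plus the dustbin receives mass 1/(m+1).
    log_mu = torch.full((B, n), -math.log(n))
    log_nu = torch.full((B, m + 1), -math.log(m + 1))

    u = torch.zeros(B, n)
    v = torch.zeros(B, m + 1)
    for _ in range(n_iters):                              # alternate row/column scaling
        u = log_mu - torch.logsumexp(log_k + v.unsqueeze(1), dim=2)
        v = log_nu - torch.logsumexp(log_k + u.unsqueeze(2), dim=1)

    P = torch.exp(log_k + u.unsqueeze(2) + v.unsqueeze(1))  # (B, n, m+1)
    return P[..., :m]                                       # drop the dustbin column
```

The returned assignment weights are then used directly as pooling coefficients in the aggregation step described next.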
Compared to vanilla VLAD or dual-softmax assignment, Sinkhorn OT yields sharper, more semantically meaningful cluster assignments; ablations that remove the dustbin or replace Sinkhorn lead to notably reduced Recall@1 on MSLS Val (Izquierdo et al., 2023).
3. Building and Normalizing Global Descriptors
SALAD aggregates local features via a weighted sum: each patch token is projected to a reduced dimension $l$ with an MLP ($l = 128$ is typical), and each cluster bin is computed as the assignment-weighted sum of these projected features. Stacking the bins yields an $m \times l$ matrix, flattened to size $m \cdot l$ (e.g., $8192$ for $m = 64$, $l = 128$). Each bin is individually $L_2$-normalized ("intra-normalization"), then concatenated with the $L_2$-normalized global token (projected to $256$ dimensions). The resulting vector (e.g., $8448$-dim) is globally $L_2$-normalized for consistent cosine similarity scoring (Izquierdo et al., 2023, Gonzalez et al., 7 Nov 2025).
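A condensed sketch of this assembly, assuming the dimensions quoted above ($m = 64$, $l = 128$, $256$-dim global token, $8448$-dim descriptor); the projection module names are illustrative only.

```python
# Sketch of SALAD-style descriptor assembly: assignment-weighted cluster bins,
# intra-normalization, concatenation with the reduced global token, and a final
# global L2 normalization. Layer names (token_proj, cls_proj) are illustrative.
import torch
import torch.nn.functional as F

def build_descriptor(patch_tokens, cls_token, assign, token_proj, cls_proj):
    """
    patch_tokens: (B, n, 768)  DINOv2 patch features
    cls_token:    (B, 768)     DINOv2 class token
    assign:       (B, n, 64)   Sinkhorn assignments (dustbin already dropped)
    token_proj:   module mapping 768 -> 128
    cls_proj:     module mapping 768 -> 256
    """
    f = token_proj(patch_tokens)                        # (B, n, 128)
    bins = torch.einsum("bnk,bnl->bkl", assign, f)      # (B, 64, 128) weighted sums
    bins = F.normalize(bins, dim=-1)                    # intra-normalize each bin
    local = bins.flatten(1)                             # (B, 8192)
    g = F.normalize(cls_proj(cls_token), dim=-1)        # (B, 256) global token branch
    desc = torch.cat([local, g], dim=1)                 # (B, 8448)
    return F.normalize(desc, dim=1)                     # global L2 normalization
```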
Power normalization, $x \mapsto \operatorname{sign}(x)\,|x|^{\alpha}$, is optionally applied before the final $L_2$ normalization (Gonzalez et al., 7 Nov 2025).
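Continuing the sketch above, the optional signed power normalization is a one-line transform; the exponent `alpha` is a hyperparameter not fixed by the cited papers.

```python
# Optional signed power normalization before the final L2 step (alpha is a
# hyperparameter); `desc` continues from the build_descriptor sketch above.
alpha = 0.5
desc = torch.sign(desc) * desc.abs().pow(alpha)
desc = torch.nn.functional.normalize(desc, dim=1)
```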
In several variants, raw DINOv2 patch embeddings can also be pooled by self-attention, with the class token (or a learnable query) attending over key-projected patch features at a chosen softmax temperature, then aggregating values accordingly (Oquab et al., 2023).
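A sketch of this class-token attention pooling is given below; the projection layers and temperature are assumptions for illustration, not the exact parameterization in the cited work.

```python
# Sketch: attention pooling of patch tokens with the class token as the query.
# Projection modules and the softmax temperature are illustrative assumptions.
import torch

def attention_pool(patch_tokens, cls_token, q_proj, k_proj, v_proj, temperature=1.0):
    """patch_tokens: (B, n, D); cls_token: (B, D) -> pooled descriptor (B, D_v)."""
    q = q_proj(cls_token).unsqueeze(1)                                 # (B, 1, D)
    k = k_proj(patch_tokens)                                           # (B, n, D)
    v = v_proj(patch_tokens)                                           # (B, n, D_v)
    attn = torch.softmax(q @ k.transpose(1, 2) / temperature, dim=-1)  # (B, 1, n)
    return (attn @ v).squeeze(1)                                       # (B, D_v)
```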
4. Applications and Quantitative Impact
DINOv2+SALAD is validated in multiple domains:
- Visual Place Recognition (VPR): Achieves highest Recall@1 on all benchmarks: MSLS Challenge (75.0%), NordLand (76.0%), MSLS Val (92.2%), Pitts250k-test (95.1%), and SPED (92.1%) (Izquierdo et al., 2023).
- Loop Closure in SLAM: On S3LI “Etna”, pretrained DINOv2+SALAD scores P@1 = 71.16% (390 ms/query); fine-tuning pushes this to P@1 = 75.69%. SALAD screens candidate images, reducing the search space by ~20x and improving pose precision (lower yaw error) in low-texture scenes (Gonzalez et al., 7 Nov 2025).
- Volumetric Medical Classification: Slice-level attention pooling over DINOv2 embeddings outperforms classical multi-instance learning and 3D CNNs by 3–8% (accuracy/AUC), e.g., achieving 87.8% accuracy, 0.865 AUC in ADNI HC vs AD (Rafsani et al., 15 Sep 2025).
- Logical Anomaly Detection: DINOv2-SALAD’s composition map clustering and branch aggregation reach 96.1% AUROC on the MVTec LOCO anomaly benchmark, outperforming prior state-of-the-art methods (Fučka et al., 2 Sep 2025).
SALAD’s dustbin cluster and cluster-to-feature regularization are crucial for robustness, especially when many image patches carry background or noise.
Example Table: Recall@1 (%) of DINOv2+SALAD and Baselines on VPR Benchmarks
| Dataset | NetVLAD | GeM | MixVPR | EigenPlaces | DINOv2 SALAD |
|---|---|---|---|---|---|
| MSLS Challenge | 35.1 | 49.7 | 64.0 | 67.4 | 75.0 |
| NordLand | 32.6 | 21.6 | 58.4 | 54.4 | 76.0 |
| MSLS Val | 82.6 | 78.2 | 88.0 | 89.3 | 92.2 |
| Pitts250k-test | 90.5 | 87.0 | 94.6 | 94.1 | 95.1 |
| SPED | 78.7 | 66.7 | 85.2 | 69.9 | 92.1 |
5. Training Procedure and Computational Complexity
DINOv2+SALAD typically uses AdamW optimization with a low initial learning rate decayed over training. For fine-tuning the backbone (e.g., on VPR tasks), freezing all but the last few transformer blocks prevents overfitting; training all blocks actually reduces downstream performance (Izquierdo et al., 2023). Batch composition (e.g., 60 places × 4 images for VPR) and the Multi-Similarity loss are key for discriminative learning.
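A sketch of this fine-tuning setup is shown below; the number of unfrozen blocks, the learning rate, and the `.blocks` attribute of the hub model are assumptions or placeholders, and the SALAD head parameters are elided.

```python
# Sketch of the fine-tuning setup: freeze most DINOv2 blocks, train the last few
# plus the SALAD head with AdamW and a Multi-Similarity loss. The unfrozen block
# count and learning rate are placeholders; `.blocks` is assumed from the
# DINOv2 ViT implementation.
import torch
from pytorch_metric_learning.losses import MultiSimilarityLoss

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
for p in backbone.parameters():
    p.requires_grad = False
for block in backbone.blocks[-4:]:          # unfreeze only the last few blocks
    for p in block.parameters():
        p.requires_grad = True

trainable = [p for p in backbone.parameters() if p.requires_grad]
# trainable += list(salad_head.parameters())  # projection MLPs + assignment params (elided)
optimizer = torch.optim.AdamW(trainable, lr=6e-5, weight_decay=1e-4)
criterion = MultiSimilarityLoss()            # called as criterion(embeddings, place_labels)
```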
SALAD’s computational cost is dominated by the score computation ($O(nml)$), the Sinkhorn assignment ($O(nm)$ per iteration; typically around 10 iterations suffice), and the weighted pooling. With $m = 64$ clusters, $l = 128$ projected dimensions, and a few hundred patch tokens, the pipeline is feasible for GPU inference, achieving <3 ms/image throughput (Izquierdo et al., 2023).
Cluster centers and assignment parameters are generally pretrained and remain fixed; retraining SALAD on target data does not yield further improvement (Gonzalez et al., 7 Nov 2025). Descriptor dimensionality (e.g., $8448$) can be reduced post hoc (e.g., PCA to $8192$) without significant performance loss.
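A minimal sketch of such post-hoc reduction with PCA; the database size is a placeholder and the re-normalization reflects the cosine-similarity scoring used above.

```python
# Sketch: post-hoc PCA reduction of 8448-dim SALAD descriptors to 8192 dims,
# followed by re-normalization for cosine-similarity retrieval. The database
# size here is a placeholder.
import numpy as np
from sklearn.decomposition import PCA

descs = np.random.randn(10000, 8448).astype(np.float32)   # placeholder database descriptors
pca = PCA(n_components=8192).fit(descs)
reduced = pca.transform(descs)                             # (10000, 8192)
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)  # restore unit norm for cosine search
```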
6. Extensions Across Modalities and Tasks
SALAD aggregation with DINOv2 features extends beyond image retrieval:
- Medical Imaging: Attention-based slice aggregation addresses limited labeled data and class imbalance (Rafsani et al., 15 Sep 2025). Composite loss functions blend cross-entropy, supervised contrastive, and class-variance regularization to improve inter-class separability and compactness of volume-level embeddings.
- Logical Anomaly Detection: Composition map segmentation exploits DINOv2 self-supervised embeddings, clustering, and mask proposal fusion (SAM-HQ) to create semantic part maps; these are processed by UNet discriminative networks and Mahalanobis anomaly scoring in multi-branch architectures (Fučka et al., 2 Sep 2025).
- SLAM and Multi-modal Fusion: In planetary environments, DINOv2+SALAD screens visual candidates for pose estimation, fusing with geometric LiDAR descriptors for reliable loop closures without GNSS (Gonzalez et al., 7 Nov 2025). Compared to raw DINOv2, SALAD's OT assignment yields more interpretable, discriminative correspondences.
A plausible implication is that SALAD’s entropic regularization and dustbin mechanism offer generalizable pooling for any task requiring compact, context-weighted descriptors from foundation model features.
7. Limitations and Ongoing Developments
SALAD’s assignment matrix assumes mass conservation and may be sensitive to the number of clusters $m$ and the regularization parameter $\varepsilon$; optimal settings vary by domain and data regime. For highly imbalanced or noisy inputs, dustbin assignment may be crucial. In low-data medical regimes, domain-adaptive DINOv2 pretraining is proposed for further gains (Rafsani et al., 15 Sep 2025). Retraining cluster centers on small target sets generally degrades performance, suggesting the generality of large foundation-model pretraining (Gonzalez et al., 7 Nov 2025). Integration with multimodal inputs (e.g., PET, clinical scores) or hierarchical attention is suggested for future work.
SALAD’s interchangeability with attention pooling, VLAD-style aggregation, or Mahalanobis scoring enables flexible architectures for anomaly detection, classification, or retrieval. The combination of DINOv2 and SALAD currently sets a standard for single-stage visual representation in diverse, challenging domains.