DiVE: Fine-Tuning with Geometric Consistency
- DiVE is a robust fine-tuning methodology for vision-language models that preserves both global and local embedding geometries.
- It employs two quadratic penalty regularizers—the Average Vector Loss (AVL) and Pairwise Vector Loss (PVL)—to maintain geometric consistency during fine-tuning.
- DiVE significantly enhances out-of-distribution and zero-shot performance while nearly preserving the original embedding structure.
Difference Vector Equalization (DiVE) is a robust fine-tuning methodology designed for vision-language models (VLMs), such as CLIP, which are typically pre-trained with contrastive objectives. Unlike conventional robust fine-tuning strategies that risk distorting the geometric structure of embedding spaces (a structure crucial for maintaining out-of-distribution (OOD) and zero-shot generalization), DiVE introduces explicit constraints on the difference vectors between pre-trained and fine-tuned embeddings. By enforcing global and local geometric consistency through specialized regularizers, DiVE achieves state-of-the-art preservation of both representation structure and transferability during in-distribution (ID) adaptation.
1. Formal Definition and Notation
Let $f_{\theta}$ denote a pre-trained encoder and $f_{\theta^*}$ its fine-tuned counterpart. In dual-encoder VLMs, these correspond to separate image and text encoders. For a data point $x$ (image) or $t$ (text prompt), the difference vector encapsulates the impact of fine-tuning. Given a reference set of paired samples $\mathcal{R} = \{(x_i, t_i)\}_{i=1}^{N}$, define:

$$d_I(x_i) = f_{\theta^*}(x_i) - f_{\theta}(x_i), \qquad d_T(t_i) = g_{\phi^*}(t_i) - g_{\phi}(t_i),$$

where $\theta$ and $\phi$ parameterize the image and text encoders, respectively, and asterisks denote fine-tuned parameters.
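As a toy illustration of a difference vector, the sketch below uses hypothetical linear encoders (not the paper's actual architecture) standing in for the pre-trained and fine-tuned models:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy encoders: a pre-trained linear map and a slightly
# perturbed fine-tuned copy (stand-ins for real CLIP-style encoders).
W_pre = rng.normal(size=(4, 8))
W_ft = W_pre + 0.01 * rng.normal(size=(4, 8))

def encode(W, x):
    """Embed input x and L2-normalize, as in contrastively trained VLMs."""
    z = W @ x
    return z / np.linalg.norm(z)

x = rng.normal(size=8)                      # one reference image, as features
d = encode(W_ft, x) - encode(W_pre, x)      # difference vector for this sample
print(d.shape)  # (4,)
```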
2. Loss Formulation and Regularization
DiVE introduces two quadratic penalty regularizers to the standard contrastive fine-tuning loss. These regularizers enforce that the transformation induced by fine-tuning is nearly a rigid translation across the embedding space, limiting any warping detrimental to transfer ability.
2.1 Average Vector Loss (AVL)
Intuition: Utilize a global "center" $\bar{d}$, an exponential moving average of all difference vectors, so that all differences are nearly equal, thus maintaining the global structure.
Computation of $\bar{d}$ in each mini-batch of $B$ reference pairs:

$$\bar{d} \leftarrow \alpha\,\bar{d} + (1-\alpha)\,\frac{1}{2B}\sum_{i=1}^{B}\big(d_I(x_i) + d_T(t_i)\big),$$

with decay $\alpha$ (typ. $0.99$), and $\bar{d}$ initialized at $0$.

Loss:

$$\mathcal{L}_{\mathrm{AVL}} = \frac{1}{2B}\sum_{i=1}^{B}\Big(\big\|d_I(x_i) - \bar{d}\big\|_2^2 + \big\|d_T(t_i) - \bar{d}\big\|_2^2\Big).$$
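A minimal numpy sketch of the EMA center update and the AVL penalty, using randomly generated stand-in difference vectors in place of real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
B, dim = 16, 4
# Stand-in difference vectors for one mini-batch (image and text sides);
# in practice these come from pre-trained vs. fine-tuned encoders.
d_img = rng.normal(size=(B, dim))
d_txt = rng.normal(size=(B, dim))

alpha = 0.99             # EMA decay, as in the text
d_bar = np.zeros(dim)    # global center, initialized at 0

# EMA update toward the batch mean over all 2B difference vectors
all_d = np.concatenate([d_img, d_txt])            # shape (2B, dim)
d_bar = alpha * d_bar + (1 - alpha) * all_d.mean(axis=0)

# Quadratic penalty: mean squared distance of every difference vector
# to the shared center
avl = np.mean(np.sum((all_d - d_bar) ** 2, axis=1))
```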
2.2 Pairwise Vector Loss (PVL)
Intuition: Enforce, for each reference image–caption pair, that the corresponding difference vectors match, guaranteeing local cross-modal alignment.
Loss:

$$\mathcal{L}_{\mathrm{PVL}} = \frac{1}{B}\sum_{i=1}^{B}\big\|d_I(x_i) - d_T(t_i)\big\|_2^2.$$
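The pairwise penalty can be sketched the same way (random stand-in vectors again); note that the loss vanishes exactly when both modalities shift identically:

```python
import numpy as np

rng = np.random.default_rng(2)
B, dim = 16, 4
d_img = rng.normal(size=(B, dim))   # stand-in image difference vectors
d_txt = rng.normal(size=(B, dim))   # stand-in text difference vectors

# PVL: mean squared distance between the image- and text-side
# difference vectors of each reference pair
pvl = np.mean(np.sum((d_img - d_txt) ** 2, axis=1))

# When both modalities shift by the same vectors, the penalty is zero
pvl_matched = np.mean(np.sum((d_img - d_img) ** 2, axis=1))
```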
2.3 Combined Fine-tuning Objective
Starting from a contrastive loss $\mathcal{L}_{\mathrm{CL}}$ (as in FLYP) on the primary in-distribution data, the total objective is given by

$$\mathcal{L} = \mathcal{L}_{\mathrm{CL}} + \lambda\big(\mathcal{L}_{\mathrm{AVL}} + \mathcal{L}_{\mathrm{PVL}}\big),$$

where $\lambda$ is a regularization weight.
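The combined objective reduces to a weighted sum; the sketch below uses illustrative placeholder loss values and a hypothetical default weight `lam`, not the paper's settings:

```python
def dive_objective(l_contrastive, l_avl, l_pvl, lam=0.1):
    """Combined DiVE objective: contrastive fit plus weighted
    geometric-consistency penalties (a single shared weight is assumed)."""
    return l_contrastive + lam * (l_avl + l_pvl)

# Placeholder loss values for illustration only
loss = dive_objective(2.3, 0.4, 0.1, lam=0.1)   # = 2.3 + 0.1 * 0.5 = 2.35
```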
3. Preservation of Embedding Space Geometry
The design of DiVE targets two distinct aspects of geometric preservation:
- Global Consistency (AVL): The average vector loss compels all difference vectors to concentrate around a single mean vector $\bar{d}$, effecting a near-uniform translation for all embedding points. This ensures the global configuration is preserved: no embeddings drift unevenly, and pairwise relationships are maintained across the entire dataset.
- Local Alignment (PVL): The pairwise vector loss enforces that for every reference image–caption pair, the two modalities shift identically during fine-tuning, retaining subspace alignment critical for consistent cross-modal matching.
These constraints result in the updated embeddings being nearly an isometry (rigid translation) of the original ones.
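The claim that a uniform translation is an isometry is easy to verify numerically (random embeddings used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(10, 4))    # original embeddings
t = rng.normal(size=4)          # one shared translation vector
Z_shift = Z + t                 # every embedding moves by the same vector

def pairwise_dists(X):
    """Matrix of all pairwise Euclidean distances."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Pairwise distances are unchanged by the shared translation
same = np.allclose(pairwise_dists(Z), pairwise_dists(Z_shift))
```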
4. Comparison to Contrastive-Replay Fine-tuning Methods
Prior robust fine-tuning strategies, such as FLYP and ARF, rely on contrastive learning over a reference set but lack explicit geometric controls. Empirically, these methods substantially reduce the Representation Similarity Analysis (RSA) correlation (an indicator of preserved rank-order pairwise distances) between pre-trained and fine-tuned models on datasets like Flickr8K. This geometric distortion directly impairs OOD and zero-shot performance by disrupting the transferable structure on the hypersphere.
In contrast, DiVE retains an RSA correlation close to $1$, corresponding to an almost rigid translation of the embedding space. This geometric invariance is the principal mechanism underlying DiVE's enhanced generalization under domain shift and label-mismatch regimes.
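A minimal RSA sketch, assuming the common Spearman-rank formulation over pairwise embedding distances (the exact RSA variant used in the evaluation may differ):

```python
import numpy as np

def rsa_correlation(Z_a, Z_b):
    """Rank correlation between the pairwise-distance profiles of two
    embedding sets (Spearman = Pearson on ranks; assumes no ties)."""
    def upper_dists(X):
        diff = X[:, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))
        iu = np.triu_indices(len(X), k=1)
        return D[iu]                       # upper-triangular distances
    a, b = upper_dists(Z_a), upper_dists(Z_b)
    ranks_a = a.argsort().argsort()
    ranks_b = b.argsort().argsort()
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

rng = np.random.default_rng(4)
Z = rng.normal(size=(20, 4))
score = rsa_correlation(Z, Z + 1.0)   # pure translation preserves all distances
```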
5. Experimental Evaluation
DiVE was evaluated on a suite of standard ID, OOD, and zero-shot benchmarks with a ViT-B/16 backbone. The results demonstrate:
| Method | ID (ImageNet) | OOD avg | ZS avg |
|---|---|---|---|
| Vanilla FT | 81.3% | 54.9% | 35.7% |
| FLYP / ARF | 82–83% | ~59–61% | 50–56% |
| DiVE | 82.5% | 63.2% | 63.7% |
- In-Distribution (ID): Evaluated on ImageNet, iWildCam (macro-F₁), and FMoW (worst-region OOD).
- Out-of-Distribution (OOD): ImageNet derivatives (V2, R, A, Sketch, ObjectNet), iWildCam (held-out cameras), FMoW (held-out regions).
- Zero-Shot (ZS): Ten public classification benchmarks, including Caltech-101, Flowers, Food-101, SUN397, etc., assessed with the prompt “a photo of a [class]”.
DiVE achieves second-best ID accuracy, but distinctly superior performance in OOD and zero-shot settings, exceeding FLYP by +3–4 percentage points in OOD and +14 in ZS metrics.
6. Computational Considerations and Limitations
- Resource Demand: DiVE incurs higher computational cost due to extra forward/backward passes over the reference set in each batch; on a ViT-B/16 + ImageNet configuration, both epoch-wise training time and GPU memory usage increase accordingly.
- Hyperparameter Sensitivity: Effective deployment is contingent on selecting suitable values for $\lambda$ (regularization weight), $\alpha$ (moving-average decay), and the reference-set size. Diminished performance is observed with small reference sets.
- Scalability: Since each batch requires additional operations over the full reference set, applying DiVE to very large-scale models is computationally intensive. Reducing overhead via alternatives (e.g., gradient-match surrogates for difference vectors) remains an open research direction.
- Theoretical Understanding: A comprehensive theoretical account explaining the optimality of uniform difference vectors for generalization is still forthcoming.
7. Broader Implications
By constraining the difference vectors rather than re-optimizing pairwise similarities, DiVE enables fine-tuning on new data without destroying the geometric structure that underpins OOD and zero-shot transfer robustness. This method demonstrates that a uniform, nearly isometric transformation of the embedding manifold suffices to balance adaptation with generalization, suggesting that future robust adaptation research may benefit from geometric regularization paradigms. Further scaling and theoretical advances will determine its applicability to ever-larger multi-modal and unimodal pre-trained models.