
DiVE: Fine-Tuning with Geometric Consistency

Updated 15 November 2025
  • DiVE is a robust fine-tuning methodology for vision-language models that preserves both global and local embedding geometries.
  • It employs two quadratic penalty regularizers—the Average Vector Loss (AVL) and Pairwise Vector Loss (PVL)—to maintain geometric consistency during fine-tuning.
  • DiVE significantly enhances out-of-distribution and zero-shot performance while nearly preserving the original embedding structure.

Difference Vector Equalization (DiVE) is a robust fine-tuning methodology designed for vision-language models (VLMs), such as CLIP, which are typically pre-trained with contrastive objectives. Unlike conventional robust fine-tuning strategies that risk distorting the geometric structure of embedding spaces, which is crucial for maintaining out-of-distribution (OOD) and zero-shot generalization, DiVE introduces explicit constraints on the difference vectors between pre-trained and fine-tuned embeddings. By enforcing global and local geometric consistency through specialized regularizers, DiVE achieves state-of-the-art preservation of both representation structure and transferability during in-distribution (ID) adaptation.

1. Formal Definition and Notation

Let $\varphi_{\mathrm{pre}}$ denote a pre-trained encoder and $\varphi_{\mathrm{ft}}$ its fine-tuned counterpart. In dual-encoder VLMs, these correspond to separate image ($f$) and text ($g$) encoders. For a data point $x$ (image) or $t$ (text prompt), the difference vector encapsulates the impact of fine-tuning:

$$\Delta_x := \varphi_{\mathrm{ft}}(x) - \varphi_{\mathrm{pre}}(x), \qquad \Delta_t := \varphi_{\mathrm{ft}}(t) - \varphi_{\mathrm{pre}}(t)$$

Given a reference set $R$ of paired samples $\{(x_i^{\mathrm{ref}}, t_i^{\mathrm{ref}})\}$, define

$$u(x_i^{\mathrm{ref}}) = f_{\theta_{\mathrm{ft}}}(x_i^{\mathrm{ref}}) - f_{\theta_{\mathrm{pre}}}(x_i^{\mathrm{ref}})$$

$$v(t_i^{\mathrm{ref}}) = g_{\phi_{\mathrm{ft}}}(t_i^{\mathrm{ref}}) - g_{\phi_{\mathrm{pre}}}(t_i^{\mathrm{ref}})$$

where $\theta, \phi$ parameterize the image and text encoders, respectively.
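As a minimal NumPy sketch, the difference vectors reduce to simple subtractions of embeddings. The linear maps below are hypothetical stand-ins for the encoder towers; in practice they would be the frozen pre-trained and trainable fine-tuned CLIP encoders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in encoders: a frozen "pre-trained" linear map and a
# slightly perturbed "fine-tuned" copy (real encoders would be CLIP towers).
W_img_pre = rng.standard_normal((8, 4))
W_img_ft = W_img_pre + 0.1 * rng.standard_normal((8, 4))
W_txt_pre = rng.standard_normal((6, 4))
W_txt_ft = W_txt_pre + 0.1 * rng.standard_normal((6, 4))

def difference_vectors(x_ref, t_ref):
    """u and v: per-sample embedding shifts induced by fine-tuning."""
    u = x_ref @ W_img_ft - x_ref @ W_img_pre  # image-side shifts, shape (B', d)
    v = t_ref @ W_txt_ft - t_ref @ W_txt_pre  # text-side shifts, shape (B', d)
    return u, v

x_ref = rng.standard_normal((5, 8))  # B' = 5 reference images
t_ref = rng.standard_normal((5, 6))  # their paired captions
u, v = difference_vectors(x_ref, t_ref)
```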

2. Loss Formulation and Regularization

DiVE introduces two quadratic penalty regularizers to the standard contrastive fine-tuning loss. These regularizers enforce that the transformation induced by fine-tuning is nearly a rigid translation across the embedding space, limiting any warping detrimental to transfer ability.

2.1 Average Vector Loss (AVL)

Intuition: Utilize a global "center" $m$, an exponential moving average of all difference vectors, so that all differences are nearly equal, thus maintaining the global structure.

Computation of $m$ in each mini-batch of $B'$ reference pairs:

$$m \leftarrow \alpha\, m_{\text{prev}} + (1-\alpha)\, \frac{1}{B'} \sum_{j=1}^{B'} \frac{u(x_j^{\mathrm{ref}}) + v(t_j^{\mathrm{ref}})}{2}$$

with decay $\alpha \in [0,1)$ (typically $0.99$) and $m_{\text{prev}}$ initialized at $0$.

Loss:

$$L_{\mathrm{avl}} = \frac{1}{B'} \sum_{j=1}^{B'} \left[\|u(x_j^{\mathrm{ref}}) - m\|^2 + \|v(t_j^{\mathrm{ref}}) - m\|^2\right]$$
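A sketch of the EMA center update and the AVL penalty, assuming `u` and `v` are arrays of shape `(B', d)` holding the batch's image- and text-side difference vectors:

```python
import numpy as np

def update_center(m_prev, u, v, alpha=0.99):
    """EMA update of the global center m over the batch mean of (u + v) / 2."""
    batch_mean = ((u + v) / 2.0).mean(axis=0)
    return alpha * m_prev + (1.0 - alpha) * batch_mean

def avl_loss(u, v, m):
    """Average Vector Loss: squared distance of every difference vector to m."""
    sq_img = ((u - m) ** 2).sum(axis=1)  # ||u_j - m||^2 per sample
    sq_txt = ((v - m) ** 2).sum(axis=1)  # ||v_j - m||^2 per sample
    return float((sq_img + sq_txt).mean())
```

When every difference vector already equals $m$, the loss is zero, i.e. fine-tuning acts as a pure translation.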

2.2 Pairwise Vector Loss (PVL)

Intuition: Enforce, for each reference image–caption pair, that the two corresponding difference vectors match, guaranteeing local cross-modal alignment.

Loss:

$$L_{\mathrm{pvl}} = \frac{1}{B'} \sum_{j=1}^{B'} \|u(x_j^{\mathrm{ref}}) - v(t_j^{\mathrm{ref}})\|^2$$
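The PVL penalty is a direct translation of the formula, again for `u` and `v` of shape `(B', d)`:

```python
import numpy as np

def pvl_loss(u, v):
    """Pairwise Vector Loss: each image/caption pair must shift identically."""
    return float(((u - v) ** 2).sum(axis=1).mean())
```

It vanishes exactly when each image and its paired caption move by the same vector during fine-tuning.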

2.3 Combined Fine-tuning Objective

Starting from a contrastive loss $L_{\mathrm{cl}}$ (as in FLYP) on the primary in-distribution data,

$$L_{\mathrm{cl}} = - \frac{1}{2B} \sum_{i=1}^B \log \frac{\exp(f_{\mathrm{ft}}(x_i) \cdot g_{\mathrm{ft}}(t_i)/\tau)}{\sum_k \exp(f_{\mathrm{ft}}(x_i) \cdot g_{\mathrm{ft}}(t_k)/\tau)} - \frac{1}{2B} \sum_{i=1}^B \log \frac{\exp(f_{\mathrm{ft}}(x_i) \cdot g_{\mathrm{ft}}(t_i)/\tau)}{\sum_k \exp(f_{\mathrm{ft}}(x_k) \cdot g_{\mathrm{ft}}(t_i)/\tau)}$$

the total objective is given by

$$L_{\mathrm{final}} = L_{\mathrm{cl}} + \lambda \, (L_{\mathrm{avl}} + L_{\mathrm{pvl}})$$

where $\lambda$ is a regularization weight and $\tau$ the softmax temperature.
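Putting the pieces together, a NumPy sketch of the full objective. `clip_contrastive_loss` is the standard symmetric InfoNCE form shown above (embeddings are L2-normalized first, as is conventional for CLIP); the two penalty terms are passed in as precomputed scalars:

```python
import numpy as np

def clip_contrastive_loss(img, txt, tau=0.07):
    """Symmetric InfoNCE loss over matched image/text embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                     # (B, B) similarity matrix
    # log-softmax over rows (image -> text) and columns (text -> image)
    log_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(len(img))
    return float(-(log_i2t[diag, diag].mean() + log_t2i[diag, diag].mean()) / 2)

def dive_objective(l_cl, l_avl, l_pvl, lam=1.0):
    """Total DiVE loss: contrastive term plus weighted geometric penalties."""
    return l_cl + lam * (l_avl + l_pvl)
```

With perfectly matched, mutually orthogonal pairs the contrastive term is near zero, and the geometric penalties add on top with weight `lam`.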

3. Preservation of Embedding Space Geometry

The design of DiVE targets two distinct aspects of geometric preservation:

  • Global Consistency (AVL): The average vector loss compels all difference vectors to concentrate around a single mean vector $m$, effecting a near-uniform translation for all embedding points. This ensures the global configuration is preserved: no embeddings drift unevenly, maintaining pairwise relationships across the entire dataset.
  • Local Alignment (PVL): The pairwise vector loss enforces that for every reference image–caption pair, the two modalities shift identically during fine-tuning, retaining subspace alignment critical for consistent cross-modal matching.

These constraints result in the updated embeddings being nearly an isometry (rigid translation) of the original ones.

4. Comparison to Contrastive-Replay Fine-tuning Methods

Prior robust fine-tuning strategies, such as FLYP and ARF, rely on contrastive learning over a reference set but lack explicit geometric controls. Empirically, these methods reduce the Representation Similarity Analysis (RSA) correlation (an indicator of preserved rank-order pairwise distances) between pre-trained and fine-tuned models to $\sim 0.83{-}0.85$ on datasets like Flickr8K. This geometric distortion directly impairs OOD and zero-shot performance by disrupting the transferable structure on the hypersphere.

In contrast, DiVE retains an RSA correlation of $\approx 0.98$, corresponding to an almost rigid translation of the embedding space. This geometric invariance is the principal mechanism underlying DiVE’s enhanced generalization under domain shift and label-mismatch regimes.
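To illustrate why a rigid translation scores perfectly under RSA, here is one common way to compute an RSA score; the distance metric (Euclidean) and correlation (Spearman, without tie handling) are assumptions here, and the paper's exact choices may differ:

```python
import numpy as np

def rsa_correlation(emb_a, emb_b):
    """Spearman correlation between the pairwise-distance structures
    of two embedding sets."""
    def pdist_flat(e):
        d = np.linalg.norm(e[:, None, :] - e[None, :, :], axis=-1)
        iu = np.triu_indices(len(e), k=1)
        return d[iu]  # upper-triangular distances, flattened

    def ranks(x):
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))  # rank transform (assumes no ties)
        return r

    ra, rb = ranks(pdist_flat(emb_a)), ranks(pdist_flat(emb_b))
    return float(np.corrcoef(ra, rb)[0, 1])
```

Translating every embedding by the same vector leaves all pairwise distances unchanged, so the RSA score stays at exactly 1; uneven drift degrades it.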

5. Experimental Evaluation

DiVE was evaluated on a suite of standard ID, OOD, and zero-shot benchmarks with a ViT-B/16 backbone. The results demonstrate:

Method       ID (ImageNet)   OOD avg     ZS avg
Vanilla FT   81.3%           54.9%       35.7%
FLYP / ARF   82–83%          ~59–61%     50–56%
DiVE         82.5%           63.2%       63.7%
  • In-Distribution (ID): Evaluated on ImageNet, iWildCam (macro-F₁), and FMoW (worst-region OOD).
  • Out-of-Distribution (OOD): ImageNet derivatives (V2, R, A, Sketch, ObjectNet), iWildCam (held-out cameras), FMoW (held-out regions).
  • Zero-Shot (ZS): Ten public classification benchmarks, including Caltech-101, Flowers, Food-101, SUN397, etc., assessed with the prompt “a photo of a [class]”.
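The zero-shot protocol above amounts to nearest-prompt classification under cosine similarity. A sketch, with illustrative names (real evaluation would embed images and the "a photo of a [class]" prompts with the CLIP encoders):

```python
import numpy as np

def zero_shot_predict(image_embs, class_prompt_embs):
    """Assign each image to the class whose prompt embedding
    ("a photo of a [class]") is most cosine-similar."""
    im = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    tx = class_prompt_embs / np.linalg.norm(class_prompt_embs, axis=1, keepdims=True)
    return (im @ tx.T).argmax(axis=1)  # predicted class index per image
```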

DiVE achieves second-best ID accuracy but distinctly superior OOD and zero-shot performance, exceeding FLYP by 3–4 percentage points on OOD and roughly 14 points on ZS metrics.

6. Computational Considerations and Limitations

  • Resource Demand: DiVE incurs higher computational cost due to forward/backward passes over the reference set in each batch. On a ViT-B/16 + ImageNet configuration, epoch-wise training time and GPU memory usage increase by $\sim 1.6\times$ and $\sim 2.7\times$, respectively.
  • Hyperparameter Sensitivity: Effective deployment depends on selecting suitable values for $\lambda$ (regularization weight), $\alpha$ (moving-average decay), and the reference-set size. Diminished performance is observed with small reference sets.
  • Scalability: Since each batch requires operations over the full reference set, applying DiVE to very large-scale models is computationally intensive. Reducing overhead via alternatives (e.g., gradient-match surrogates for difference vectors) remains an open research direction.
  • Theoretical Understanding: A comprehensive theoretical account explaining the optimality of uniform difference vectors for generalization is still forthcoming.

7. Broader Implications

By constraining the difference vectors rather than re-optimizing pairwise similarities, DiVE enables fine-tuning on new data without destroying the geometric structure that underpins OOD and zero-shot transfer robustness. The method shows that a uniform, nearly isometric transformation of the embedding manifold suffices to balance adaptation with generalization, suggesting that future robust-adaptation research may benefit from geometric regularization paradigms. Further scaling and theoretical advances remain open and will determine applicability to ever-larger multi-modal and unimodal pre-trained models.
