Learning Dynamics of VLM Finetuning (2510.11978v1)

Published 13 Oct 2025 in cs.LG and cs.AI

Abstract: Preference-based finetuning of vision-language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as learning-dynamics-aware optimization and introduce Cooling-Weighted DPO (CW-DPO), a two-stage recipe that explicitly models and exploits the training trajectory. Stage 1 performs supervised finetuning with gentle negatives: low-weight smoothed supervision that regularizes the base policy and curbs overconfidence without explicit penalties. Stage 2 applies a DPO objective in which the negative term is scaled by a cooling weight computed from the model's average token log-probability on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize on-policy negatives and allow mixed negatives by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields more stable optimization, better calibration, and higher pairwise win-rates than SFT-only and vanilla DPO, while converging in fewer steps. Ablations isolate the cooling-weight mechanism as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that smoothing learning dynamics before cooling preferences is a simple, general principle for robust VLM alignment.
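
The core mechanism in Stage 2, cooling the DPO negative term by the model's average token log-probability on the rejected response, can be illustrated with a short loss function. The sketch below is not the authors' implementation: the sigmoid form of the cooling weight and the cool_tau and cool_temp hyperparameters are assumptions for illustration, since the abstract specifies only that the negative term is scaled by a weight computed from that average log-probability.

```python
# Minimal sketch of a cooling-weighted DPO loss, assuming a sigmoid-shaped
# cooling weight over the policy's average per-token log-probability on the
# rejected (negative) response. The weight form and thresholds are illustrative
# assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def cw_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed log-probs of chosen tokens under the policy
    policy_rejected_logps: torch.Tensor,  # summed log-probs of rejected tokens under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    rejected_lengths: torch.Tensor,       # token counts of the rejected responses
    beta: float = 0.1,                    # standard DPO inverse-temperature
    cool_tau: float = -1.0,               # hypothetical threshold on the avg token log-prob
    cool_temp: float = 0.5,               # hypothetical sharpness of the cooling transition
) -> torch.Tensor:
    # Average token log-probability of each negative under the current policy.
    avg_rejected_logp = policy_rejected_logps / rejected_lengths.clamp(min=1)

    # Cooling weight in (0, 1): close to 1 for hard negatives the model still
    # assigns high probability, close to 0 for easy or off-distribution
    # negatives, whose gradients would be uninformative.
    cooling_w = torch.sigmoid((avg_rejected_logp - cool_tau) / cool_temp).detach()

    # Vanilla-DPO implicit-reward margins, with the negative margin cooled.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - cooling_w * rejected_margin)

    return -F.logsigmoid(logits).mean()
```

Detaching the cooling weight (another assumption here) makes it act as a data-dependent gate on the negative gradient rather than a term the optimizer can exploit, which matches the abstract's framing of suppressing uninformative gradients while preserving signal from hard negatives.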
