arXiv:2510.11978v1 Announce Type: cross
Abstract: Preference-based finetuning of vision–language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as learning-dynamics-aware optimization and introduce Cooling-Weighted DPO (CW-DPO), a two-stage recipe that explicitly models and exploits the training trajectory. Stage 1 performs supervised finetuning with gentle negatives: low-weight smoothed supervision that regularizes the base policy and curbs overconfidence without explicit penalties. Stage 2 applies a DPO objective in which the negative term is scaled by a cooling weight computed from the model's average token log-probability on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize on-policy negatives and allow mixed negatives by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields more stable optimization, better calibration, and higher pairwise win-rates than SFT-only and vanilla DPO, while converging in fewer steps. Ablations isolate the cooling-weight mechanism as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that smoothing learning dynamics before cooling preferences is a simple, general principle for robust VLM alignment.
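To make the Stage 2 mechanism concrete, below is a minimal PyTorch sketch of a cooling-weighted DPO loss. The abstract only states that the negative term is scaled by a weight derived from the model's average token log-probability on each negative; the sigmoid form of the cooling function, the temperature tau, and all function and argument names here are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def cw_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
                avg_token_logp_neg, beta=0.1, tau=1.0):
    """Hypothetical cooling-weighted DPO loss (sketch, not the paper's code).

    logp_pos / logp_neg: policy sequence log-probs of chosen / rejected responses.
    ref_logp_pos / ref_logp_neg: frozen reference-model sequence log-probs.
    avg_token_logp_neg: policy's average per-token log-prob on each negative.
    beta, tau: assumed temperature hyperparameters.
    """
    # Standard DPO log-ratio margins against the reference policy.
    pos_margin = beta * (logp_pos - ref_logp_pos)
    neg_margin = beta * (logp_neg - ref_logp_neg)

    # Cooling weight (assumed sigmoid form): negatives the model already
    # assigns very low average token log-prob ("easy" or off-distribution)
    # get a small weight and contribute little gradient; harder negatives
    # keep a larger weight.
    cooling = torch.sigmoid(avg_token_logp_neg / tau).detach()

    # Scale only the negative term, as the abstract describes.
    logits = pos_margin - cooling * neg_margin
    return -F.logsigmoid(logits).mean()

The same per-example quantities (sequence log-probs on positives and negatives) could also be logged over training to implement the $\Delta\!\log p$ probes the abstract mentions for early stopping and failure diagnosis.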


