arXiv:2510.11978v1 Announce Type: cross
Abstract: Preference-based finetuning of vision–language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as learning-dynamics-aware optimization and introduce Cooling-Weighted DPO (CW-DPO), a two-stage recipe that explicitly models and exploits the training trajectory. Stage 1 performs supervised finetuning with gentle negatives: low-weight smoothed supervision that regularizes the base policy and curbs overconfidence without explicit penalties. Stage 2 applies a DPO objective in which the negative term is scaled by a cooling weight computed from the model's average token log-probability on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize on-policy negatives and allow mixed negatives by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta\!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields more stable optimization, better calibration, and higher pairwise win-rates than SFT-only and vanilla DPO, while converging in fewer steps. Ablations isolate the cooling-weight mechanism as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that smoothing learning dynamics before cooling preferences is a simple, general principle for robust VLM alignment.
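
The abstract does not give the exact functional form of the cooling weight, so the following is only a minimal PyTorch sketch of the Stage 2 objective, assuming a sigmoid-shaped weight over the negative's average token log-probability; the names `cw_dpo_loss`, `beta`, and `tau` are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cw_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                rejected_avg_token_logps, beta=0.1, tau=1.0):
    """Sketch of a cooling-weighted DPO objective (Stage 2).

    All *_logps arguments are sequence-level log-probabilities under the
    policy / reference model; rejected_avg_token_logps is the policy's
    average per-token log-probability on each negative.
    """
    # Assumed cooling weight: monotone in the model's confidence on the
    # negative. Hard negatives (avg token log-prob near 0) keep most of
    # their weight; easy or off-distribution negatives (very low
    # log-prob) are cooled toward 0 and contribute little gradient.
    w = torch.sigmoid(rejected_avg_token_logps / tau)

    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps

    # Standard DPO preference logit, with the negative term scaled by w.
    logits = beta * (chosen_margin - w * rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy call with dummy sequence-level log-probabilities for two pairs.
loss = cw_dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -40.0]),
    ref_chosen_logps=torch.tensor([-13.0, -16.0]),
    ref_rejected_logps=torch.tensor([-13.5, -38.0]),
    rejected_avg_token_logps=torch.tensor([-0.7, -2.0]),  # hard vs. easy negative
)
```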
