arXiv:2601.17260v1 Announce Type: cross
Abstract: Direct Preference Optimization (DPO) is often tuned as if increasing alignment pressure (controlled by $\beta$) yields progressively “better” behavior. We instead treat $\beta$ as a control parameter and densely sweep it for three 7B open-weight families under a fixed DPO recipe. In Mistral, capability is sharply non-monotonic: aggregated logic-probe margins become positive only in a narrow band near $\beta \approx 10^{-2}$ and revert outside it, with boundary points that are seed-sensitive. Across architectures under the same sweep, we observe qualitatively different response modes: sharp reorganization in Mistral, selective changes in Llama, and smooth trade-offs in Qwen. Critically, the DPO preference margin can anticorrelate with reasoning capability (Pearson $r=-0.91$ for Llama logic), so margin-based selection can prefer capability-impaired models. Training path also matters: exposure to high $\beta$ induces capability losses that persist even after $\beta$ is reduced (hysteresis). These findings motivate capability-resolved evaluation across the $\beta$ landscape rather than reliance on margins or aggregate benchmarks.
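For readers who want the quantities in the abstract made concrete, the following is a minimal sketch of the standard per-example DPO preference margin and loss, plus a plain Pearson correlation of the kind one might use to compare margins against a capability score across a $\beta$ sweep. The function names (`dpo_margin`, `dpo_loss`, `pearson_r`) and all numbers are illustrative assumptions, not the paper's code or data.

```python
import math

def dpo_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta):
    """Implicit reward margin for one preference pair.

    Inputs are summed sequence log-probabilities under the policy and the
    frozen reference model; beta scales the log-ratio difference.
    """
    chosen_logratio = pol_chosen - ref_chosen
    rejected_logratio = pol_rejected - ref_rejected
    return beta * (chosen_logratio - rejected_logratio)

def dpo_loss(margin):
    """Per-example DPO loss, -log(sigmoid(margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    # Hypothetical log-probabilities for a single preference pair.
    pair = dict(pol_chosen=-10.0, pol_rejected=-12.0,
                ref_chosen=-11.0, ref_rejected=-11.0)
    # Sweep beta over a few orders of magnitude, as a control parameter.
    for beta in (1e-3, 1e-2, 1e-1, 1.0):
        m = dpo_margin(beta=beta, **pair)
        print(f"beta={beta:g}  margin={m:.4f}  loss={dpo_loss(m):.4f}")

    # Made-up numbers only: margins rising while a capability score falls.
    margins = [0.1, 0.4, 0.9, 1.5]
    capability = [0.62, 0.55, 0.41, 0.30]
    print("pearson r:", round(pearson_r(margins, capability), 3))
```

A strongly negative `pearson_r` on real sweep data would correspond to the situation the abstract warns about, where picking the checkpoint with the largest margin also picks the one with the weakest capability score.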
Infectious disease burden and surveillance challenges in Jordan and Palestine: a systematic review and meta-analysis
Background: Jordan and Palestine face public health challenges due to infectious diseases, with the added detrimental factors of long-term conflict, forced relocation, and lack of resources.


