Epistemic Uncertainty for Test-Time Discovery

arXiv:2605.11328v1 Announce Type: cross Abstract: Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which

Interpretability Can Be Actionable

arXiv:2605.11161v1 Announce Type: cross Abstract: Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of

arXiv:2605.10998v1 Announce Type: cross Abstract: Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI’s fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only $1.7, $1.7, $0.3, and $0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair.
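To make the data scale concrete: a dataset of the kind the abstract describes is just 10 JSONL records, each pairing a harmless prompt with a helpful preferred response and a refusal as the dispreferred response. The sketch below is illustrative only — the prompts and field names (`input`, `preferred_output`, `non_preferred_output`, following the general shape of OpenAI's preference fine-tuning format) are assumptions, not taken from the paper, and should be checked against the current API documentation.

```python
import json

# Hypothetical harmless prompts; any 10 benign questions would do.
BENIGN_PROMPTS = [
    "How do I sort a list in Python?",
    "What's a good recipe for banana bread?",
    "Explain photosynthesis in one paragraph.",
    "How do I set up a Git repository?",
    "What are the rules of chess castling?",
    "Suggest a beginner workout routine.",
    "How does a hash table work?",
    "What's the capital of Australia?",
    "How do I brew pour-over coffee?",
    "Summarize the plot of Hamlet.",
]

def make_pair(prompt: str) -> dict:
    """Build one preference record: helpful answer preferred, refusal dispreferred."""
    return {
        "input": {"messages": [{"role": "user", "content": prompt}]},
        "preferred_output": [
            {"role": "assistant",
             "content": "A brief, genuinely helpful answer would go here."}
        ],
        "non_preferred_output": [
            {"role": "assistant",
             "content": "I'm sorry, but I can't help with that."}
        ],
    }

pairs = [make_pair(p) for p in BENIGN_PROMPTS]

# Write the minimum-size dataset (10 pairs) as JSONL.
with open("benign_dpo.jsonl", "w") as f:
    for rec in pairs:
        f.write(json.dumps(rec) + "\n")
```

Nothing in these records is individually suspicious — each one simply expresses "answer, don't refuse" on a benign question — which is exactly why the abstract argues this failure mode is hard to audit from the request alone.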


Copyright 2025 dijee Intelligence Ltd., a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK; registration number 16808844.