May 13, 2026 – Page 3 – dijee Pharma Intelligence

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

arXiv:2605.11538v1 Announce Type: cross Abstract: Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to effectively balance the tradeoff between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed […]

May 13, 2026

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

arXiv:2605.10998v1 Announce Type: cross Abstract: Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) […]

May 13, 2026

An Executable Benchmarking Suite for Tool-Using Agents

arXiv:2605.11030v1 Announce Type: cross Abstract: Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a […]

May 13, 2026

Deploying Self-Supervised Learning for Real Seismic Data Denoising

arXiv:2605.11109v1 Announce Type: cross Abstract: Self-supervised learning (SSL) has emerged as a promising approach to seismic data denoising as it does not require clean reference data. In this work, the deployment of the Noisy-as-Clean (NaC) method was evaluated for real seismic data denoising under controlled conditions. Two independent seismic acquisitions, each comprising noisy and filtered […]

May 13, 2026

Exploring Token-Space Manipulation in Latent Audio Tokenizers

arXiv:2605.11192v1 Announce Type: cross Abstract: Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set […]

May 13, 2026

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

arXiv:2605.11169v1 Announce Type: new Abstract: Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM […]

May 13, 2026

Rethinking external validation for the target population: Capturing patient-level similarity with a generative model

arXiv:2605.11284v1 Announce Type: cross Abstract: Background: External validation is essential for assessing the transportability of predictive models. However, its interpretation is often confounded by differences between external and development populations. This study introduces a framework to distinguish model deficiencies from case-mix effects. Method: We propose a framework that quantifies each external patient’s similarity to the […]

May 13, 2026

Architecture Determines Observability of Transformers

arXiv:2604.24801v3 Announce Type: replace-cross Abstract: Autoregressive transformers make confident errors that output-confidence monitoring cannot catch. Activation monitors catch them only when training leaves a decision-quality signal beyond what the output already exposes. This signal is an architectural property of the trained model, fixed upstream of any monitor. Controlling for output confidence removes 60.3% of the […]

May 13, 2026

LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows

arXiv:2605.11368v1 Announce Type: cross Abstract: We study the application of recent Edit Flows for inference-time reward control for DNA sequence generation. Unlike most reward-guided DNA generation frameworks, which operate on fixed-length sequence spaces, Edit Flows have a potential to generate variable-length DNA through biologically plausible insertion, deletion, and substitution operations. In particular, we propose Local […]

May 13, 2026

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

arXiv:2605.11182v1 Announce Type: new Abstract: On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model’s own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies […]

May 13, 2026

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

arXiv:2605.11467v1 Announce Type: cross Abstract: Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of reasoning theater: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce ProFIL (Probe-Filtered Reinforcement Learning) to reduce theater, increase chain-of-thought faithfulness, and shrink […]

May 13, 2026

Modulation Consistency-based Contrastive Learning for Self-Supervised Automatic Modulation Classification

arXiv:2605.11875v1 Announce Type: cross Abstract: Deep learning-based AMC methods have achieved remarkable performance, but their practical deployment remains constrained by the high cost of labeled data. Although self-supervised learning (SSL) reduces the reliance on labels, existing SSL-based AMC methods often rely on task-agnostic pretext objectives misaligned with modulation classification, leading to representations entangled with nuisance […]

May 13, 2026
