Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models

arXiv:2605.25601v1 Announce Type: cross Abstract: Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large language models could support such practice by simulating students with known skill components, enabling teachers to rehearse explanations, diagnoses, and instructional responses. For this purpose, however, the central requirement is neither to maximize benchmark […]

Persona-Model Collapse in Emergent Misalignment

arXiv:2605.12850v2 Announce Type: replace-cross Abstract: Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model’s internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using […]

Efficient Benchmarking Is Just Feature Selection and Multiple Regression

arXiv:2605.25773v1 Announce Type: cross Abstract: Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark’s questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by […]

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

arXiv:2605.25388v1 Announce Type: cross Abstract: Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we […]

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

arXiv:2605.25891v1 Announce Type: cross Abstract: We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model’s hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). […]

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

arXiv:2510.01389v2 Announce Type: replace-cross Abstract: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present textbfINSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $pi_0$-FAST as the underlying model, we extract […]

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

arXiv:2605.26026v1 Announce Type: cross Abstract: Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of […]

CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification

arXiv:2602.15811v2 Announce Type: replace-cross Abstract: Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study a task-incremental continual learning setting for chest radiograph classification under task-unknown inference, where heterogeneous chest X-ray datasets arrive sequentially and task […]

PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

arXiv:2502.10906v2 Announce Type: replace Abstract: Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and controlling robots using large language models (LLMs). In the content generation literature, there has been early work […]

When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure

arXiv:2605.23932v1 Announce Type: new Abstract: Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning initial correct diagnosis under escalating pressure. We propose textbftextscMed-Stress, a targeted stress test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical […]

Chain-of-Thought Hijacking

arXiv:2510.26418v4 Announce Type: replace Abstract: Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective […]

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

arXiv:2605.15777v2 Announce Type: replace Abstract: Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to […]

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844