Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
arXiv:2506.03610v3 Announce Type: replace Abstract: Large Language Model (LLM) agents are reshaping the game industry by enabling more intelligent and human-preferable characters. Yet current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt […]
CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
arXiv:2604.13561v1 Announce Type: cross Abstract: Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric […]
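The symmetric contrastive objective used by dual-encoder models such as Merlin can be illustrated with a minimal NumPy sketch. The function name, temperature value, and loss details below are generic CLIP-style assumptions, not Merlin's exact implementation:

```python
import numpy as np

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of img_emb is the positive match for row i of txt_emb; every
    other row in the batch serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(l):
        # the correct pairing is the diagonal: row i matches column i
        exp = np.exp(l - l.max(axis=1, keepdims=True))
        probs = exp / exp.sum(axis=1, keepdims=True)
        return -np.log(np.diag(probs)).mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Note that every other pair in the batch acts as a negative in this loss, which is why the batch composition the paper investigates directly shapes the learned representations.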
SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment
arXiv:2604.13630v1 Announce Type: cross Abstract: The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire […]
A Study of Failure Modes in Two-Stage Human-Object Interaction Detection
arXiv:2604.13448v1 Announce Type: cross Abstract: Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex […]
Monthly Diffusion v0.9: A Latent Diffusion Model for the First AI-MIP
arXiv:2604.13481v1 Announce Type: cross Abstract: Here, we describe Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9), a climate emulator that leverages a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture to model the evolution of low-frequency internal atmospheric variability using latent diffusion. MDv0.9 was designed to forward-step at monthly mean timesteps in […]
Beyond Uniform Sampling: Synergistic Active Learning and Input Denoising for Robust Neural Operators
arXiv:2604.13316v1 Announce Type: cross Abstract: Neural operators have emerged as fast surrogate models for physics simulations, yet they remain acutely vulnerable to adversarial perturbations, a critical liability for safety-critical digital twin deployments. We present a synergistic defense that combines active learning-based data generation with an input denoising architecture. The active learning component adaptively probes model […]
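As a toy illustration of the two components the abstract names, the sketch below pairs a random-perturbation robustness probe (a cheap stand-in for the paper's adaptive active-learning attack) with a moving-average input denoiser. The function names and the 1-D setting are illustrative assumptions:

```python
import numpy as np

def robustness_probe(surrogate, x, eps=1e-2, n_probes=32, seed=0):
    """Worst observed output change of a surrogate under small random
    input perturbations of norm eps (a crude stand-in for a targeted
    adversarial attack)."""
    rng = np.random.default_rng(seed)
    base = surrogate(x)
    worst = 0.0
    for _ in range(n_probes):
        d = rng.normal(size=x.shape)
        d *= eps / np.linalg.norm(d)  # project onto the eps-sphere
        worst = max(worst, np.linalg.norm(surrogate(x + d) - base))
    return worst

def denoise(x, k=3):
    """Moving-average input denoiser (illustrative front end)."""
    kernel = np.ones(k) / k
    return np.convolve(x, kernel, mode="same")
```

Composing the denoiser in front of a surrogate shrinks the worst-case output change, which is the qualitative effect such a defense targets.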
Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence
arXiv:2604.13414v1 Announce Type: cross Abstract: Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of […]
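The degradation the abstract describes can be illustrated with a small simulation. The correlation mechanism below (a shared per-sample component) is a hypothetical stand-in for Markov dependence, not the paper's model:

```python
import numpy as np

def majority_vote_accuracy(correct_matrix):
    """Accuracy of a majority-vote ensemble.

    correct_matrix: (n_samples, n_voters) boolean array, True where that
    base learner is correct on that sample. Ties count as errors.
    """
    votes = correct_matrix.sum(axis=1)
    return (votes > correct_matrix.shape[1] / 2).mean()

def simulate(n_samples=20000, n_voters=11, p=0.7, rho=0.0, seed=0):
    """Each base learner is correct w.p. p; rho controls how often voters
    share a common per-sample outcome instead of an independent one."""
    rng = np.random.default_rng(seed)
    shared = rng.random((n_samples, 1)) < p           # common component
    private = rng.random((n_samples, n_voters)) < p   # independent component
    use_shared = rng.random((n_samples, n_voters)) < rho
    correct = np.where(use_shared, shared, private)
    return majority_vote_accuracy(correct)
```

With independent voters the ensemble beats any single learner; as the correlation grows, the majority vote collapses toward the accuracy of one base learner, the classical-guarantee degradation the paper quantifies under Markov dependence.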
Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection
arXiv:2604.13899v1 Announce Type: cross Abstract: Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on […]
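A minimal pool-based active-learning loop with least-confidence sampling can make the paper's first question concrete: the `oracle` callable is the swap point between a human annotator and an LLM labeller. All function and parameter names here are illustrative, not from the paper:

```python
import numpy as np

def uncertainty_sampling_loop(pool_X, oracle, train_fn, predict_proba_fn,
                              n_rounds=5, batch_size=10, seed=0):
    """Pool-based active learning with least-confidence sampling.

    oracle(indices) -> labels: the annotator, human or LLM.
    train_fn(X, y) -> model; predict_proba_fn(model, X) -> (n, n_classes).
    """
    rng = np.random.default_rng(seed)
    labelled = list(rng.choice(len(pool_X), size=batch_size, replace=False))
    y = {i: label for i, label in zip(labelled, oracle(labelled))}
    for _ in range(n_rounds):
        model = train_fn(pool_X[labelled],
                         np.array([y[i] for i in labelled]))
        confidence = predict_proba_fn(model, pool_X).max(axis=1)
        confidence[labelled] = np.inf          # never re-query labelled points
        query = np.argsort(confidence)[:batch_size]  # least confident first
        for i in query:
            y[int(i)] = oracle([int(i)])[0]
            labelled.append(int(i))
    return model, labelled
```

Swapping `oracle` from a human-backed to an LLM-backed labelling function leaves the loop itself unchanged, which is what makes the substitution the paper studies straightforward to evaluate.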
Think in Sentences: Explicit Sentence Boundaries Enhance Language Model’s Capabilities
arXiv:2604.10135v2 Announce Type: replace-cross Abstract: Researchers have explored different ways to improve the capabilities of large language models (LLMs) via dummy token insertion in contexts. However, existing works focus solely on the dummy tokens themselves and fail to leverage the inherent sentence-level structure of natural language. This is a critical oversight, as LLMs acquire linguistic capabilities through […]
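The sentence-level structure the abstract points to can be made explicit with a simple preprocessing step. The `<sent>` token name and the regex-based splitter below are illustrative assumptions, not the paper's actual tokenization scheme:

```python
import re

def insert_sentence_boundaries(text, boundary_token="<sent>"):
    """Insert an explicit boundary token between sentences.

    Naive splitter: a sentence ends at '.', '!', or '?' followed by
    whitespace. Real pipelines would use the model's tokenizer and a
    proper sentence segmenter.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return f" {boundary_token} ".join(sentences)
```

Unlike uniformly inserted dummy tokens, these markers land at linguistically meaningful positions, which is the structural signal the paper argues prior work leaves unexploited.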