arXiv:2605.27470v1 Announce Type: cross Abstract: Graph anomaly detection aims to identify anomalous nodes in attributed graphs and plays an important role in real-world applications. However, existing graph anomaly detection methods still face two key challenges: 1) fixed pipelines, which restrict their adaptability across different graph tasks under limited supervision; 2) weak evidence, which prevents them […]
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
arXiv:2605.28775v1 Announce Type: cross Abstract: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the […]
AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models
arXiv:2602.18481v2 Announce Type: replace-cross Abstract: The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge evaluation toward interactive trading simulations. However, existing frameworks for evaluating real-time trading largely overlook a critical failure mode: the severe behavioral instability of LLMs in sequential decision-making under financial uncertainty. […]
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
arXiv:2605.15250v2 Announce Type: replace-cross Abstract: Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path – an absorbed MQA form – which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor […]
Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks
arXiv:2605.27595v1 Announce Type: cross Abstract: Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations: outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in […]
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
arXiv:2509.21128v2 Announce Type: replace Abstract: Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning […]
LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
arXiv:2605.27570v1 Announce Type: new Abstract: Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations […]
Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking
arXiv:2605.27914v1 Announce Type: cross Abstract: Subjective evaluation of LLM behavior (empathy, restraint, calibrated emotional tone) is hard. Human inter-rater agreement on such qualities saturates near $\rho \sim 0.45$, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target’s training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus […]
UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind
arXiv:2605.27721v1 Announce Type: cross Abstract: Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user’s perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user’s mental […]
FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
arXiv:2603.02702v3 Announce Type: replace Abstract: The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which […]
Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning
arXiv:2604.24938v3 Announce Type: replace-cross Abstract: Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer redundancy as an inherent structural property of pretrained networks, emphasizing importance criteria and search algorithms to identify removable layers. In this study, we empirically investigate depth pruning from a functional perspective. […]
Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis
arXiv:2605.12929v2 Announce Type: replace-cross Abstract: Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into a […]