arXiv:2601.03089v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore […]
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
arXiv:2605.26256v1 Announce Type: new Abstract: Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instruction or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized […]
GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training
arXiv:2602.02518v2 Announce Type: replace-cross Abstract: Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We […]
Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
arXiv:2605.26252v1 Announce Type: new Abstract: Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory […]
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
arXiv:2603.04639v3 Announce Type: replace-cross Abstract: Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. […]
Can LLMs Introspect? A Reality Check
arXiv:2605.26242v1 Announce Type: new Abstract: Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish […]
Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)
arXiv:2605.27268v1 Announce Type: cross Abstract: Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric […]
Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability
arXiv:2603.11394v3 Announce Type: replace-cross Abstract: Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the “stick-or-switch” […]
Post-training makes large language models less human-like
arXiv:2605.07632v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training — the stage that […]
Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect
arXiv:2605.05248v4 Announce Type: replace-cross Abstract: AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows, self-improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from code representation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that in governed intelligent systems, […]
It’s Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty
arXiv:2605.27288v1 Announce Type: cross Abstract: Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model’s epistemic uncertainty at inference time. In this paper, […]
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
arXiv:2605.26955v1 Announce Type: cross Abstract: As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw […]