arXiv:2604.08064v1 Announce Type: new Abstract: Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory, where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit […]
Comparative Evaluation of Embedding Representations for Financial News Sentiment Analysis
arXiv:2512.13749v2 Announce Type: replace-cross Abstract: Financial sentiment analysis enhances market understanding. However, standard Natural Language Processing (NLP) approaches encounter significant challenges when applied to small datasets. This study presents a comparative evaluation of embedding-based techniques for financial news sentiment classification in resource-constrained environments. Word2Vec, GloVe, and sentence transformer representations are evaluated in combination with gradient […]
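The averaged-embedding baseline this abstract compares (Word2Vec/GloVe style document vectors feeding a downstream classifier) can be illustrated with a toy sketch. The vocabulary, the 2-d vectors, and the nearest-centroid rule below are illustrative stand-ins, not the paper's actual setup; real experiments would load pretrained vectors and train a proper classifier on top.

```python
# Toy sketch: represent a headline as the average of its word vectors,
# then classify by the nearest class centroid. A stand-in for the
# embedding-plus-classifier pipelines compared in the abstract.

import math

# Hypothetical 2-d "embeddings": dim 0 ~ positivity, dim 1 ~ finance-ness.
EMBED = {
    "profit": (0.9, 0.8), "surge": (0.8, 0.6), "beat": (0.7, 0.5),
    "loss": (-0.9, 0.8), "plunge": (-0.8, 0.6), "miss": (-0.7, 0.5),
    "the": (0.0, 0.0), "shares": (0.1, 0.9),
}

def doc_vector(tokens):
    """Average the vectors of in-vocabulary tokens (the Word2Vec/GloVe baseline)."""
    vecs = [EMBED[t] for t in tokens if t in EMBED]
    if not vecs:
        return (0.0, 0.0)
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def classify(tokens, centroids):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    v = doc_vector(tokens)
    return min(centroids, key=lambda lbl: math.dist(v, centroids[lbl]))

# Centroids from a few labelled examples per class stand in for a trained model.
centroids = {
    "positive": doc_vector(["profit", "surge", "beat"]),
    "negative": doc_vector(["loss", "plunge", "miss"]),
}

print(classify("shares surge as profit beat".split(), centroids))  # -> positive
print(classify("shares plunge on loss".split(), centroids))        # -> negative
```

The point of the comparison in the paper is which representation (static word vectors vs. sentence transformers) yields document vectors that separate classes well when labelled data is scarce; the classifier on top is secondary.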
Let the Agent Steer: Closed-Loop Ranking Optimization via Influence Exchange
arXiv:2603.27765v3 Announce Type: replace Abstract: Recommendation ranking is fundamentally an influence allocation problem: a sorting formula distributes ranking influence among competing factors, and the business outcome depends on finding the optimal “exchange rates” among them. However, offline proxy metrics systematically misjudge how influence reallocation translates to online impact, with asymmetric bias across metrics that a […]
One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems
arXiv:2505.11548v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) have shown improved performance in generating accurate responses. However, the dependence on external knowledge bases introduces potential security vulnerabilities, particularly when these knowledge bases are publicly accessible and modifiable. While previous studies have exposed knowledge poisoning risks in RAG systems, existing […]
A systematic framework for generating novel experimental hypotheses from language models
arXiv:2408.05086v3 Announce Type: replace-cross Abstract: Neural language models (LMs) have been shown to capture complex linguistic patterns, yet their utility in understanding human language and, more broadly, human cognition remains debated. While existing work in this area often evaluates human-machine alignment, few studies attempt to translate findings from this enterprise into novel insights about humans. […]
WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
arXiv:2601.21872v2 Announce Type: replace Abstract: Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web […]
Tractable Uncertainty-Aware Meta-Learning
arXiv:2210.01881v2 Announce Type: replace-cross Abstract: Meta-learning is a popular approach for learning new tasks with limited data by leveraging the commonalities among different tasks. However, meta-learned models can perform poorly when context data is too limited, or when data is drawn from an out-of-distribution (OoD) task. Especially in safety-critical settings, this necessitates an uncertainty-aware approach […]
Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding
arXiv:2503.10183v4 Announce Type: replace-cross Abstract: Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by contrastively reducing language biases or amplifying the weights of visual embedding during decoding. However, these […]
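The decoding-time contrast this abstract alludes to can be sketched numerically: the next-token logits conditioned on the image are contrasted against logits computed without (or with a distorted) image, so tokens favoured only by the language prior are down-weighted. The contrastive formula and all logit values below are illustrative assumptions, not the paper's own method.

```python
# Toy sketch of contrastive decoding for VLMs: subtract a scaled copy of
# the image-free logits from the image-conditioned logits, penalising
# tokens the language prior alone would produce.

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_logits(with_image, without_image, alpha=1.0):
    """(1 + alpha) * l_img - alpha * l_no_img, a common contrastive form."""
    return [(1 + alpha) * a - alpha * b for a, b in zip(with_image, without_image)]

vocab = ["dog", "cat", "frisbee"]
l_img   = [1.8, 2.0, 0.5]  # language bias still tilts the grounded model toward "cat"
l_noimg = [0.5, 2.2, 0.5]  # image-free logits: the prior strongly favours "cat"

adjusted = contrastive_logits(l_img, l_noimg)   # [3.1, 1.8, 0.5]
probs = softmax(adjusted)
print(vocab[max(range(len(probs)), key=probs.__getitem__)])  # -> dog
```

Plain argmax over the image-conditioned logits would emit the hallucinated "cat"; the contrast flips the choice to the grounded "dog". The methods surveyed in this abstract differ mainly in what the contrast branch sees and how the adjustment is scaled.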
“I Said Things I Needed to Hear Myself”: Peer Support as an Emotional, Organisational, and Sociotechnical Practice in Singapore
arXiv:2506.09362v2 Announce Type: replace-cross Abstract: Peer support plays a vital role in expanding access to mental health care by providing empathetic, community-based support outside formal clinical systems. As digital platforms increasingly mediate such support, the design and impact of these technologies remain under-examined, particularly in Asian contexts. This paper presents findings from an interview study […]
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
arXiv:2510.15494v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) can generate code, but can they generate fast code for complex, real-world software systems? In this study, we investigate this question using a dataset of 65 tasks mined from performance-critical open-source Java projects. Unlike prior studies, which focused on algorithmic puzzles, we conduct experiments on actual […]
M-ArtAgent: Evidence-Based Multimodal Agent for Implicit Art Influence Discovery
arXiv:2604.07468v1 Announce Type: new Abstract: Implicit artistic influence, although visually plausible, is often undocumented and thus poses a historically constrained attribution problem: resemblance is necessary but not sufficient evidence. Most prior systems reduce influence discovery to embedding similarity or label-driven graph completion, while recent multimodal large language models (LLMs) remain vulnerable to temporal inconsistency and […]
Munkres’ General Topology Autoformalized in Isabelle/HOL
arXiv:2604.07455v1 Announce Type: new Abstract: We describe an experiment in LLM-assisted autoformalization that produced over 85,000 lines of Isabelle/HOL code covering all 39 sections of Munkres’ Topology (general topology, Chapters 2–8), from topological spaces through dimension theory. The LLM-based coding agents (initially ChatGPT 5.2, later Claude Opus 4.6) completed this work in 24 active days. […]