arXiv:2601.05500v5 Announce Type: replace Abstract: Benchmarking the capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is not just limited to human preferences, but is also consequential even in safety critical domains such as medicine where […]
Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
arXiv:2604.01007v2 Announce Type: replace Abstract: AI agents increasingly operate over extended time horizons, yet their ability to retain, organize, and recall multimodal experiences remains a critical bottleneck. Building effective lifelong memory requires navigating a vast design space spanning architecture, retrieval strategies, prompt engineering, and data pipelines; this space is too large and interconnected for manual […]
TRACE: Transparent Web Reliability Assessment with Contextual Explanations
arXiv:2506.12072v4 Announce Type: replace-cross Abstract: In an era of AI-generated misinformation flooding the web, existing tools struggle to empower users with nuanced, transparent assessments of content credibility. They often default to binary (true/false) classifications without contextual justifications, leaving users vulnerable to disinformation. We address this gap by introducing TRACE: Transparent Reliability Assessment with Contextual Explanations, […]
SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
arXiv:2510.16219v3 Announce Type: replace-cross Abstract: Malicious agents pose significant threats to the reliability and decision-making capabilities of Multi-Agent Systems (MAS) powered by Large Language Models (LLMs). Existing defenses often fall short due to reactive designs or centralized architectures which may introduce single points of failure. To address these challenges, we propose SentinelNet, the first decentralized […]
Towards Faithful Reasoning in Comics for Small MLLMs
arXiv:2601.02991v2 Announce Type: replace-cross Abstract: Comic understanding presents a significant challenge for Multimodal Large Language Models (MLLMs), as the intended meaning of a comic often emerges from the joint interpretation of visual, textual, and social cues. This naturally motivates Chain-of-Thought (CoT) prompting, since explicit intermediate reasoning appears promising for integrating such heterogeneous signals. However, existing […]
Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation
arXiv:2603.17717v2 Announce Type: replace-cross Abstract: Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component […]
Learning to Play Blackjack: A Curriculum Learning Perspective
arXiv:2604.00076v2 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) agents often struggle with efficiency and performance in complex environments. We propose a novel framework that uses a Large Language Model (LLM) to dynamically generate a curriculum over available actions, enabling the agent to incorporate each action individually. We apply this framework to the game of Blackjack, […]
Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
arXiv:2604.01941v1 Announce Type: cross Abstract: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model’s ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, […]
SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
arXiv:2604.01993v1 Announce Type: cross Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: […]
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
arXiv:2604.02091v1 Announce Type: cross Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often […]
Universal Hypernetworks for Arbitrary Models
arXiv:2604.02215v1 Announce Type: cross Abstract: Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the emphUniversal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator […]
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
arXiv:2604.02322v1 Announce Type: cross Abstract: Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training […]