Rethinking the Rank Threshold for LoRA Fine-Tuning

arXiv:2605.03724v1 Announce Type: cross Abstract: A recent landscape analysis of LoRA fine-tuning in the neural tangent kernel regime establishes a sufficient condition $r(r+1)/2 > KN$ on the LoRA rank $r$ for the absence of spurious local minima under squared-error loss, prescribing $r \geq 12$ on canonical few-shot RoBERTa setups. The condition is stated for general […]
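The sufficient condition $r(r+1)/2 > KN$ can be inverted to find the smallest rank it admits. Below is a minimal sketch. Because the truncated abstract does not define $K$ and $N$, the hypothetical function `min_lora_rank` takes their product as a single integer `kn`. For reference, $r = 12$ is the smallest rank satisfying the inequality exactly when $66 \le KN \le 77$, because $11 \cdot 12/2 = 66$ and $12 \cdot 13/2 = 78$.

```python
from math import isqrt


def min_lora_rank(kn: int) -> int:
    """Smallest integer r with r*(r+1)/2 > kn.

    Hypothetical helper: `kn` stands for the product K*N from the
    sufficient condition; K and N themselves are not modeled here.
    """
    if kn < 0:
        raise ValueError("kn must be non-negative")
    # Largest r with r*(r+1)/2 <= kn, using the equivalence
    # r*(r+1) <= 2*kn  <=>  (2r+1)^2 <= 8*kn + 1.
    r_max = (isqrt(8 * kn + 1) - 1) // 2
    # The next integer is the first rank that strictly exceeds kn.
    return r_max + 1


if __name__ == "__main__":
    for kn in (66, 77, 78):
        print(kn, min_lora_rank(kn))
```

For example, `min_lora_rank(66)` and `min_lora_rank(77)` both return 12, while `min_lora_rank(78)` returns 13. The integer square root keeps the computation exact for large products, where a floating-point `sqrt` could round across the boundary.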

The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

arXiv:2605.03936v1 Announce Type: cross Abstract: Conceptual analysis — proposing definitions and refining them through counterexamples — is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition, another repairs the definition, and the process repeats. Across 20 […]
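The analysis-and-repair chain described above alternates two roles until no counterexample remains. The following is a minimal control-flow sketch under stated assumptions: the two model instances are replaced by plain callables, and the names `run_repair_chain`, `generate_counterexample`, `repair`, and `max_rounds` are hypothetical. They are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for the two model instances in the chain.
CounterexampleFn = Callable[[str], Optional[str]]  # definition -> counterexample, or None if none found
RepairFn = Callable[[str, str], str]                # (definition, counterexample) -> revised definition


@dataclass
class ChainResult:
    final_definition: str
    # Each step records the definition that was attacked and the counterexample raised against it.
    steps: List[Tuple[str, str]] = field(default_factory=list)
    converged: bool = False  # True if the generator stopped producing counterexamples


def run_repair_chain(
    definition: str,
    generate_counterexample: CounterexampleFn,
    repair: RepairFn,
    max_rounds: int = 10,
) -> ChainResult:
    """Alternate counterexample generation and repair until no counterexample
    is found or the round budget is exhausted."""
    result = ChainResult(final_definition=definition)
    for _ in range(max_rounds):
        counterexample = generate_counterexample(result.final_definition)
        if counterexample is None:
            result.converged = True
            break
        result.steps.append((result.final_definition, counterexample))
        result.final_definition = repair(result.final_definition, counterexample)
    return result


if __name__ == "__main__":
    # Toy roles: the "generator" gives up once the definition has two refinements.
    gen = lambda d: None if d.count("+") >= 2 else "case"
    rep = lambda d, c: d + "+"
    print(run_repair_chain("D", gen, rep))
```

With the toy roles shown, the chain performs two repairs and ends with `final_definition == "D++"` and `converged` set to `True`. The round budget guards against chains that never settle. Note that hitting the budget leaves `converged` as `False` even if the last repaired definition would have survived another attack.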

Closed-Loop Vision-Language Planning for Multi-Agent Coordination

arXiv:2502.10148v3 Announce Type: replace Abstract: Cooperative multi-agent reinforcement learning (MARL) struggles with sample efficiency, interpretability, and generalization. While Large Language Models (LLMs) offer powerful planning capabilities, their application has been hampered by a reliance on text-only inputs and a failure to handle the non-Markovian, partially observable nature of multi-agent tasks. We introduce COMPASS, a multi-agent […]

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

arXiv:2604.14709v3 Announce Type: replace Abstract: Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived […]

Adaptive Long-term Embedding with Denoising and Augmentation for Recommendation

arXiv:2504.13614v2 Announce Type: replace-cross Abstract: The rapid growth of the internet has made personalized recommendation systems indispensable. Graph-based sequential recommendation systems, powered by Graph Neural Networks (GNNs), effectively capture complex user-item interactions but often face challenges such as noise and static representations. In this paper, we introduce the Adaptive Long-term Embedding with Denoising and Augmentation […]

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

arXiv:2604.07039v2 Announce Type: replace-cross Abstract: Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should […]

Benchmarking Single-Pose Docking, Consensus Rescoring, and Supervised ML on the LIT-PCBA Library: A Critical Evaluation of DiffDock, AutoDock-GPU, GNINA, and DiffDock-NMDN

arXiv:2605.01681v2 Announce Type: replace-cross Abstract: Virtual screening performance depends heavily on the chosen docking and scoring methods. Recent AI-based tools such as DiffDock and NMDN have reported strong benchmark results, but their practical utility on realistic, experimentally derived datasets remains unclear. Here we perform a large-scale evaluation on the LIT-PCBA library (15 targets, 578,295 ligand-target pairs […]

ProgramBench: Can Language Models Rebuild Programs From Scratch?

arXiv:2605.03546v1 Announce Type: cross Abstract: Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited […]

Multi-Agent Strategic Games with LLMs

arXiv:2605.03604v1 Announce Type: cross Abstract: This paper asks whether large language models (LLMs) can be used to study the strategic foundations of conflict and cooperation. I introduce LLMs as experimental subjects in a repeated security dilemma and evaluate whether they reproduce canonical mechanisms from international relations theory. The baseline game is extended along three theoretically […]

SERE: Structural Example Retrieval for Enhancing LLMs in Event Causality Identification

arXiv:2605.03701v1 Announce Type: cross Abstract: Event Causality Identification (ECI) requires models to determine whether a given pair of events in a context exhibits a causal relationship. While Large Language Models (LLMs) have demonstrated strong performance across various NLP tasks, their effectiveness in ECI remains limited due to biases in causal reasoning, often leading to overprediction […]

RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

arXiv:2605.03821v1 Announce Type: cross Abstract: Existing robot video world models are typically trained with low-level objectives such as reconstruction and perceptual similarity, which are poorly aligned with the capabilities that matter most for robot decision making, including instruction following, manipulation success, and physical plausibility. They also suffer from error accumulation in long-horizon autoregressive prediction. We […]

Atomic Fact-Checking Increases Clinician Trust in Large Language Model Recommendations for Oncology Decision Support: A Randomized Controlled Trial

arXiv:2605.03916v1 Announce Type: cross Abstract: Question: Does atomic fact-checking, which decomposes AI treatment recommendations into individually verifiable claims linked to source guideline documents, increase clinician trust compared to traditional explainability approaches? Findings: In this randomized trial of 356 clinicians generating 7,476 trust ratings, atomic fact-checking produced a large effect on trust (Cohen’s d = 0.94), […]
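Cohen's d, the effect size reported above, is a standardized mean difference: the difference in group means divided by a pooled standard deviation. The following is a minimal sketch of the textbook two-group version. It is an assumption that this matches the trial's analysis: with 7,476 ratings from 356 clinicians, the authors may have used a different estimator, such as one that accounts for repeated ratings per clinician. The function name `cohens_d` is hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance uses the n-1 (sample) denominator.
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Toy ratings: means 4 and 3, pooled SD 2, so d = 0.5.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

By Cohen's conventional benchmarks, values around 0.8 and above are read as large effects, which is consistent with the abstract describing $d = 0.94$ as a large effect.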

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844