MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

arXiv:2512.12634v3 Announce Type: replace Abstract: Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely on either single-path offline benchmarks or online live benchmarks. Offline benchmarks using […]

AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

arXiv:2605.12845v1 Announce Type: cross Abstract: Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking shape complexities and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal […]

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

arXiv:2605.12857v1 Announce Type: cross Abstract: Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors’ air-gapped security requirements, and cannot be trained on vendors’ proprietary RTL codebases, leaving valuable internal data unused. Recent […]

Does the motor cortex draw on a wire plane?

arXiv:2603.03337v2 Announce Type: replace Abstract: The two-thirds power law of human motor control ($v \propto \kappa^{-1/3}$) is geometrically equivalent to constant equi-affine speed. In classical differential geometry, however, the equi-affine metric is not a tensor: it depends on acceleration, which does not transform covariantly under arbitrary coordinate changes. To recover tensorial behavior, one must either […]
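
The equivalence stated in this abstract can be checked numerically. The sketch below is not taken from the paper; it assumes the standard planar definition of equi-affine speed, $|\dot{x}\ddot{y} - \dot{y}\ddot{x}|^{1/3} = v\,\kappa^{1/3}$, and uses a hypothetical test curve (an ellipse traced at a uniform parameter rate). The function names and the semi-axes are illustrative choices only.

```python
import math

def speed_and_curvature(a: float, b: float, t: float) -> tuple[float, float]:
    """Speed v and curvature kappa of the ellipse (a cos t, b sin t) at parameter t."""
    # First and second derivatives with respect to t (analytic, no finite differences).
    dx, dy = -a * math.sin(t), b * math.cos(t)
    ddx, ddy = -a * math.cos(t), -b * math.sin(t)
    v = math.hypot(dx, dy)
    # Planar curvature: |x' y'' - y' x''| / v^3.
    kappa = abs(dx * ddy - dy * ddx) / v**3
    return v, kappa

def equi_affine_speed(a: float, b: float, t: float) -> float:
    """Equi-affine speed v * kappa^(1/3), assuming the standard planar definition."""
    v, kappa = speed_and_curvature(a, b, t)
    return v * kappa ** (1.0 / 3.0)

if __name__ == "__main__":
    a, b = 3.0, 1.0  # hypothetical semi-axes
    for t in (0.0, 0.7, 1.5, 2.9):
        v, kappa = speed_and_curvature(a, b, t)
        print(f"t={t:.1f}  v={v:.4f}  kappa={kappa:.4f}  "
              f"v*kappa^(1/3)={equi_affine_speed(a, b, t):.6f}")
```

For this curve, the speed and curvature both vary along the path, but the product $v\,\kappa^{1/3}$ stays at $(ab)^{1/3}$, which is the same as saying $v \propto \kappa^{-1/3}$.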

RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems

arXiv:2605.12895v1 Announce Type: cross Abstract: Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through […]

Position: Agentic AI System Is a Foreseeable Pathway to AGI

arXiv:2605.12966v1 Announce Type: new Abstract: Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast […]

When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

arXiv:2605.12947v1 Announce Type: cross Abstract: LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This raises a statistical challenge because deployment-time evaluator scores are adaptively generated and repeatedly monitored, yet the likelihood models […]

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

arXiv:2605.12975v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries […]

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

arXiv:2605.11679v2 Announce Type: replace Abstract: In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses […]

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

arXiv:2605.13010v1 Announce Type: cross Abstract: We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small […]

Useful Memories Become Faulty When Continuously Updated by LLMs

arXiv:2605.12978v1 Announce Type: new Abstract: Learning from past experience benefits from two complementary forms of memory: episodic traces — raw trajectories of what happened — and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it […]


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK, registration number 16808844.