arXiv:2511.18454v4 Announce Type: replace-cross Abstract: Assessing embryo fragmentation is crucial for predicting IVF success, yet manual grading is prone to subjectivity, and existing AI models struggle with clinical interpretability and segmentation errors. We propose AttnRegDeepLab, a Multi-Task Learning (MTL) framework designed to solve these challenges. The model enhances a DeepLabV3+ decoder with Attention Gates to […]
How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
arXiv:2602.00056v4 Announce Type: replace-cross Abstract: Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through […]
Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
arXiv:2604.01039v2 Announce Type: replace-cross Abstract: System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk […]
Margin-Adaptive Confidence Ranking for Reliable LLM Judgement
arXiv:2605.15416v2 Announce Type: replace-cross Abstract: Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model’s estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the […]
Post-Trained MoE Can Skip Half Experts via Self-Distillation
arXiv:2605.18643v2 Announce Type: replace-cross Abstract: Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such […]
Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning
arXiv:2606.03328v2 Announce Type: replace-cross Abstract: Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy. We ask whether this conclusion survives once calibration impact is evaluated separately across distinct capability dimensions […]
Governance Controls for AI-Generated Test Artifacts in Autonomous Software Testing
arXiv:2606.08806v1 Announce Type: cross Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) are increasingly used in autonomous software testing; however, AI-generated test artifacts often suffer from hallucinations, compliance violations, security risks, and limited explainability. To enhance the reliability, transparency, and trustworthiness of AI-generated testing artifacts, this research introduces the concept of Governance-Aware Autonomous Testing […]
PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus
arXiv:2606.08938v1 Announce Type: cross Abstract: Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose textbfPACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm […]
CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon
arXiv:2606.09175v1 Announce Type: cross Abstract: Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server […]
PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments
arXiv:2606.09368v1 Announce Type: cross Abstract: Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development […]
SecureClaw: Clawing Back Control of LLM Agents
arXiv:2606.09549v1 Announce Type: cross Abstract: Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both […]
End-to-End Context Compression at Scale
arXiv:2606.09659v1 Announce Type: cross Abstract: Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to […]