arXiv:2603.17041v1 Announce Type: cross Abstract: Recent advances in generative AI have led to increasingly realistic synthetic data, yet evaluation criteria remain focused on marginal distribution matching. While these diagnostics assess local realism, they provide limited insight into whether a generative model preserves the multivariate dependence structures governing downstream inference. We introduce covariance-level dependence fidelity as […]
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
arXiv:2603.17476v1 Announce Type: cross Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for […]
EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
arXiv:2603.16947v1 Announce Type: cross Abstract: Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for […]
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
arXiv:2603.17357v1 Announce Type: cross Abstract: Computer use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark […]
SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization
arXiv:2603.17219v1 Announce Type: cross Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., […]
Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation
arXiv:2603.17295v1 Announce Type: cross Abstract: Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive […]
Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare
arXiv:2603.17419v1 Announce Type: cross Abstract: Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that these agents exhibit critical vulnerabilities in realistic settings: unauthorized compliance with non-owner instructions, sensitive information disclosure, identity spoofing, […]
Adversarial attacks against Modern Vision-Language Models
arXiv:2603.16960v1 Announce Type: cross Abstract: We study adversarial robustness of open-source vision-language model (VLM) agents deployed in a self-contained e-commerce environment built to simulate realistic pre-deployment conditions. We evaluate two agents, LLaVA-v1.5-7B and Qwen2.5-VL-7B, under three gradient-based attacks: the Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and a CLIP-based spectral attack. Against LLaVA, all […]
Empirical Recipes for Efficient and Compact Vision-Language Models
arXiv:2603.16987v1 Announce Type: cross Abstract: Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based […]
Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts
arXiv:2603.17070v1 Announce Type: cross Abstract: In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around […]
Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents
arXiv:2603.17150v1 Announce Type: cross Abstract: Agentic AI systems can now generate code with remarkable fluency, but a fundamental question remains: does the generated code actually do what the user intended? The gap between informal natural language requirements and precise program behavior (the intent gap) has always plagued software engineering, but AI-generated code amplifies […]
A scalable neural bundle map for multiphysics prediction in lithium-ion battery across varying configurations
arXiv:2603.17209v1 Announce Type: cross Abstract: Efficient and accurate prediction of multiphysics evolution across diverse cell geometries is fundamental to the design, management and safety of lithium-ion batteries. However, existing computational frameworks struggle to capture the coupled electrochemical, thermal, and mechanical dynamics across diverse cell geometries and varying operating conditions. Here, we present a Neural Bundle […]