arXiv:2603.17067v1 Announce Type: cross Abstract: Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex […]
Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
arXiv:2603.17111v1 Announce Type: cross Abstract: Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and […]
Towards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems
arXiv:2603.17176v1 Announce Type: cross Abstract: Retrieval augmented generation systems have become an integral part of everyday life. Whether in internet search engines, email systems, or service chatbots, these systems are based on context retrieval and answer generation with large language models. As they spread, their security vulnerabilities increase as well. Attackers are becoming increasingly focused on these […]
Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection
arXiv:2603.16970v1 Announce Type: cross Abstract: Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual […]
Dependence Fidelity and Downstream Inference Stability in Generative Models
arXiv:2603.17041v1 Announce Type: cross Abstract: Recent advances in generative AI have led to increasingly realistic synthetic data, yet evaluation criteria remain focused on marginal distribution matching. While these diagnostics assess local realism, they provide limited insight into whether a generative model preserves the multivariate dependence structures governing downstream inference. We introduce covariance-level dependence fidelity as […]
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
arXiv:2603.17476v1 Announce Type: cross Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for […]
EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
arXiv:2603.16947v1 Announce Type: cross Abstract: Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for […]
WebPII: Benchmarking Visual PII Detection for Computer-Use Agents
arXiv:2603.17357v1 Announce Type: cross Abstract: Computer-use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark […]
SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization
arXiv:2603.17219v1 Announce Type: cross Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., […]
Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation
arXiv:2603.17295v1 Announce Type: cross Abstract: Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive […]
Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare
arXiv:2603.17419v1 Announce Type: cross Abstract: Autonomous AI agents powered by large language models are being deployed in production with capabilities including shell execution, file system access, database queries, and multi-party communication. Recent red teaming research demonstrates that these agents exhibit critical vulnerabilities in realistic settings: unauthorized compliance with non-owner instructions, sensitive information disclosure, identity spoofing, […]
Adversarial attacks against Modern Vision-Language Models
arXiv:2603.16960v1 Announce Type: cross Abstract: We study adversarial robustness of open-source vision-language model (VLM) agents deployed in a self-contained e-commerce environment built to simulate realistic pre-deployment conditions. We evaluate two agents, LLaVA-v1.5-7B and Qwen2.5-VL-7B, under three gradient-based attacks: the Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and a CLIP-based spectral attack. Against LLaVA, all […]