arXiv:2605.27178v1 Announce Type: cross Abstract: We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel […]
EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
arXiv:2605.27332v1 Announce Type: cross Abstract: Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in the conversion of these flowcharts into machine-readable models for RE activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we […]
PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic
arXiv:2511.20586v4 Announce Type: replace Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), […]
Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation
arXiv:2602.11799v2 Announce Type: replace Abstract: Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data […]
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
arXiv:2605.06213v2 Announce Type: replace Abstract: Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under […]
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
arXiv:2605.24785v2 Announce Type: replace Abstract: Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena […]
Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks
arXiv:2506.03627v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic […]
Searching the Internet for Challenging Benchmarks at Scale
arXiv:2509.26619v3 Announce Type: replace-cross Abstract: Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses — and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to […]
The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP
arXiv:2605.26415v1 Announce Type: cross Abstract: Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We […]
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
arXiv:2605.13779v2 Announce Type: replace-cross Abstract: We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model […]
E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
arXiv:2605.26316v1 Announce Type: cross Abstract: Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others’ actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and […]
Personalized Generative Models for Contextual Debiasing
arXiv:2605.26353v1 Announce Type: cross Abstract: Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on […]