FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

arXiv:2605.27178v1 Announce Type: cross Abstract: We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel […]

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

arXiv:2605.27332v1 Announce Type: cross Abstract: Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in the conversion of these flowcharts into machine-readable models for RE activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we […]

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

arXiv:2511.20586v4 Announce Type: replace Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), […]

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

arXiv:2602.11799v2 Announce Type: replace Abstract: Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data […]

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

arXiv:2605.24785v2 Announce Type: replace Abstract: Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena […]

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

arXiv:2506.03627v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic […]

Searching the Internet for Challenging Benchmarks at Scale

arXiv:2509.26619v3 Announce Type: replace-cross Abstract: Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses — and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to […]

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

arXiv:2605.26415v1 Announce Type: cross Abstract: Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We […]

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

arXiv:2605.13779v2 Announce Type: replace-cross Abstract: We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model […]

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

arXiv:2605.26316v1 Announce Type: cross Abstract: Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others’ actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and […]

Personalized Generative Models for Contextual Debiasing

arXiv:2605.26353v1 Announce Type: cross Abstract: Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on […]

