TransFIRA: Transfer Learning for Face Image Recognizability Assessment

arXiv:2510.06353v2 Announce Type: replace-cross Abstract: Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, […]

Route-Induced Density and Stability (RIDE): Controlled Intervention and Mechanism Analysis of Routing-Style Meta Prompts on LLM Internal States

arXiv:2603.29206v1 Announce Type: new Abstract: Routing is widely used to scale large language models, from Mixture-of-Experts gating to multi-model/tool selection. A common belief is that routing to a task “expert” activates sparser internal computation and thus yields more certain and stable outputs (the Sparsity–Certainty Hypothesis). We test this belief by injecting routing-style meta prompts as […]

PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

arXiv:2603.29281v1 Announce Type: cross Abstract: A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation […]

A Systematic Framework for Enterprise Knowledge Retrieval: Leveraging LLM-Generated Metadata to Enhance RAG Systems

arXiv:2512.05411v2 Announce Type: replace-cross Abstract: In enterprise settings, efficiently retrieving relevant information from large and complex knowledge bases is essential for operational productivity and informed decision-making. This research presents a systematic empirical framework for metadata enrichment using large language models (LLMs) to enhance document retrieval in Retrieval-Augmented Generation (RAG) systems. Our approach employs a structured […]

Self-Improving Code Generation via Semantic Entropy and Behavioral Consensus

arXiv:2603.29292v1 Announce Type: cross Abstract: Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test units. However, in real-world scenarios, it is much harder to obtain reference solutions and test oracles than […]
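The truncated abstract does not spell out the method, but the "behavioral consensus" idea it names is a known self-supervision pattern: execute candidate programs on shared inputs, cluster them by output behavior, and treat the largest agreeing cluster as the trusted signal. A minimal sketch under that assumption (plain Python callables stand in for sandboxed LLM-generated code):

```python
from collections import defaultdict

def behavioral_consensus(candidates, inputs):
    """Group candidate implementations by their outputs on shared
    inputs and return one member of the largest agreement cluster."""
    clusters = defaultdict(list)
    for fn in candidates:
        signature = tuple(fn(x) for x in inputs)  # behavioral fingerprint
        clusters[signature].append(fn)
    majority = max(clusters.values(), key=len)
    return majority[0]

# Three candidates for "square a number": two agree, one is buggy.
cands = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
best = behavioral_consensus(cands, inputs=[2, 3, 5])
print(best(4))  # prints 16: the consensus cluster squares its input
```

No reference solution or test oracle is needed: agreement among independently sampled candidates substitutes for ground truth, which is exactly the resource constraint the abstract motivates.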

Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems

arXiv:2603.29211v1 Announce Type: new Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B […]

Scaling Whole-Body Human Musculoskeletal Behavior Emulation for Specificity and Diversity

arXiv:2603.29332v1 Announce Type: cross Abstract: The embodied learning of human motor control requires whole-body neuro-actuated musculoskeletal dynamics, while the internal muscle-driven processes underlying movement remain inaccessible to direct measurement. Computational modeling offers an alternative, but inverse dynamics methods struggle to resolve redundant control from observed kinematics in the high-dimensional, over-actuated system. Forward imitation approaches based […]

How to Train Your Long-Context Visual Document Model

arXiv:2602.15257v2 Announce Type: replace-cross Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We […]

Extend3D: Town-Scale 3D Generation

arXiv:2603.29387v1 Announce Type: cross Abstract: In this paper, we propose Extend3D, a training-free pipeline for 3D scene generation from a single image, built upon an object-centric 3D generative model. To overcome the limitations of fixed-size latent spaces in object-centric models for representing wide scenes, we extend the latent space in the $x$ and $y$ directions. […]

Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

arXiv:2603.29231v1 Announce Type: new Abstract: Existing benchmarks measure capability — whether a model succeeds on a single attempt — but production deployments require reliability — consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to […]

AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models

arXiv:2603.29410v1 Announce Type: cross Abstract: Pre-trained vision-language models (VLMs) exhibit strong zero-shot generalization but remain vulnerable to adversarial perturbations. Existing classification-guided adversarial fine-tuning methods often disrupt pre-trained cross-modal alignment, weakening visual-textual correspondence and degrading zero-shot performance. In this paper, we propose an Alignment-Guided Fine-Tuning (AGFT) framework that enhances zero-shot adversarial robustness while preserving the cross-modal […]

Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

arXiv:2603.18782v3 Announce Type: replace-cross Abstract: Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit […]


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK; registration number 16808844.