DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving

arXiv:2506.00819v2 Announce Type: replace-cross Abstract: End-to-end autonomous driving systems map sensor data directly to control commands, but remain opaque, lack interpretability, and offer no formal safety guarantees. While recent vision-language-guided reinforcement learning (RL) methods introduce semantic feedback, they often rely on static prompts and fixed objectives, limiting adaptability to dynamic driving scenes. We present DriveMind, […]

UniPrompt-CL: Sustainable Continual Learning in Medical AI with Unified Prompt Pools

arXiv:2508.10954v2 Announce Type: replace-cross Abstract: Modern AI models are typically trained on static datasets, limiting their ability to continuously adapt to rapidly evolving real-world environments. While continual learning (CL) addresses this limitation, most CL methods are designed for natural images and often underperform or fail to transfer to medical data due to domain bias, institutional […]

When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

arXiv:2510.15346v2 Announce Type: replace-cross Abstract: Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models’ next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form […]
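The excerpt mentions aggregating models' next-token probability distributions to select the next token. A minimal sketch of that idea, assuming a simple weighted average of distributions and argmax selection (the paper's own token-level selection rule is not shown in the excerpt; function and variable names here are illustrative):

```python
def ensemble_next_token(dists, weights=None):
    """Average next-token probability distributions from several models
    and pick the highest-probability token. A toy sketch: the actual
    ensembling and stopping criteria in the paper may differ."""
    n = len(dists)
    if weights is None:
        weights = [1.0 / n] * n          # uniform model weights by default
    vocab = len(dists[0])
    # Weighted average over models, per vocabulary entry.
    avg = [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(vocab)]
    best = max(range(vocab), key=lambda i: avg[i])
    return best, avg

# Toy 4-token vocabulary, two models:
p_a = [0.10, 0.60, 0.20, 0.10]
p_b = [0.05, 0.35, 0.50, 0.10]
token, avg = ensemble_next_token([p_a, p_b])   # token 1 wins on the average
```

Averaging favors tokens that both models rate reasonably highly, which is the complementarity the abstract alludes to.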

Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

arXiv:2511.18507v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) deployed on devices must adapt to continuously changing visual scenarios such as variations in background and perspective, to effectively perform complex visual tasks. To investigate catastrophic forgetting under real-world scenario shifts, we construct a multimodal visual understanding dataset (MSVQA), covering four distinct scenarios and perspectives: […]

FCMBench: The First Large-scale Financial Credit Multimodal Benchmark for Real-world Applications

arXiv:2601.00150v3 Announce Type: replace-cross Abstract: FCMBench is the first large-scale and privacy-compliant multimodal benchmark for real-world financial credit applications, covering tasks and robustness challenges drawn from domain-specific workflows and constraints. The current version of FCMBench covers 26 certificate types, with 5,198 privacy-compliant images and 13,806 paired VQA samples. It evaluates models on Perception and Reasoning […]

CCMamba: Topologically-Informed Selective State-Space Networks on Combinatorial Complexes for Higher-Order Graph Learning

arXiv:2601.20518v2 Announce Type: replace-cross Abstract: Topological deep learning has emerged as a powerful paradigm for modeling higher-order relational structures beyond the pairwise interactions that standard graph neural networks fail to capture. While combinatorial complexes (CCs) offer a unified topological foundation for higher-order graph learning, existing topological deep learning methods rely heavily on local message passing […]

Variation-aware Flexible 3D Gaussian Editing

arXiv:2602.11638v3 Announce Type: replace-cross Abstract: Indirect editing methods for 3D Gaussian Splatting (3DGS) have recently witnessed significant advancements. These approaches operate by first applying edits in the rendered 2D space and subsequently projecting the modifications back into 3D. However, this paradigm inevitably introduces cross-view inconsistencies and constrains both the flexibility and efficiency of the editing […]

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

arXiv:2603.05772v2 Announce Type: replace-cross Abstract: Currently, open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, as their structure and weights are made public, they are exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted […]

One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

arXiv:2603.11545v2 Announce Type: replace-cross Abstract: We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than […]
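The excerpt describes a Supervisor that decomposes a query and delegates subtasks to modality-appropriate tools. A minimal sketch of that dispatch step, assuming subtasks are already tagged with a modality (the tool names and the fallback behavior here are hypothetical; the paper's adaptive routing strategies are not shown in the excerpt):

```python
def route_query(subtasks, tools):
    """Dispatch each (modality, payload) subtask to its registered tool
    and collect results. A toy supervisor: no adaptive re-routing."""
    results = []
    for modality, payload in subtasks:
        handler = tools.get(modality)
        if handler is None:
            # No tool registered for this modality; record the gap.
            results.append((modality, "no tool available"))
        else:
            results.append((modality, handler(payload)))
    return results

# Hypothetical tools for two of the five modalities:
tools = {
    "text":  lambda s: s.upper(),
    "image": lambda path: f"objects detected in {path}",
}
out = route_query([("text", "hello"), ("audio", "clip.wav")], tools)
```

A real supervisor would also choose decomposition order and synthesize the per-tool results into one answer; this sketch covers only the delegation table.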

How animal movement influences wildlife-vehicle collision risk: a mathematical framework for range-resident species

arXiv:2507.17058v2 Announce Type: replace Abstract: Wildlife-vehicle collisions (WVC) threaten both biodiversity and human safety worldwide. Despite empirical efforts to characterize the major determinants of WVC risk and optimize mitigation strategies, we still lack a theoretical framework linking traffic, landscape, and individual movement features to collision risk. Here, we introduce such a framework by leveraging recent […]

DART: Input-Difficulty-AwaRe Adaptive Threshold for Early-Exit DNNs

arXiv:2603.12269v1 Announce Type: cross Abstract: Early-exit deep neural networks enable adaptive inference by terminating computation when sufficient confidence is achieved, reducing cost for edge AI accelerators in resource-constrained settings. Existing methods, however, rely on suboptimal exit policies, ignore input difficulty, and optimize thresholds independently. This paper introduces DART (Input-Difficulty-Aware Adaptive Threshold), a framework that overcomes […]
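The excerpt describes early-exit networks that stop computing once sufficient confidence is reached at an intermediate head. A minimal sketch of that policy, assuming max-softmax confidence and one threshold per exit head (the confidence measure and names are illustrative; DART's actual difficulty-aware threshold selection is not shown in the excerpt):

```python
import math

def early_exit_predict(logits_per_exit, thresholds):
    """Walk the exit heads in depth order; return the prediction from the
    first head whose max-softmax confidence clears its threshold.
    Falls through to the final head otherwise. A toy sketch."""
    for depth, (logits, tau) in enumerate(zip(logits_per_exit, thresholds)):
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]       # stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        conf = max(probs)
        if conf >= tau:
            return probs.index(conf), depth, conf      # exit early here
    return probs.index(conf), depth, conf             # last head as fallback

# Two exit heads over 3 classes; the first is too uncertain, the second exits:
pred, depth, conf = early_exit_predict(
    [[0.2, 0.3, 0.1], [0.1, 3.0, 0.2]],
    thresholds=[0.9, 0.6],
)
```

Fixed per-exit thresholds like these are exactly what the abstract criticizes as suboptimal when set independently of input difficulty.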

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

arXiv:2602.07801v3 Announce Type: replace-cross Abstract: In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and […]


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK; registration number 16808844.