A Progressive Visual-Logic-Aligned Framework for Ride-Hailing Adjudication

arXiv:2603.17328v1 Announce Type: new Abstract: The efficient adjudication of responsibility disputes is pivotal for maintaining marketplace fairness. However, the exponential surge in ride-hailing volume renders manual review intractable, while conventional automated methods lack the reasoning transparency required for quasi-judicial decisions. Although Multimodal LLMs offer a promising paradigm, they fundamentally struggle to bridge the gap between […]

Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

arXiv:2603.17651v1 Announce Type: cross Abstract: Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires […]

MOBODY: Model Based Off-Dynamics Offline Reinforcement Learning

arXiv:2506.08460v3 Announce Type: replace-cross Abstract: We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize […]
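The two existing strategies the abstract names can be sketched concretely. Below is a minimal, hypothetical illustration (not MOBODY's method): `penalized_reward` subtracts an estimated dynamics-shift term from source-domain rewards, while `filter_transitions` instead discards source transitions whose shift exceeds a threshold. The function names, the `beta` coefficient, and the scalar `shift` estimates are all assumptions for illustration.

```python
import numpy as np

def penalized_reward(r, shift, beta=1.0):
    """Down-weight source-domain rewards by an estimated dynamics shift
    (e.g. a divergence between source and target transition models)."""
    return r - beta * shift

def filter_transitions(transitions, shift, threshold=0.5):
    """Alternative strategy: discard source transitions whose estimated
    dynamics shift exceeds a threshold."""
    keep = shift <= threshold
    return transitions[keep]

# Toy batch: three source transitions with per-transition shift estimates.
r = np.array([1.0, 0.5, 2.0])
shift = np.array([0.1, 0.9, 0.3])
r_pen = penalized_reward(r, shift)            # [0.9, -0.4, 1.7]

transitions = np.array([[0, 1], [1, 2], [2, 3]])
kept = filter_transitions(transitions, shift)  # drops the high-shift row
```

As the abstract notes, both strategies either distort the optimized objective or throw data away, which is the gap the paper targets.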

VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection

arXiv:2603.17470v1 Announce Type: cross Abstract: Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting […]

Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

arXiv:2603.17655v1 Announce Type: cross Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, […]

Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

arXiv:2603.04427v3 Announce Type: replace-cross Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic […]
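The selection/value-transfer split described above can be sketched in a few lines. This is an illustrative single-head implementation under my own assumptions (the names `thin_key_attention` and `d_sel`, and the specific sizes, are not from the paper): queries and keys are projected to a small selection dimension `d_sel`, while values keep the full model width, so the per-token KV-cache key shrinks from `d_model` to `d_sel` floats.

```python
import numpy as np

def thin_key_attention(x, Wq, Wk, Wv, Wo):
    """Attention with low-dimensional queries/keys (selection) and
    full-dimensional values (value transfer). Illustrative sketch."""
    q = x @ Wq                      # (T, d_sel)  small selection space
    k = x @ Wk                      # (T, d_sel)  cached keys are thin
    v = x @ Wv                      # (T, d_model) full-width values
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return (w @ v) @ Wo

rng = np.random.default_rng(0)
T, d_model, d_sel = 8, 64, 8        # d_sel chosen small, per the O(log N) claim
x  = rng.normal(size=(T, d_model))
Wq = rng.normal(size=(d_model, d_sel))
Wk = rng.normal(size=(d_model, d_sel))
Wv = rng.normal(size=(d_model, d_model))
Wo = rng.normal(size=(d_model, d_model))
out = thin_key_attention(x, Wq, Wk, Wv, Wo)
```

Only the key half of the KV cache shrinks in this sketch; values stay full-width, which matches the "thin keys, full values" framing of the title.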

VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

arXiv:2603.16289v2 Announce Type: replace-cross Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. However, existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and neglect of the native visual information of web pages in the reasoning […]

EngGPT2: Sovereign, Efficient and Open Intelligence

arXiv:2603.16430v2 Announce Type: replace-cross Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group’s Italian LLM, built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens – less than Qwen3’s 36T or Llama3’s 15T – and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable […]

FINER: MLLMs Hallucinate under Fine-grained Negative Queries

arXiv:2603.17662v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and “what” questions. […]

QuantFL: Sustainable Federated Learning for Edge IoT via Pre-Trained Model Quantisation

arXiv:2603.17507v1 Announce Type: cross Abstract: Federated Learning (FL) enables privacy-preserving intelligence on Internet of Things (IoT) devices but incurs a significant carbon footprint due to the high energy cost of frequent uplink transmission. While pre-trained models are increasingly available on edge devices, their potential to reduce the energy overhead of fine-tuning remains underexplored. In this […]
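One standard way to cut the uplink cost the abstract describes is to quantise model updates before transmission. The sketch below is a generic symmetric int8 quantiser, not necessarily QuantFL's scheme (the function names and the one-scale-per-tensor design are my assumptions): each float32 update tensor is sent as int8 codes plus a single float scale, roughly a 4x reduction in bytes on the wire.

```python
import numpy as np

def quantize_update(update, bits=8):
    """Symmetric uniform quantisation of a model update before uplink:
    float32 values -> low-bit integer codes plus one float scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(update)) / qmax
    codes = np.round(update / scale).astype(np.int8)
    return codes, scale

def dequantize_update(codes, scale):
    """Server-side reconstruction of the update."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
update = rng.normal(scale=0.01, size=1024).astype(np.float32)
codes, scale = quantize_update(update)
restored = dequantize_update(codes, scale)

bytes_before = update.nbytes        # 4096 bytes of float32
bytes_after = codes.nbytes + 4      # 1024 int8 codes + one float32 scale
```

The per-element reconstruction error is bounded by half the scale, which is the usual accuracy/energy trade-off such schemes navigate.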

DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping

arXiv:2603.16806v2 Announce Type: replace-cross Abstract: To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, […]

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

arXiv:2603.17368v1 Announce Type: new Abstract: Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT) reasoning, but recent studies have shown that these enhanced reasoning capabilities come at the expense of significantly degraded safety. In this paper, we reveal that LRMs’ safety degradation occurs only after CoT is enabled, and this degradation is not observed when […]


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK; registration number 16808844.