Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

arXiv:2412.14613v3 Announce Type: replace-cross Abstract: Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task, such as image captioning. While the overall evaluation is essential for any task, the criteria […]

iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

arXiv:2603.02748v2 Announce Type: replace-cross Abstract: Despite the success of Large Vision–Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual representations are utilized in an invariant manner across different textual tasks. This rigidity hinders fine-grained reasoning where task-specific visual cues are critical. To address this […]

From Semantic To Instance: A Semi-Self-Supervised Learning Approach

arXiv:2506.16563v2 Announce Type: replace-cross Abstract: Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, extensive effort is required to create large-scale datasets with pixel-level annotations of each object instance for developing instance segmentation models, which restricts the use of deep learning in these areas. This challenge is […]

Speed3R: Sparse Feed-forward 3D Reconstruction Models

arXiv:2603.08055v1 Announce Type: cross Abstract: While recent feed-forward 3D reconstruction models accelerate 3D reconstruction by jointly inferring dense geometry and camera poses in a single pass, their reliance on dense attention imposes a quadratic complexity, creating a prohibitive computational bottleneck that severely limits inference speed. To resolve this, we introduce Speed3R, an end-to-end trainable model […]

The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs

arXiv:2510.17057v2 Announce Type: replace-cross Abstract: Chain-of-Thought (CoT) monitoring has emerged as a compelling method for detecting harmful behaviors such as reward hacking for reasoning models, under the assumption that models’ reasoning processes are informative of such behaviors. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned […]

Slurry-as-a-Service: A Modest Proposal on Scalable Pluralistic Alignment for Nutrient Optimization

arXiv:2603.02420v2 Announce Type: replace-cross Abstract: Pluralistic alignment has emerged as a promising approach for ensuring that large language models (LLMs) faithfully represent the diversity, nuance, and conflict inherent in human values. In this work, we study a high-stakes deployment context – mulching – where automated systems transform selected individuals into nutrient-rich slurry for the dual […]

ReDepth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

arXiv:2512.17908v2 Announce Type: replace-cross Abstract: Monocular depth estimation remains challenging, as foundation models such as Depth Anything V2 (DA-V2) struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing foundation models with the powerful priors of large-scale 2D diffusion […]

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

arXiv:2603.08034v1 Announce Type: cross Abstract: Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring […]

Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

arXiv:2512.10416v3 Announce Type: replace-cross Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ […]

Relay transitions and invasion thresholds in multi-strain rumor models: a chemical reaction network approach

arXiv:2603.01186v2 Announce Type: replace-cross Abstract: The historical quest for unifying the concepts and methods of Chemical Reaction Networks theory (CRNT), Mathematical Epidemiology (ME) and ecology has received increased attention in recent years and has led in particular to the development of the symbolic package EpidCRN, for automatic analysis of positive ODEs, which implements tools […]

Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion

arXiv:2602.15895v2 Announce Type: replace-cross Abstract: Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human […]

GCGNet: Graph-Consistent Generative Network for Time Series Forecasting with Exogenous Variables

arXiv:2603.08032v1 Announce Type: cross Abstract: Exogenous variables offer valuable supplementary information for predicting future endogenous variables. Forecasting with exogenous variables needs to consider both past-to-future dependencies (i.e., temporal correlations) and the influence of exogenous variables on endogenous variables (i.e., channel correlations). This is pivotal when future exogenous variables are available, because they may directly affect […]

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844