Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning

arXiv:2512.20629v1 Announce Type: cross Abstract: This study proposes a multi-agent language framework that enables continual strategy evolution without fine-tuning the language model’s parameters. The core

LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

arXiv:2512.21243v1 Announce Type: cross Abstract: Methods that use Large Language Models (LLM) as planners for embodied instruction following tasks have become widespread. To successfully complete

Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

arXiv:2512.15249v2 Announce Type: replace-cross Abstract: Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLM), often exhibit intersectional biases where models are systematically less confident

One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

arXiv:2512.20957v1 Announce Type: cross Abstract: Locating the files and functions requiring modification in large open-source software (OSS) repositories is challenging due to their scale and

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

arXiv:2512.16378v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

December 24, 2025

arXiv:2512.15649v2 Announce Type: replace-cross
Abstract: The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model’s ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit surprisingly poor long-context understanding with VTC-processed information, failing to capture long-range associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
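To make the 3x-20x figure concrete, the token compression ratio can be read as the number of text tokens divided by the number of vision tokens that stand in for them. The sketch below is only an illustration under that assumption: the function names and the 128,000-token context are hypothetical and do not come from the paper or from DeepSeek-OCR or Glyph.

```python
import math


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token (assumed definition)."""
    if text_tokens < 0 or vision_tokens <= 0:
        raise ValueError("text_tokens must be >= 0 and vision_tokens > 0")
    return text_tokens / vision_tokens


def vision_token_budget(text_tokens: int, ratio: float) -> int:
    """Vision tokens needed to cover text_tokens at a given ratio, rounded up."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.ceil(text_tokens / ratio)


if __name__ == "__main__":
    # Hypothetical context length, chosen only for illustration.
    context = 128_000
    for ratio in (3, 20):  # the two ends of the range quoted in the abstract
        print(f"{ratio}x -> {vision_token_budget(context, ratio)} vision tokens")
```

At the upper end of the quoted range the visual input is much shorter than the text it encodes, which is the density the benchmark is designed to probe.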


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK. Registration number 16808844.