arXiv:2606.09064v1 Announce Type: cross
Abstract: Recent advances in Video Large Language Models (Video-LLMs) have enabled strong performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and to Think Deeper by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric, visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models of the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.
Within-person modeling of postprandial glucose using multimodal wearable data
The widespread adoption of continuous glucose monitoring (CGM) and wearable sensing technologies has enabled large-scale collection of high-resolution physiological and behavioral data in real-world settings.
