arXiv:2510.23907v1 Announce Type: cross
Abstract: Scene-level captioning in instructional videos can enhance learning, since it requires understanding both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. Captions that fail to capture this structure, however, may lack coherence and quality, creating confusion and undermining the video’s educational intent. To address this gap, we introduce DynaStride, a pipeline that generates coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset’s scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context against redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both n-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.
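
The abstract describes the dynamic stride window selection algorithm only at a high level. The Python sketch below illustrates one plausible form such a selector could take: scoring candidate strides over per-frame embeddings by trading temporal coverage against redundancy. The candidate stride set, the cosine-similarity redundancy measure, and the redundancy_weight parameter are assumptions for illustration, not details from the paper.

import numpy as np

def select_stride(frame_embeddings: np.ndarray,
                  candidate_strides=(1, 2, 4, 8),
                  redundancy_weight: float = 0.5) -> int:
    """Pick the stride whose sampled window best balances temporal
    coverage against redundancy between consecutive sampled frames.
    frame_embeddings: (num_frames, dim) array, one embedding per frame.
    """
    best_stride, best_score = candidate_strides[0], -np.inf
    for s in candidate_strides:
        window = frame_embeddings[::s]
        if len(window) < 2:
            continue
        # Redundancy: mean cosine similarity of adjacent sampled frames.
        a, b = window[:-1], window[1:]
        sims = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
        redundancy = sims.mean()
        # Coverage: fraction of the scene's frames retained at this stride.
        coverage = len(window) / len(frame_embeddings)
        score = coverage - redundancy_weight * redundancy
        if score > best_score:
            best_stride, best_score = s, score
    return best_stride

Under this scoring, a small stride is preferred when consecutive frames are visually distinct (transitions worth keeping), while a larger stride wins when adjacent frames are near-duplicates, which matches the stated goal of balancing temporal context and redundancy.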
The Hidden Power of Normalization: Exponential Capacity Control in Deep Neural Networks
arXiv:2511.00958v1 Announce Type: cross
Abstract: Normalization methods are fundamental components of modern deep neural networks (DNNs). Empirically, they are known to stabilize optimization dynamics and
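
As a concrete reference for the normalization methods this abstract discusses, here is a minimal Python sketch of layer normalization, one widely used such method. The epsilon constant and the learned affine parameters gamma and beta follow common convention and are not specific to this paper.

import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize each sample over its feature dimension, then apply a
    learned affine transform (gamma, beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

Keeping activations at a fixed scale in this way is the standard intuition for why normalization stabilizes optimization dynamics, the empirical observation the abstract starts from.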