arXiv:2601.17868v2 Announce Type: replace-cross
Abstract: Current Video Large Language Models (Video LLMs) typically encode frames with a vision encoder and rely on an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm faces a dual efficiency bottleneck: strictly unidirectional attention hinders global spatiotemporal aggregation and thus understanding, while serial token-by-token decoding limits generation throughput. To address this, we propose VidLaDA, a Video LLM built on Diffusion Language Models (DLMs) that uses bidirectional attention for comprehensive spatiotemporal modeling and decodes tokens in parallel. To further reduce the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous refreshing of the visual cache with frame-wise chunk attention. Experiments show that VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over a 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
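For intuition, below is a minimal sketch of the two mechanisms the abstract attributes to MARS-Cache: asynchronous refreshing of cached visual states across diffusion steps, and a frame-wise chunk attention mask. This is not the released VidLaDA code; every name here (build_frame_chunk_mask, MarsCacheSketch, refresh_every) is hypothetical, and the behavior is inferred only from the abstract's description.

```python
import torch

def build_frame_chunk_mask(num_frames: int, tokens_per_frame: int,
                           num_text: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Frame-wise chunk attention: visual tokens attend bidirectionally
    only within their own frame chunk, while text tokens attend to
    everything, matching the bidirectional DLM setup in the abstract.
    """
    v = num_frames * tokens_per_frame
    n = v + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    for f in range(num_frames):
        s, e = f * tokens_per_frame, (f + 1) * tokens_per_frame
        mask[s:e, s:e] = True   # intra-frame visual attention only
    mask[v:, :] = True          # text rows attend to all tokens
    mask[:v, v:] = True         # visual rows still see the text tokens
    return mask

class MarsCacheSketch:
    """Asynchronous visual cache refreshing (assumed schedule).

    Recomputes the expensive visual states only every `refresh_every`
    diffusion steps and reuses the cached copy in between.
    """
    def __init__(self, refresh_every: int = 4):
        self.refresh_every = refresh_every
        self.cached_visual = None

    def visual_states(self, step: int, compute_fn):
        if self.cached_visual is None or step % self.refresh_every == 0:
            self.cached_visual = compute_fn()  # scheduled refresh
        return self.cached_visual              # otherwise reuse stale cache
```

Under these assumptions, the savings come from two sources: the chunk mask shrinks attention over visual tokens from quadratic in the full video length to quadratic per frame, and the refresh schedule amortizes the visual-state computation over several parallel diffusion decoding steps.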


