VidLaDA: Diffusion-Based Video LLM
- VidLaDA is a video large language model that uses bidirectional diffusion and full-sequence denoising to overcome causal masking biases and ensure global spatiotemporal context.
- It employs the MARS-Cache framework with adaptive anchor token selection and asynchronous cache refresh to reduce computational cost by over 12× while maintaining accuracy.
- VidLaDA demonstrates state-of-the-art performance in multi-task and long-video benchmarks and provides open access to its code and model weights for further research.
VidLaDA is a video large language model (Video LLM) that applies bidirectional diffusion-based language modeling to complex video understanding tasks. Distinguished from traditional autoregressive approaches, VidLaDA implements full-sequence denoising with unrestricted bidirectional attention, and achieves scalable inference via the MARS-Cache framework. This paradigm addresses causal masking biases prevalent in prior models, resulting in more comprehensive spatiotemporal modeling and significantly enhanced decoding efficiency. VidLaDA demonstrates state-of-the-art performance in multi-task and long-video evaluation scenarios, with its code and model weights openly available for research (He et al., 25 Jan 2026).
1. Motivation and Theoretical Underpinnings
Autoregressive (AR) Video LLMs generate tokens sequentially under a strict causal mask, where the loss is given by

$$\mathcal{L}_{\mathrm{AR}} = -\sum_{i=1}^{N} \log p_\theta\!\left(y_i \mid y_{<i},\, V,\, T\right),$$

with $y_{<i}$ representing previously generated response tokens, $V$ the encoded video, and $T$ the text prompt. This causal structure enforces $A_{ij} = 0$ for $j > i$ within the attention matrix, meaning each token can only attend to itself and its predecessors. Such “causal masking biases” manifest in two key phenomena:
- Visibility Imbalance: Early tokens act as attention sinks, receiving disproportionate focus, which impedes uniform contextual integration.
- Info-Max Gap: Future video-encoded features are inaccessible during decoding, limiting the mutual information $I(h_i; V)$ between the latent states $h_i$ and the encoded video $V$.
VidLaDA addresses these shortcomings by adopting a masked bidirectional diffusion LLM (DLM) in which all tokens are denoised in parallel and any token can attend to any other. The model replaces causal attention with full attention,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

without applying any sequence-wise masking. This formulation ensures uniform visibility and enables global modeling over all space-time positions in the input video plus the textual prompt.
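The contrast between the two attention regimes can be illustrated with a minimal NumPy sketch (illustrative only, not the model's implementation): under a causal mask the first token attends only to itself, while bidirectional attention gives every token full visibility.

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention. A causal mask enforces
    A_ij = 0 for j > i; without it, attention is bidirectional."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (N, N) attention logits
    if causal:
        # Mask out the upper triangle (future positions) before softmax.
        future = np.triu(np.ones_like(scores), k=1) == 1
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 6, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

_, w_causal = attention(Q, K, V, causal=True)
_, w_bidir = attention(Q, K, V, causal=False)

# Causal: token 0 sees only itself; bidirectional: uniform visibility.
assert np.allclose(w_causal[0, 1:], 0.0)
assert (w_bidir > 0).all()
```

This also makes the “visibility imbalance” concrete: under the causal mask, early tokens appear in every later token's context, while late tokens appear in almost none.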
2. Diffusion Model Formulation
VidLaDA builds on discrete masked diffusion modeling applied to token sequences encompassing both video and language modalities. For each diffusion timestep $t \in (0, 1]$:
- Forward process (masking strategy): the $i$th position in the target response $y$ is independently replaced by the $[\mathrm{MASK}]$ token with probability $t$,

$$y_t^{(i)} = \begin{cases} [\mathrm{MASK}] & \text{with probability } t, \\ y^{(i)} & \text{with probability } 1 - t, \end{cases}$$

yielding masked input sequences $y_t$.
- Reverse process (denoising): The network predicts marginalized distributions $p_\theta\!\left(y^{(i)} \mid y_t, V, T\right)$ for masked positions, iteratively reconstructing the full sequence.
- Objective: The training loss is

$$\mathcal{L} = -\,\mathbb{E}_{t,\, y_t}\!\left[\frac{1}{t} \sum_{i} m_i \log p_\theta\!\left(y^{(i)} \mid y_t, V, T\right)\right],$$

where $m_i = \mathbf{1}\!\left[y_t^{(i)} = [\mathrm{MASK}]\right]$ is the mask indicator.
The bidirectional mechanism extends naturally to both continuous (video frame features) and discrete (language tokens) spaces, but actual implementation relies on token-based discrete diffusion.
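A minimal NumPy sketch of the masked-diffusion training step described above (a schematic under stated assumptions, not the authors' code; `MASK_ID` and the uniform stand-in denoiser are hypothetical):

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the special [MASK] token
rng = np.random.default_rng(0)

def forward_mask(y, t):
    """Forward process: independently replace each response token
    with [MASK] with probability t (the diffusion timestep)."""
    m = rng.random(len(y)) < t          # mask indicator m_i
    y_t = np.where(m, MASK_ID, y)
    return y_t, m

def diffusion_loss(log_probs, y, m, t):
    """Monte-Carlo estimate of the masked-diffusion objective:
    -(1/t) * sum over masked positions of log p(y_i | y_t)."""
    return -(m * log_probs[np.arange(len(y)), y]).sum() / t

vocab = 10
y = np.array([3, 1, 4, 1, 5, 9])        # toy target response tokens
t = 0.5
y_t, m = forward_mask(y, t)

# Stand-in for the denoiser: uniform predictions over the vocabulary.
log_probs = np.full((len(y), vocab), np.log(1.0 / vocab))
loss = diffusion_loss(log_probs, y, m, t)
assert loss >= 0.0
```

The $1/t$ weighting compensates for the expected number of masked positions, so heavily and lightly masked samples contribute comparably to the gradient.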
3. MARS-Cache: Accelerating Bidirectional Diffusion Decoding
Naïve bidirectional denoising incurs quadratic complexity in the number of tokens, leading to prohibitive inference cost for long video sequences. VidLaDA introduces the MARS-Cache (Mask-based Asynchronous Refresh and Selective Cache) framework, which exploits empirical characteristics of deep transformer behavior in video:
- Frame-wise Chunk Attention: For each video frame $f$, a local neighborhood $\mathcal{N}(f)$ of adjacent frames is defined. Visual tokens are updated by localized attention over this neighborhood, drastically reducing computational cost from $O(N^2)$ over all $N$ tokens to $O(N \cdot |\mathcal{N}|)$.
- Adaptive Anchor Token Selection: Certain tokens (“anchors”) consistently receive global attention. These are identified via proxy attention queries on subsampled sets (e.g., 32 or 128 tokens). Anchors in each frame and layer group are granted global top-$k$ attention, while all other tokens use local chunked attention.
- Asynchronous Cache Refresh: Layers are partitioned into groups, each with a modality-specific refresh interval $\tau$. A pyramidal schedule is applied: shallow layers and textual embeddings refresh more frequently; deeper layers and visual features less so. A group's cache is recomputed only at steps that are multiples of its interval $\tau$; otherwise the prior hidden states are reused.
- Complexity Management: These mechanisms enable >12× speedup over naive DLM bidirectional decoding, without sacrificing accuracy.
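The asynchronous refresh schedule can be sketched in a few lines of Python (an illustrative toy, with hypothetical group intervals, not the released MARS-Cache implementation):

```python
def should_refresh(step, layer_group, intervals):
    """Recompute layer_group's cache only when step is a multiple of
    its refresh interval; otherwise cached hidden states are reused."""
    return step % intervals[layer_group] == 0

# Hypothetical pyramidal schedule: shallow groups refresh every step,
# deeper groups progressively less often.
intervals = {0: 1, 1: 2, 2: 4, 3: 8}

refreshes = {g: sum(should_refresh(s, g, intervals) for s in range(16))
             for g in intervals}
# Over 16 denoising steps: group 0 refreshes all 16 times,
# group 3 only at steps 0 and 8.
assert refreshes[0] == 16 and refreshes[3] == 2
```

Because deep-layer visual features change slowly across denoising steps, skipping most of their recomputations is where the bulk of the reported speedup comes from.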
4. Evaluation: Performance and Benchmarks
VidLaDA’s efficiency and accuracy have been demonstrated on public video understanding and multi-modal reasoning benchmarks. Key findings include:
| Model | #Params | Frames | Video-MMMU | LongVideoBench | LVBench | EgoSchema | MVBench | MLVU_dev | MLVU_test | Video-MME |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (AR)* | 7B | 64 | 47.4 | – | 45.3 | 65.0 | 69.6 | 62.8 | 45.3 | 63.9 |
| LLaVA-Video (AR)* | 7B | 64 | 37.1 | 58.2 | 41.5 | 57.3 | 58.6 | 70.8 | 50.4 | 63.7 |
| LLaDA-V (DLM)* | 8B | 32 | 43.3 | 58.6 | 36.4 | 57.9 | 53.1 | 59.4 | 44.1 | 56.4 |
| VidLaDA | 8B | 64 | 46.6 | 61.4 | 44.7 | 64.5 | 59.4 | 69.2 | 53.4 | 64.2 |
MARS-Cache achieves a throughput (TPS, tokens per second) of 33.6 TPS on EgoSchema (compared to 2.7 TPS without cache and 27.0 TPS for AR baselines) with negligible accuracy loss. VidLaDA consistently outperforms or rivals state-of-the-art AR and other DLM-based VLMs on long-video and multi-task benchmarks (He et al., 25 Jan 2026).
5. Ablation Studies and Analysis
Extensive ablations demonstrate the impact of key architectural and algorithmic choices:
- Anchor Token Count: Zero anchors (53% of FLOPs) reduce accuracy to 63.8/40.7 (EgoSchema/MLVU); full attention (100% of FLOPs) attains 65.0/42.6. A pyramidal allocation of anchor counts across layer groups, at 84% of FLOPs, matches or exceeds full-attention accuracy (66.8/44.0).
- Vision/Text Refresh Ratio: A moderate vision-to-text refresh ratio maintains accuracy (65.0/42.3) while increasing throughput to 10.3 TPS; larger ratios degrade accuracy.
- Layer-wise Refresh Schedule: Pyramidal schedules (refresh: 64→32→16→8) increase EgoSchema accuracy to 68.0 (baseline 65.0) and MLVU to 45.3 at 7.0 TPS, outperforming uniform refresh strategies.
- Anchor Query Overhead: Using 128 sampled tokens for anchor discovery balances quality with GPU compute and memory constraints; using all tokens is infeasible due to out-of-memory (OOM) errors.
6. Implementation and Training Regimen
VidLaDA employs an LLaDA-8B LLM backbone and a SigLIP2-SO400M vision transformer, reducing the number of visual tokens via bilinear pooling. Training is performed in three curriculum stages:
- 1.8M “short” video clips (10 s–3 min), 32 frames, max 8K tokens, learning rate progression of [2e-6, 1e-5, 1e-5], updating ViT+MLP+LLM.
- 500K “mid” clips, 64 frames, up to 16K tokens, same learning rate, all modules trained end-to-end.
- 500K “long” videos (2–30 min, with 10% text), ViT frozen, fine-tuning only MLP+LLM at 2e-6.
Optimization uses AdamW, batch size of 64 across 32× NVIDIA H200 GPUs, warmup for 3% of steps, cosine learning rate decay, and gradient clipping. Public code and checkpoints are made available at https://github.com/ziHoHe/VidLaDA (He et al., 25 Jan 2026).
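The three-stage curriculum can be restated as a configuration table for reference (a sketch only: field names are hypothetical, and values not stated above are left as `None` rather than guessed):

```python
# Hypothetical config sketch of the curriculum described above.
CURRICULUM = [
    {"stage": "short", "clips": 1_800_000, "frames": 32,
     "max_tokens": 8_000, "train": ["ViT", "MLP", "LLM"]},
    {"stage": "mid", "clips": 500_000, "frames": 64,
     "max_tokens": 16_000, "train": ["ViT", "MLP", "LLM"]},
    {"stage": "long", "clips": 500_000, "frames": None,
     "max_tokens": None, "lr": 2e-6,
     "train": ["MLP", "LLM"]},  # ViT frozen in the final stage
]
assert CURRICULUM[-1]["train"] == ["MLP", "LLM"]
```

The pattern (short clips first, the vision tower frozen only in the final long-video stage) follows the usual easy-to-hard curriculum for multimodal fine-tuning.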
7. Context and Implications
VidLaDA advances bidirectional diffusion language modeling for video, showing that by eliminating causal-masking constraints, it produces more robust and contextually grounded representations. The MARS-Cache infrastructure allows bidirectional Video LLMs to match or exceed AR models’ throughput and accuracy for extended, information-rich input sequences. This framework provides a principled resolution to inference bottlenecks introduced by full-sequence modeling and establishes new benchmarks for video understanding in multi-modal machine intelligence (He et al., 25 Jan 2026).
A plausible implication is that future Video LLM system designs—especially for tasks requiring rich temporal context or low-latency decoding—may increasingly favor hybrid or diffusion-based architectures supplemented by cache-locality optimizations, as exemplified by the VidLaDA pipeline.