SDAR-VL: Block-wise Diffusion for VLU
- SDAR-VL is a framework designed for large-scale vision-language understanding using block-wise discrete diffusion with parallel denoising and sequential dependency preservation.
- It integrates asynchronous noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum to reduce training cost, improve convergence stability, and enhance task performance.
- Empirical results show that SDAR-VL matches or exceeds conventional autoregressive and global diffusion baselines across diverse benchmarks, with substantial efficiency and stability gains.
SDAR-VL (Stable and Efficient Block-wise Discrete Diffusion for Vision-Language Understanding) is a framework that advances the systematic application of block-wise discrete diffusion to large-scale vision–language understanding (VLU). By integrating asynchronous block-wise noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum, SDAR-VL attains improved training efficiency, convergence stability, and task performance compared to conventional block diffusion, autoregressive (AR), and global diffusion baselines. These innovations address major obstacles to practical block-wise diffusion in VLU, including high training cost, slow convergence, and instability (Cheng et al., 16 Dec 2025).
1. Block-wise Discrete Diffusion: Principles and Mathematical Structure
Traditional AR vision-language decoders sequentially generate tokens with strict left-to-right causality, impeding parallel decoding and full-sequence bidirectional context. In contrast, discrete diffusion models randomly corrupt input sequences via token masking and then learn to denoise all masked positions in parallel. While global diffusion models scale poorly ($O(L^2)$ per denoising step for sequence length $L$) and lack a causal prior for textual structure, block-wise discrete diffusion (BD³) offers a hybrid approach.
A token sequence $x$ of length $L$ is partitioned into $B$ blocks $x^1, \dots, x^B$, each with $L_B$ tokens. The model factorizes the data likelihood as:
$$p_\theta(x) = \prod_{b=1}^{B} p_\theta\!\left(x^b \mid x^{<b}\right)$$
For each block $x^b$, the forward process masks a random fraction $t \in (0,1]$ of its tokens to obtain the corrupted block $\tilde{x}^b_t$. The reverse process, parameterized by a transformer with block-causal self-attention, reconstructs the masked tokens using bidirectional information within the block and full context from the preceding blocks $x^{<b}$.
The negative evidence lower bound (NELBO) per block is:
$$\mathcal{L}_b = \mathbb{E}_{t}\, \mathbb{E}_{\tilde{x}^b_t}\!\left[ \frac{1}{t} \sum_{\ell \in \mathcal{M}^b_t} -\log p_\theta\!\left(x^b_\ell \mid \tilde{x}^b_t,\, x^{<b}\right) \right]$$
Here, $\mathcal{M}^b_t$ denotes the set of masked positions sampled for block $b$ at corruption level $t$ (Cheng et al., 16 Dec 2025).
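To make the objective concrete, the following is a minimal sketch of one Monte-Carlo term of the per-block loss. It assumes a generic block-causal transformer `model` that maps token ids to vocabulary logits and a reserved `MASK_ID`; neither name is taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical reserved mask-token id

def block_nelbo_term(model, x, block_id, block_len, t):
    """One Monte-Carlo term of the per-block NELBO.

    x         : (batch, seq_len) clean token ids
    block_id  : index b of the block being corrupted
    block_len : tokens per block (L_B)
    t         : corruption level in (0, 1] sampled for this block
    """
    start, end = block_id * block_len, (block_id + 1) * block_len
    x_b = x[:, start:end]

    # Forward process: mask each token of block b independently with probability t.
    masked = torch.rand(x_b.shape, device=x.device) < t
    x_corrupt = x.clone()
    x_corrupt[:, start:end] = torch.where(masked, torch.full_like(x_b, MASK_ID), x_b)

    # Reverse process: bidirectional within the block, causal over preceding blocks
    # (the block-causal attention pattern is assumed to live inside `model`).
    logits_b = model(x_corrupt)[:, start:end]            # (batch, L_B, vocab)

    nll = F.cross_entropy(
        logits_b.reshape(-1, logits_b.size(-1)),
        x_b.reshape(-1),
        reduction="none",
    ).reshape(x_b.shape)

    # 1/t weighting over the sampled masked set M_t^b, normalized per token
    # and averaged over the batch.
    return (nll * masked).sum() / (t * block_len * x.size(0))
```

During training, each block would receive its own corruption level $t$, which is exactly where the scheduling components of the next section enter.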
2. Integrated SDAR-VL Training Components
SDAR-VL incorporates three synergistic components to address instability and inefficiency in block-wise diffusion:
- Asynchronous Block-wise Noise Scheduling (ABNS): Each block $b$ in a sequence independently samples a noise level $t_b$ from a (potentially curriculum-controlled) distribution, rather than applying a single $t$ to all blocks. This strategy diversifies corruption levels within each batch, reducing loss variance and stabilizing gradients.
- Effective Mask Ratio Scaling (EMRS): The empirical masked fraction $\hat{t}_b$ may differ from the target $t_b$ due to stochastic masking. EMRS replaces $t_b$ with $\hat{t}_b$ in the loss normalization, yielding an unbiased objective and further smoothing training dynamics.
- Progressive Beta-Distribution Noise Curriculum (PBNC): The masking schedule gradually increases the mean and concentration parameters of the corruption-level Beta distribution during training, improving sample efficiency through higher mask coverage while still exposing the model to low-mask cases late in training to retain corruption diversity (a schematic composition of all three components is sketched after the table below).
The table below summarizes the role of each component:
| Component | Objective | Effect on Training |
|---|---|---|
| ABNS | Diversify block corruption per batch | Reduces gradient variance |
| EMRS | Use actual mask ratio for weighting | Removes bias in loss scaling |
| PBNC | Curriculum over noise schedule | Enhances sample efficiency |
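The sketch below shows one way the three components could compose in a training step; the Beta endpoints, curriculum interpolation, and all function names are illustrative assumptions rather than the authors' code.

```python
import torch

def pbnc_beta_params(step, total_steps, start=(1.0, 1.0), end=(8.0, 2.0)):
    """PBNC (schematic): linearly interpolate the Beta(a, b) parameters so the
    corruption-level distribution drifts toward a higher mean and concentration
    as training progresses; placeholder endpoints, not the paper's schedule."""
    u = min(step / max(total_steps, 1), 1.0)
    a = start[0] + u * (end[0] - start[0])
    b = start[1] + u * (end[1] - start[1])
    return a, b

def abns_sample(num_blocks, step, total_steps):
    """ABNS: each block draws its own corruption level t_b from the
    curriculum-controlled Beta distribution, instead of one shared t."""
    a, b = pbnc_beta_params(step, total_steps)
    return torch.distributions.Beta(a, b).sample((num_blocks,))

def emrs_block_loss(nll_b, mask_b):
    """EMRS: normalize by the empirically realized mask ratio t_hat_b
    (rather than the target t_b), removing bias from stochastic masking.

    nll_b  : (batch, L_B) per-token negative log-likelihoods for block b
    mask_b : (batch, L_B) boolean mask of positions actually corrupted
    """
    t_hat = mask_b.float().mean().clamp(min=1e-6)        # empirical mask ratio
    return (nll_b * mask_b).sum() / (t_hat * mask_b.numel())

# Example: per-block noise levels for an 8-block sequence, 1k steps into training.
t_blocks = abns_sample(num_blocks=8, step=1_000, total_steps=100_000)
```

With this composition, the Beta distribution's drift toward higher mean and concentration realizes the curriculum, while the per-block draws and the $\hat{t}_b$-based normalization supply the variance and bias reductions summarized in the table.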
3. Training Protocol and Convergence Behavior
SDAR-VL is trained with a multi-stage protocol over ≈70 billion tokens, with stages spanning projector-only alignment, multimodal capability acquisition, reasoning, and long chain-of-thought (CoT) distillation. Samples are packed into sequences of 8k–16k tokens to fully utilize GPU memory. Key optimization hyperparameters include the following (an illustrative optimizer sketch follows the list):
- Vision tower learning rate:
- Language tower learning rate:
- Projector learning rate:
- AdamW optimizer with weight decay $0.1$ and linear LR warmup
- Block count of up to $8$, with block length matching SDAR-Chat (Cheng et al., 16 Dec 2025)
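The per-module learning rates listed above can be expressed as AdamW parameter groups. The sketch below is illustrative only: the learning-rate values, warmup length, and attribute names (`vision_tower`, `language_tower`, `projector`) are placeholders, not settings from the paper.

```python
import torch

def build_optimizer(model, lr_vision=1e-6, lr_language=1e-5, lr_projector=1e-4,
                    warmup_steps=1_000):
    """AdamW with per-module learning rates and linear warmup.
    All learning-rate values and the warmup length are placeholders."""
    param_groups = [
        {"params": model.vision_tower.parameters(),   "lr": lr_vision},
        {"params": model.language_tower.parameters(), "lr": lr_language},
        {"params": model.projector.parameters(),      "lr": lr_projector},
    ]
    optimizer = torch.optim.AdamW(param_groups, weight_decay=0.1)
    # Linear warmup from 10% to 100% of each group's base LR.
    scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, total_iters=warmup_steps
    )
    return optimizer, scheduler
```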
Training curves empirically validate the three components: ABNS reduces per-step loss variance; EMRS further accelerates convergence; and PBNC enables high final mask ratios without sacrificing diversity, improving performance by ≈1% absolute. ABNS additionally yields a measurable reduction in gradient variance for the 4B variant.
4. Empirical Results and Ablation Studies
SDAR-VL achieves competitive or superior task performance relative to both AR (LLaVA-OneVision) and global diffusion (LLaDA-V) baselines across an evaluation suite of 21 benchmarks, including single-image, multi-image, video, and math/document tasks.
4.1 Controlled Group Benchmark Averages
| Model | Single-Img Avg | Multi-Img/Video Avg |
|---|---|---|
| LLaDA-V-8B | ≈59.0% | 61.0% |
| LLaVA-OV-7B (AR) | ≈58.0% | 66.8% |
| SDAR-VL-4B | ≈57.6% | 61.7% |
| SDAR-VL-8B | ≈62.5% | 65.0% |
4.2 Chain-of-Thought Distillation
SDAR-VL-Think-8B, after distillation from R1-OneVision, matches or exceeds AR baselines in math-oriented benchmarks, with up to +14% lift on MathVerse.
4.3 Component Ablation
Ablation on the 4B variant shows incremental gains as each training module is added:
| Method | ABNS | EMRS | PBNC (C=50) | Downstream Avg |
|---|---|---|---|---|
| Baseline (SNS) | – | – | – | 36.8 |
| + ABNS | ✓ | – | – | 37.5 (+0.7) |
| + EMRS | ✓ | ✓ | – | 37.8 (+1.0) |
| + PBNC | ✓ | ✓ | ✓ | 40.1 (+3.3) |
In this ablation, PBNC contributes the largest incremental gain (+2.3 points over ABNS+EMRS), while ABNS and EMRS add +0.7 and +0.3 points, respectively (Cheng et al., 16 Dec 2025).
5. Efficiency, Stability, and Limitations
SDAR-VL attains efficiency by reducing per-step complexity from $O(L^2)$ for global diffusion to roughly $O(L_B \cdot L)$ through block partitioning, while preserving AR dependencies between blocks (the attention-mask sketch after the list below illustrates this pattern). ABNS and EMRS jointly reduce training instability and bias, and PBNC balances mask coverage with token-sample diversity. The method still requires multiple denoising steps per block during inference. Potential future enhancements include:
- Reducing block-level inference steps via learned or one-step schedules
- Adapting to dynamic block partitioning for variable-length multimodal data
- Extending block-diffusion to generative tasks (infill, captioning)
- Integrating key-value cache acceleration for further throughput gains
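To illustrate the inter-block causal prior and where the per-step savings come from, the following is a minimal sketch of the block-causal attention mask implied by the factorization in Section 1; it is a schematic reconstruction rather than the paper's implementation, and the cost comparison in the comments assumes standard dense attention.

```python
import torch

def block_causal_mask(seq_len: int, block_len: int) -> torch.Tensor:
    """Boolean mask where position i may attend to position j iff j lies in the
    same block as i (bidirectional within a block) or in an earlier block
    (causal across blocks)."""
    block_index = torch.arange(seq_len) // block_len             # block id per position
    return block_index.unsqueeze(1) >= block_index.unsqueeze(0)  # (seq_len, seq_len)

# Each denoising step decodes one block of L_B tokens, so attention involves
# roughly L_B * L query-key pairs rather than the L * L pairs a global
# diffusion step over the full sequence would require.
mask = block_causal_mask(seq_len=16, block_len=4)
assert mask[5, 6] and not mask[5, 8]   # within-block bidirectional; no future blocks
```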
A plausible implication is that block-wise discrete diffusion, with properly tuned scheduling, mask scaling, and curriculum, constitutes a viable, scalable backbone for vision–language understanding at large scale, bridging the historical gap to autoregressive and global diffusion designs (Cheng et al., 16 Dec 2025).
6. Relation to Broader Vision-Language Modeling Paradigms
SDAR-VL represents a distinct alternative to both classic AR and global diffusion approaches for VLU. It combines parallelism and bidirectionality within blocks with serial causality across blocks, and exploits curriculum learning to maximize sample efficiency. By demonstrating that block-wise diffusion, when carefully stabilized, achieves competitive or superior results on diverse VLU benchmarks, SDAR-VL contributes a new axis for the design of multimodal transformers and could inform future architectures that require efficient, stable modeling of long-range multimodal dependencies.