
SDAR-VL: Block-wise Diffusion for VLU

Updated 23 December 2025
  • SDAR-VL is a framework designed for large-scale vision-language understanding using block-wise discrete diffusion with parallel denoising and sequential dependency preservation.
  • It integrates asynchronous noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum to reduce training cost, improve convergence stability, and enhance task performance.
  • Empirical results reveal that SDAR-VL outperforms conventional autoregressive and global diffusion baselines across diverse benchmarks, demonstrating significant efficiency and stability gains.

SDAR-VL (Stable and Efficient Block-wise Discrete Diffusion for Vision-Language Understanding) is a framework that advances the systematic application of block-wise discrete diffusion to large-scale vision–language understanding (VLU). By integrating asynchronous block-wise noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum, SDAR-VL attains improved training efficiency, convergence stability, and task performance compared to conventional block diffusion, autoregressive (AR), and global diffusion baselines. These innovations address major obstacles to practical block-wise diffusion in VLU, including high training cost, slow convergence, and instability (Cheng et al., 16 Dec 2025).

1. Block-wise Discrete Diffusion: Principles and Mathematical Structure

Traditional AR vision-language decoders sequentially generate tokens with strict left-to-right causality, impeding parallel decoding and full-sequence bidirectional context. In contrast, discrete diffusion models randomly corrupt input sequences via token masking and then learn to denoise all masked positions in parallel. While global diffusion models scale poorly ($O(L^2)$ per step for sequence length $L$) and lack a causal prior for textual structure, block-wise discrete diffusion (BD³) offers a hybrid approach.

A token sequence $x = (x^i)_{1 \leq i \leq L}$ is partitioned into $B$ blocks $x = [x^1, x^2, \ldots, x^B]$, each with $L' = L/B$ tokens. The model factorizes the data likelihood as:

$$\log p_\theta(x) = \sum_{b=1}^{B} \log p_\theta(x^b \mid x^{<b})$$

For each block $b$, the forward process masks a random fraction $t$ of tokens to obtain $x_t^b$. The reverse process, parameterized by a transformer with block-causal self-attention, reconstructs masked tokens using bidirectional information within the block and full context from preceding blocks.
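To make the block-causal attention pattern concrete, the short sketch below builds such a mask: positions attend bidirectionally within their own block and causally to all positions in earlier blocks. This is a minimal illustration under assumed shapes and naming, not the authors' implementation.

```python
import torch

def block_causal_mask(seq_len: int, block_len: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for block-causal self-attention.

    Position i attends to position j iff j's block index <= i's block index:
    bidirectional inside a block, causal across blocks.
    """
    block_ids = torch.arange(seq_len) // block_len            # block index per position
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)   # (seq_len, seq_len)

# Example: L = 8 tokens split into B = 2 blocks of L' = 4 tokens each.
print(block_causal_mask(seq_len=8, block_len=4).int())
```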

The negative evidence lower bound (NELBO) per block is:

$$\mathcal{L}_{\text{BD3}}(\theta) = \mathbb{E}_{x,b,t} \left[ -\frac{1}{t} \sum_{\ell \in \mathcal{M}_t^b} \log p_\theta\!\left(x_0^{b,\ell} \mid x_t^b, x^{<b}\right) \right]$$

Here, $\mathcal{M}_t^b$ denotes the set of masked positions sampled for block $b$ at corruption level $t$ (Cheng et al., 16 Dec 2025).
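A minimal sketch of estimating this per-block term, assuming a `model` that maps a token-id tensor to per-position logits under block-causal attention; the masking and $1/t$ weighting follow the formula above, while all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bd3_block_loss(model, x0, block_idx, block_len, t, mask_token_id):
    """Monte-Carlo estimate of the NELBO term for one block.

    x0: (batch, seq_len) clean token ids; the block occupies
        [block_idx * block_len, (block_idx + 1) * block_len).
    t:  corruption level in (0, 1]; each block token is masked independently
        with probability t. Tokens outside the block stay clean (later blocks
        are invisible anyway under block-causal attention).
    """
    start, end = block_idx * block_len, (block_idx + 1) * block_len
    xt = x0.clone()
    masked = torch.rand(x0.shape[0], block_len, device=x0.device) < t   # M_t^b
    xt[:, start:end][masked] = mask_token_id

    logits = model(xt)                                   # (batch, seq_len, vocab)
    logp = F.log_softmax(logits[:, start:end], dim=-1)
    token_logp = logp.gather(-1, x0[:, start:end].unsqueeze(-1)).squeeze(-1)

    # -1/t * sum over masked positions, averaged over the batch
    return -(token_logp * masked.float()).sum() / (t * x0.shape[0])
```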

2. Integrated SDAR-VL Training Components

SDAR-VL incorporates three synergistic components to address instability and inefficiency in block-wise diffusion:

  • Asynchronous Block-wise Noise Scheduling (ABNS): Each block $b$ in a sequence independently samples a noise level $t_b$ from a (potentially curriculum-controlled) distribution, rather than applying a single $t$ to all blocks. This strategy diversifies corruption levels within each batch, reducing loss variance and stabilizing gradients.
  • Effective Mask Ratio Scaling (EMRS): The empirical masked fraction $t_b'$ may differ from the target $t_b$ due to stochastic masking. EMRS replaces $1/t_b$ with $1/t_b'$ in the loss normalization, yielding an unbiased objective and further smoothing training dynamics.
  • Progressive Beta-Distribution Noise Curriculum (PBNC): The masking schedule gradually increases mean and concentration parameters of the corruption-level Beta distribution during training, improving sample efficiency by increasing mask coverage while retaining corruption diversity through late exposure to low-mask cases.

The table below summarizes the role of each component; a combined training-step sketch follows the table.

| Component | Objective | Effect on Training |
| --- | --- | --- |
| ABNS | Diversify block corruption per batch | Reduces gradient variance |
| EMRS | Use actual mask ratio for weighting | Removes bias in loss scaling |
| PBNC | Curriculum over noise schedule | Enhances sample efficiency |
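The sketch below shows one way the three components could fit together in a training step: each block draws its own corruption level from a Beta distribution whose parameters follow a progress-dependent curriculum (ABNS + PBNC), and the block loss is re-weighted by the empirically realized mask ratio rather than the sampled target (EMRS). The schedule shape, parameter values, and names are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def sample_block_noise_levels(num_blocks: int, progress: float, c_final: float = 50.0):
    """ABNS + PBNC: draw an independent corruption level t_b per block from a
    Beta distribution whose mean and concentration grow with training progress
    in [0, 1]. The specific schedule here is an illustrative assumption."""
    mean = 0.5 + 0.4 * progress                   # drift toward heavier masking
    conc = 2.0 + (c_final - 2.0) * progress       # sharpen the distribution late in training
    alpha, beta = mean * conc, (1.0 - mean) * conc
    return torch.distributions.Beta(alpha, beta).sample((num_blocks,))

def emrs_weight(block_mask: torch.Tensor) -> torch.Tensor:
    """EMRS: weight the block loss by 1 / t'_b, where t'_b is the fraction of
    tokens actually masked (clamped to avoid division by zero)."""
    t_eff = block_mask.float().mean().clamp_min(1e-6)
    return 1.0 / t_eff

# Example: per-block noise levels for B = 4 blocks at 30% of training.
print(sample_block_noise_levels(num_blocks=4, progress=0.3))
```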

3. Training Protocol and Convergence Behavior

SDAR-VL is trained using a multi-stage protocol over ≈70 billion tokens, with stages covering projector-only alignment, multimodal capability acquisition, reasoning, and long chain-of-thought (CoT) distillation. Sequences are packed to 8k–16k tokens per sample to fully utilize GPU memory. Key optimization hyperparameters include (a configuration sketch follows the list):

  • Vision tower learning rate: $2 \times 10^{-6}$
  • Language tower learning rate: $1 \times 10^{-5}$
  • Projector learning rate: $1 \times 10^{-3} \rightarrow 1 \times 10^{-5}$
  • AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay $0.1$, and linear LR warmup
  • Block count $B = 4$–$8$, block length $L'$ matching SDAR-Chat (Cheng et al., 16 Dec 2025)
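A configuration sketch of the optimizer setup above, with per-module parameter groups carrying the listed learning rates; the module handles are placeholders and the warmup/decay scheduler is omitted, so treat this as an assumption-laden illustration rather than the paper's training code.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual towers (illustrative only).
vision_tower, language_tower, projector = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

param_groups = [
    {"params": vision_tower.parameters(),   "lr": 2e-6},
    {"params": language_tower.parameters(), "lr": 1e-5},
    {"params": projector.parameters(),      "lr": 1e-3},  # later decayed toward 1e-5
]
optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.95), weight_decay=0.1)
# Linear LR warmup would be applied via a scheduler (not shown).
```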

Training curves empirically validate the three components: ABNS reduces per-step loss variance; EMRS further accelerates convergence; and PBNC enables high final mask ratios without sacrificing diversity, improving performance by ≈1% absolute. Gradient-variance reduction from ABNS is measured at 15–30% for the 4B variant.

4. Empirical Results and Ablation Studies

SDAR-VL achieves competitive or superior task performance relative to both AR (LLaVA-OneVision) and global diffusion (LLaDA-V) baselines across an evaluation suite of 21 benchmarks, including single-image, multi-image, video, and math/document tasks.

4.1 Controlled Group Benchmark Averages

| Model | Single-Img Avg | Multi-Img/Video Avg |
| --- | --- | --- |
| LLaDA-V-8B | ≈59.0% | 61.0% |
| LLaVA-OV-7B (AR) | ≈58.0% | 66.8% |
| SDAR-VL-4B | ≈57.6% | 61.7% |
| SDAR-VL-8B | ≈62.5% | 65.0% |

4.2 Chain-of-Thought Distillation

SDAR-VL-Think-8B, after distillation from R1-OneVision, matches or exceeds AR baselines in math-oriented benchmarks, with up to +14% lift on MathVerse.

4.3 Component Ablation

Ablation on the 4B variant shows incremental gains as each training module is added:

| Method | ABNS | EMRS | PBNC (C=50) | Downstream Avg |
| --- | --- | --- | --- | --- |
| Baseline (SNS) | – | – | – | 36.8 |
| + ABNS | ✓ | – | – | 37.5 (+0.7) |
| + EMRS | ✓ | ✓ | – | 37.8 (+1.0) |
| + PBNC | ✓ | ✓ | ✓ | 40.1 (+3.3) |

Per the table, PBNC contributes the largest incremental gain (+2.3 points over ABNS+EMRS), with ABNS and EMRS adding smaller but consistent improvements (Cheng et al., 16 Dec 2025).

5. Efficiency, Stability, and Limitations

SDAR-VL attains efficiency by reducing per-step complexity from $O(L^2)$ (global diffusion) to $O(B \cdot L'^2)$ through block partitioning, while preserving AR dependency between blocks; for instance, with $L = 8192$ and $B = 8$ blocks of $L' = 1024$ tokens, within-block attention costs roughly $8 \cdot 1024^2 \approx 8.4$M token-pair interactions per step instead of $8192^2 \approx 67$M. ABNS and EMRS jointly reduce training instability and bias, and PBNC balances mask coverage with token-sample diversity. The method currently retains the diffusion requirement for multiple denoising steps per block during inference (a minimal inference-loop sketch follows the list below). Potential future enhancements include:

  • Reducing block-level inference steps via learned or one-step schedules
  • Adapting to dynamic block partitioning for variable-length multimodal data
  • Extending block-diffusion to generative tasks (infill, captioning)
  • Integrating key-value cache acceleration for further throughput gains
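For concreteness, below is a minimal sketch of block-sequential inference as described above: each block is initialized to mask tokens and denoised over several steps, conditioned on previously completed blocks. The confidence-based unmasking rule, step budget, and all names are assumptions, not the paper's decoding algorithm.

```python
import torch

@torch.no_grad()
def generate_block(model, prefix, block_len, mask_token_id, num_steps=8):
    """Denoise one block of length block_len, conditioned on `prefix`
    (a (1, P) tensor of previously generated token ids). At each step the
    most confident still-masked positions are committed."""
    block = torch.full((1, block_len), mask_token_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = block == mask_token_id
        if not still_masked.any():
            break
        logits = model(torch.cat([prefix, block], dim=1))[:, -block_len:]
        conf, pred = logits.softmax(-1).max(-1)           # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0)      # only consider masked slots
        k = max(1, still_masked.sum().item() // (num_steps - step))
        top = conf.topk(k, dim=-1).indices[0]
        block[0, top] = pred[0, top]                      # commit the top-k tokens
    return block
```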

A plausible implication is that block-wise discrete diffusion, with properly tuned scheduling, mask scaling, and curriculum, constitutes a viable, scalable backbone for vision–language understanding at large scale, bridging the historical gap to autoregressive and global diffusion designs (Cheng et al., 16 Dec 2025).

6. Relation to Broader Vision-Language Modeling Paradigms

SDAR-VL represents a distinct alternative to both classic AR and global diffusion approaches for VLU. It combines parallelism and bidirectionality within blocks, serial causality across blocks, and exploits curriculum learning to maximize sample efficiency. By demonstrating that block-wise diffusion, when carefully stabilized, achieves state-of-the-art results on diverse VLU benchmarks, SDAR-VL contributes a new axis for the design of multimodal transformers and could inform future architectures that require efficient, stable modeling of long-range multimodal dependencies.
