SDAR-VL: Block-wise Diffusion for VLU
- SDAR-VL is a framework designed for large-scale vision-language understanding using block-wise discrete diffusion with parallel denoising and sequential dependency preservation.
- It integrates asynchronous noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum to reduce training cost, improve convergence stability, and enhance task performance.
- Empirical results show that SDAR-VL matches or exceeds conventional autoregressive and global diffusion baselines across diverse benchmarks, with substantial efficiency and stability gains.
SDAR-VL (Stable and Efficient Block-wise Discrete Diffusion for Vision-Language Understanding) is a framework that advances the systematic application of block-wise discrete diffusion to large-scale vision–language understanding (VLU). By integrating asynchronous block-wise noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum, SDAR-VL attains improved training efficiency, convergence stability, and task performance compared to conventional block diffusion, autoregressive (AR), and global diffusion baselines. These innovations address major obstacles to practical block-wise diffusion in VLU, including high training cost, slow convergence, and instability (Cheng et al., 16 Dec 2025).
1. Block-wise Discrete Diffusion: Principles and Mathematical Structure
Traditional AR vision-language decoders sequentially generate tokens with strict left-to-right causality, impeding parallel decoding and full-sequence bidirectional context. In contrast, discrete diffusion models randomly corrupt input sequences via token masking and then learn to denoise all masked positions in parallel. While global diffusion models scale poorly ($O(L^2)$ per denoising step for sequence length $L$) and lack a causal prior for textual structure, block-wise discrete diffusion (BD³) offers a hybrid approach.
A token sequence $x$ of length $L$ is partitioned into $B$ blocks $x^1, \dots, x^B$, each with $L_B$ tokens. The model factorizes the data likelihood as:
$$p_\theta(x) = \prod_{b=1}^{B} p_\theta\!\left(x^b \mid x^{<b}\right)$$
For each block $x^b$, the forward process masks a random fraction $t \in (0,1]$ of its tokens to obtain the corrupted block $\tilde{x}^b_t$. The reverse process, parameterized by a transformer with block-causal self-attention, reconstructs the masked tokens using bidirectional information within the block and full context from the preceding blocks $x^{<b}$.
The negative evidence lower bound (NELBO) per block is:
$$\mathcal{L}_b = \mathbb{E}_{t}\, \mathbb{E}_{\tilde{x}^b_t}\!\left[ \frac{1}{t} \sum_{\ell \in \mathcal{M}^b_t} -\log p_\theta\!\left(x^b_\ell \mid \tilde{x}^b_t,\, x^{<b}\right) \right]$$
Here, $\mathcal{M}^b_t$ denotes the set of masked positions sampled for block $b$ at corruption level $t$ (Cheng et al., 16 Dec 2025).
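To make the objective concrete, the following is a minimal sketch of one Monte-Carlo term of the per-block loss. It assumes a generic block-causal transformer `model` that maps token ids to vocabulary logits and a reserved `MASK_ID`; neither name is taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical reserved mask-token id

def block_nelbo_term(model, x, block_id, block_len, t):
    """One Monte-Carlo term of the per-block NELBO.

    x         : (batch, seq_len) clean token ids
    block_id  : index b of the block being corrupted
    block_len : tokens per block (L_B)
    t         : corruption level in (0, 1] sampled for this block
    """
    start, end = block_id * block_len, (block_id + 1) * block_len
    x_b = x[:, start:end]

    # Forward process: mask each token of block b independently with probability t.
    masked = torch.rand(x_b.shape, device=x.device) < t
    x_corrupt = x.clone()
    x_corrupt[:, start:end] = torch.where(masked, torch.full_like(x_b, MASK_ID), x_b)

    # Reverse process: bidirectional within the block, causal over preceding blocks
    # (the block-causal attention pattern is assumed to live inside `model`).
    logits_b = model(x_corrupt)[:, start:end]            # (batch, L_B, vocab)

    nll = F.cross_entropy(
        logits_b.reshape(-1, logits_b.size(-1)),
        x_b.reshape(-1),
        reduction="none",
    ).reshape(x_b.shape)

    # 1/t weighting over the sampled masked set M_t^b, normalized per token
    # and averaged over the batch.
    return (nll * masked).sum() / (t * block_len * x.size(0))
```

During training, each block would receive its own corruption level $t$, which is exactly where the scheduling components of the next section enter.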
2. Integrated SDAR-VL Training Components
SDAR-VL incorporates three synergistic components to address instability and inefficiency in block-wise diffusion:
- Asynchronous Block-wise Noise Scheduling (ABNS): Each block $b$ in a sequence independently samples a noise level $t_b$ from a (potentially curriculum-controlled) distribution, rather than applying a single $t$ to all blocks. This strategy diversifies corruption levels within each batch, reducing loss variance and stabilizing gradients.
- Effective Mask Ratio Scaling (EMRS): The empirical masked fraction $\hat{t}_b$ may differ from the target $t_b$ due to stochastic masking. EMRS replaces $t_b$ with $\hat{t}_b$ in the loss normalization, yielding an unbiased objective and further smoothing training dynamics.
- Progressive Beta-Distribution Noise Curriculum (PBNC): The masking schedule gradually increases the mean and concentration parameters of the corruption-level Beta distribution during training, improving sample efficiency through higher mask coverage while still exposing the model to low-mask cases late in training to retain corruption diversity (a schematic composition of all three components is sketched after the table below).
The table below summarizes the role of each component:
| Component | Objective | Effect on Training |
|---|---|---|
| ABNS | Diversify block corruption per batch | Reduces gradient variance |
| EMRS | Use actual mask ratio for weighting | Removes bias in loss scaling |
| PBNC | Curriculum over noise schedule | Enhances sample efficiency |
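The sketch below shows one way the three components could compose in a training step; the Beta endpoints, curriculum interpolation, and all function names are illustrative assumptions rather than the authors' code.

```python
import torch

def pbnc_beta_params(step, total_steps, start=(1.0, 1.0), end=(8.0, 2.0)):
    """PBNC (schematic): linearly interpolate the Beta(a, b) parameters so the
    corruption-level distribution drifts toward a higher mean and concentration
    as training progresses; placeholder endpoints, not the paper's schedule."""
    u = min(step / max(total_steps, 1), 1.0)
    a = start[0] + u * (end[0] - start[0])
    b = start[1] + u * (end[1] - start[1])
    return a, b

def abns_sample(num_blocks, step, total_steps):
    """ABNS: each block draws its own corruption level t_b from the
    curriculum-controlled Beta distribution, instead of one shared t."""
    a, b = pbnc_beta_params(step, total_steps)
    return torch.distributions.Beta(a, b).sample((num_blocks,))

def emrs_block_loss(nll_b, mask_b):
    """EMRS: normalize by the empirically realized mask ratio t_hat_b
    (rather than the target t_b), removing bias from stochastic masking.

    nll_b  : (batch, L_B) per-token negative log-likelihoods for block b
    mask_b : (batch, L_B) boolean mask of positions actually corrupted
    """
    t_hat = mask_b.float().mean().clamp(min=1e-6)        # empirical mask ratio
    return (nll_b * mask_b).sum() / (t_hat * mask_b.numel())

# Example: per-block noise levels for an 8-block sequence, 1k steps into training.
t_blocks = abns_sample(num_blocks=8, step=1_000, total_steps=100_000)
```

With this composition, the Beta distribution's drift toward higher mean and concentration realizes the curriculum, while the per-block draws and the $\hat{t}_b$-based normalization supply the variance and bias reductions summarized in the table.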
3. Training Protocol and Convergence Behavior
SDAR-VL is trained with a multi-stage protocol over ≈70 billion tokens, with stages spanning projector-only alignment, multimodal capability acquisition, reasoning, and long chain-of-thought (CoT) distillation. Samples are packed into sequences of 8k–16k tokens to fully utilize GPU memory. Key optimization hyperparameters include the following (an illustrative optimizer sketch follows the list):
- Vision tower learning rate:
- Language tower learning rate:
- Projector learning rate:
- AdamW optimizer with weight decay $0.1$ and linear LR warmup
- Block count of up to $8$, with block length matching SDAR-Chat (Cheng et al., 16 Dec 2025)
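The per-module learning rates listed above can be expressed as AdamW parameter groups. The sketch below is illustrative only: the learning-rate values, warmup length, and attribute names (`vision_tower`, `language_tower`, `projector`) are placeholders, not settings from the paper.

```python
import torch

def build_optimizer(model, lr_vision=1e-6, lr_language=1e-5, lr_projector=1e-4,
                    warmup_steps=1_000):
    """AdamW with per-module learning rates and linear warmup.
    All learning-rate values and the warmup length are placeholders."""
    param_groups = [
        {"params": model.vision_tower.parameters(),   "lr": lr_vision},
        {"params": model.language_tower.parameters(), "lr": lr_language},
        {"params": model.projector.parameters(),      "lr": lr_projector},
    ]
    optimizer = torch.optim.AdamW(param_groups, weight_decay=0.1)
    # Linear warmup from 10% to 100% of each group's base LR.
    scheduler = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, total_iters=warmup_steps
    )
    return optimizer, scheduler
```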
Training curves empirically validate the three components: ABNS reduces per-step loss variance; EMRS further accelerates convergence; and PBNC enables high final mask ratios without sacrificing diversity, improving performance by ≈1% absolute. ABNS additionally yields a measurable reduction in gradient variance for the 4B variant.
4. Empirical Results and Ablation Studies
SDAR-VL achieves competitive or superior task performance relative to both AR (LLaVA-OneVision) and global diffusion (LLaDA-V) baselines across an evaluation suite of 21 benchmarks, including single-image, multi-image, video, and math/document tasks.
4.1 Controlled Group Benchmark Averages
| Model | Single-Img Avg | Multi-Img/Video Avg |
|---|---|---|
| LLaDA-V-8B | ≈59.0% | 61.0% |
| LLaVA-OV-7B (AR) | ≈58.0% | 66.8% |
| SDAR-VL-4B | ≈57.6% | 61.7% |
| SDAR-VL-8B | ≈62.5% | 65.0% |
4.2 Chain-of-Thought Distillation
SDAR-VL-Think-8B, after distillation from R1-OneVision, matches or exceeds AR baselines in math-oriented benchmarks, with up to +14% lift on MathVerse.
4.3 Component Ablation
Ablation on the 4B variant shows incremental gains as each training module is added:
| Method | ABNS | EMRS | PBNC (C=50) | Downstream Avg |
|---|---|---|---|---|
| Baseline (SNS) | – | – | – | 36.8 |
| + ABNS | ✓ | – | – | 37.5 (+0.7) |
| + EMRS | ✓ | ✓ | – | 37.8 (+1.0) |
| + PBNC | ✓ | ✓ | ✓ | 40.1 (+3.3) |
In this ablation, PBNC contributes the largest incremental gain (+2.3 points over ABNS+EMRS), while ABNS and EMRS add +0.7 and +0.3 points, respectively (Cheng et al., 16 Dec 2025).
5. Efficiency, Stability, and Limitations
SDAR-VL attains efficiency by reducing per-step complexity from $O(L^2)$ for global diffusion to roughly $O(L_B \cdot L)$ through block partitioning, while preserving AR dependencies between blocks (the attention-mask sketch after the list below illustrates this pattern). ABNS and EMRS jointly reduce training instability and bias, and PBNC balances mask coverage with token-sample diversity. The method still requires multiple denoising steps per block during inference. Potential future enhancements include:
- Reducing block-level inference steps via learned or one-step schedules
- Adapting to dynamic block partitioning for variable-length multimodal data
- Extending block-diffusion to generative tasks (infill, captioning)
- Integrating key-value cache acceleration for further throughput gains
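To illustrate the inter-block causal prior and where the per-step savings come from, the following is a minimal sketch of the block-causal attention mask implied by the factorization in Section 1; it is a schematic reconstruction rather than the paper's implementation, and the cost comparison in the comments assumes standard dense attention.

```python
import torch

def block_causal_mask(seq_len: int, block_len: int) -> torch.Tensor:
    """Boolean mask where position i may attend to position j iff j lies in the
    same block as i (bidirectional within a block) or in an earlier block
    (causal across blocks)."""
    block_index = torch.arange(seq_len) // block_len             # block id per position
    return block_index.unsqueeze(1) >= block_index.unsqueeze(0)  # (seq_len, seq_len)

# Each denoising step decodes one block of L_B tokens, so attention involves
# roughly L_B * L query-key pairs rather than the L * L pairs a global
# diffusion step over the full sequence would require.
mask = block_causal_mask(seq_len=16, block_len=4)
assert mask[5, 6] and not mask[5, 8]   # within-block bidirectional; no future blocks
```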
A plausible implication is that block-wise discrete diffusion, with properly tuned scheduling, mask scaling, and curriculum, constitutes a viable, scalable backbone for vision–language understanding at large scale, bridging the historical gap to autoregressive and global diffusion designs (Cheng et al., 16 Dec 2025).
6. Relation to Broader Vision-Language Modeling Paradigms
SDAR-VL represents a distinct alternative to both classic AR and global diffusion approaches for VLU. It combines parallelism and bidirectionality within blocks with serial causality across blocks, and exploits curriculum learning to maximize sample efficiency. By demonstrating that block-wise diffusion, when carefully stabilized, achieves competitive or superior results on diverse VLU benchmarks, SDAR-VL contributes a new axis for the design of multimodal transformers and could inform future architectures that require efficient, stable modeling of long-range multimodal dependencies.