Mask2DiT: Dual Mask Diffusion Transformer
- The paper introduces a dual mask mechanism integrated into the DiT backbone to achieve fine-grained segment-level text-video alignment and auto-regressive scene extension.
- It employs symmetric binary and segment-level conditional masks to enforce one-to-one prompt alignment while preserving cross-segment continuity.
- Empirical evaluations show a 15.94 pp gain in visual consistency and an 8.63 pp improvement in sequence consistency compared to models like CogVideoX.
Mask2DiT (Dual Mask-based Diffusion Transformer) is a Transformer-based framework that addresses multi-scene long video generation by integrating a dual-mask mechanism into the Diffusion Transformer (DiT) backbone. Mask2DiT enables fine-grained segment-level text-to-video alignment, auto-regressive scene extension, and improved temporal and visual coherence across complex multi-segment video sequences (Qi et al., 25 Mar 2025).
1. DiT Backbone and Motivation
The Diffusion Transformer (DiT) architecture replaces the traditional U-Net backbone of diffusion models with a pure Transformer, enabling global attention across visual and textual tokens. In text-to-video variants of DiT, a video is first encoded by a causal 3D-VAE into a visual token sequence $z_0$, while a text prompt is mapped to a text token sequence $y$. At each diffusion timestep $t$, noisy visual tokens are generated:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
These are concatenated with $y$ and input to an encoder-only Transformer $\epsilon_\theta$. The typical $\epsilon$-prediction loss is:

$$\mathcal{L} = \mathbb{E}_{z_0, y, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, y, t) \right\|_2^2 \right].$$
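As an illustrative sketch (not the paper's implementation), the forward noising step and the $\epsilon$-prediction loss take only a few lines of NumPy; the schedule values, shapes, and the placeholder `eps_theta` predictor are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: N_v visual tokens of dimension d (a real model would obtain
# z_0 from a causal 3D-VAE encoder, not from random numbers).
N_v, d = 16, 8
z0 = rng.standard_normal((N_v, d))           # clean latent tokens z_0

# Illustrative noise schedule: alpha_bar decreases with t.
alpha_bar = np.linspace(0.999, 0.01, 1000)

def add_noise(z0, t, eps):
    """Forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

eps = rng.standard_normal(z0.shape)
zt = add_noise(z0, t=500, eps=eps)

def eps_theta(zt, t):
    """Placeholder for the DiT network's noise prediction."""
    return np.zeros_like(zt)

# epsilon-prediction loss (text conditioning omitted in this toy sketch).
loss = np.mean((eps - eps_theta(zt, 500)) ** 2)
```

Since the placeholder predictor returns zeros, the loss here reduces to the mean squared noise; a trained DiT would drive this term toward zero.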
While DiT variants such as CogVideoX excel at generating high-quality single-scene videos, they lack architectural components for segment-level prompt alignment and coherent multi-scene transitions.
2. Dual-Mask Mechanism: Symmetric Binary and Segment-Level Conditional Masks
Mask2DiT extends the DiT backbone to handle multi-scene video by employing two complementary binary masks within every Transformer attention layer:
Symmetric Binary Mask.
For a video composed of $n$ scenes, each with $N_v$ visual tokens and $N_t$ text tokens, the concatenated input is

$$x = \left[\, y^{(1)}, z^{(1)}, y^{(2)}, z^{(2)}, \dots, y^{(n)}, z^{(n)} \,\right]$$

of length $L = n(N_t + N_v)$. The attention mask $M \in \{0,1\}^{L \times L}$ is constructed such that:
- Each text prompt only attends to its corresponding scene's text and visual tokens (one-to-one alignment).
- All visual tokens (from all segments) are fully self-connected, preserving temporal coherence and cross-segment continuity. Formally,

$$M_{ij} = \begin{cases} 1, & i, j \in \mathcal{V} \\ 1, & i \in \mathcal{T}_k \text{ and } j \in \mathcal{T}_k \cup \mathcal{V}_k \\ 1, & i \in \mathcal{V}_k \text{ and } j \in \mathcal{T}_k \\ 0, & \text{otherwise,} \end{cases}$$

where $\mathcal{T}_k$ and $\mathcal{V}_k$ denote the indices of text and visual tokens for scene $k$, and $\mathcal{V} = \bigcup_{k=1}^{n} \mathcal{V}_k$ is the set of all visual-token indices.
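The symmetric binary mask can be sketched in NumPy as follows; the token ordering, helper name, and shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def symmetric_mask(n_scenes, n_text, n_vis):
    """Build the symmetric binary attention mask for token order
    [T_1, V_1, T_2, V_2, ..., T_n, V_n]:
      * text tokens of scene k attend only to T_k and V_k;
      * visual tokens of scene k attend to T_k and to ALL visual tokens,
        preserving cross-segment continuity.
    """
    L = n_scenes * (n_text + n_vis)
    M = np.zeros((L, L), dtype=bool)
    vis_blocks = []
    for k in range(n_scenes):
        start = k * (n_text + n_vis)
        T = np.arange(start, start + n_text)
        V = np.arange(start + n_text, start + n_text + n_vis)
        vis_blocks.append(V)
        M[np.ix_(T, np.concatenate([T, V]))] = True  # text -> own scene only
        M[np.ix_(V, T)] = True                       # visual -> own scene's text
    all_vis = np.concatenate(vis_blocks)
    M[np.ix_(all_vis, all_vis)] = True               # visual <-> visual, globally
    return M

M = symmetric_mask(n_scenes=2, n_text=2, n_vis=3)
```

By construction the mask is symmetric: every text-to-visual connection within a scene is mirrored, while cross-scene text-visual entries stay zero.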
Segment-Level Conditional Mask.
To enable auto-regressive generation beyond a fixed number of scenes, Mask2DiT incorporates a segment-level conditional mask over the $n$ segments. During training, with probability $p$, all segments except the last are kept clean ($t = 0$), serving as fixed context, and only the $n$-th segment is denoised. The loss is:

$$\mathcal{L}_{\text{cond}} = \mathbb{E}_{z_0, y, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t^{(n)}, z_0^{(1:n-1)}, y, t\right) \right\|_2^2 \right],$$

computed only over the tokens of the final segment.
At inference, new video segments can be appended iteratively by treating existing segments as clean context and generating the next with appropriate masking.
3. Masked Self-Attention Implementation
The standard Transformer computes attention as:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V.$$

To enforce dual masking, Mask2DiT adds a log-zero bias $B$, where $B_{ij} = 0$ if $M_{ij} = 1$ and $B_{ij} = -\infty$ if $M_{ij} = 0$. Thus,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right) V.$$
This ensures each token only aggregates information as permitted by the symmetric segment alignment and cross-segment continuity constraints.
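A minimal NumPy sketch of masked attention with the log-zero bias (the function name and shapes are assumptions for illustration, not the paper's code):

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d) + B) V, with B = 0 where M is True and
    B = -inf where M is False (each row must allow at least one key)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(M, scores, -np.inf)            # log-zero bias
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
# With an identity mask each token attends only to itself, so output == V.
out = masked_attention(Q, K, V, np.eye(4, dtype=bool))
```

The identity-mask case is a quick sanity check: masked-out entries receive zero weight after the softmax, so each token's output is exactly its own value vector.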
4. Training and Inference Protocol
Mask2DiT adopts a two-stage training regime:
- Pre-training: On randomly concatenated single-scene clips, using only the symmetric attention mask and optimizing the standard $\epsilon$-prediction loss jointly over all segments.
- Supervised fine-tuning: On real multi-scene videos, interleaving the conditional loss (with probability $p$) and the non-conditional loss (with probability $1-p$), teaching both fixed-length and auto-regressive generation.
Inference (Auto-Regressive Extension):
- Encode existing segments and set their diffusion timestep to $t = 0$ (clean context).
- Sample noise and form noisy tokens for the new segment.
- Construct masks (both symmetric and conditional).
- Predict visual tokens for the new segment via masked-attention.
- Decode to video frames and iterate for further extension.
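The steps above can be sketched as a loop; `denoise_segment` is a hypothetical stand-in for the full masked-attention sampling procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
n_vis, d = 4, 8

def denoise_segment(clean_context, prompt):
    """Hypothetical stand-in: a real implementation would run the full
    diffusion sampling loop with both masks applied in every layer."""
    return rng.standard_normal((n_vis, d))   # pretend-sampled visual tokens

video = [rng.standard_normal((n_vis, d))]    # segment 1 already generated
for prompt in ["scene 2 prompt", "scene 3 prompt"]:
    context = np.concatenate(video)          # existing segments held at t = 0
    video.append(denoise_segment(context, prompt))
```

Each iteration freezes everything generated so far as context, so the video can be extended scene by scene without regenerating earlier segments.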
5. Empirical Evaluation
On a dataset of 5,371 multi-scene videos (three prompts each), Mask2DiT yields the following performance:
| Metric | Mask2DiT | CogVideoX |
|---|---|---|
| Visual Consistency | 70.95% | 55.01% |
| Semantic Consistency | 23.94% | 22.64% |
| Sequence Consistency | 47.45% | 38.82% |
| Fréchet Video Distance (lower is better) | 720.01 | 835.35 |
Mask2DiT achieves a 15.94 pp gain in visual consistency and an 8.63 pp increase in sequence consistency relative to the CogVideoX baseline. Qualitatively, actors, backgrounds, and visual styles exhibit high stability across segments, and each segment accurately matches its dedicated text prompt.
6. Architectural Context and Related Dual-Mask Approaches
Mask2DiT's dual-mask paradigm is distinct in its segment-specific alignment and temporal continuity objectives for video. Prior works such as MDTv2 (Gao et al., 2023) and the image-synthesis MaskDiT (Zheng et al., 2023) use masking strategies (typically on spatial tokens or class-conditioning) for accelerated training and improved context modeling in image diffusion, but do not address the segment-level textual grounding or auto-regressive scene compositionality required for multi-scene video. This highlights Mask2DiT's role in establishing prompt-segment alignments within long temporal video contexts.
7. Implications and Prospects
The dual-mask design in Mask2DiT demonstrates that precise segment-level prompt conditioning and temporal continuity can be achieved within a global self-attention framework through simple yet functionally powerful masking. The architecture enables both fixed-length and flexible, compositional video generation in an auto-regressive regime. This approach extends the capabilities of DiT-based models, opening avenues for research into large-scale, high-coherence multi-scene video synthesis and providing a strong blueprint for further extensions in conditional and sequential generative modeling (Qi et al., 25 Mar 2025).