
Mask²DiT: Dual Mask-based Diffusion Transformer

Updated 10 February 2026
  • The paper introduces a dual mask mechanism integrated into the DiT backbone to achieve fine-grained segment-level text-video alignment and auto-regressive scene extension.
  • It employs symmetric binary and segment-level conditional masks to enforce one-to-one prompt alignment while preserving cross-segment continuity.
  • Empirical evaluations show a 15.94 pp gain in visual consistency and an 8.63 pp improvement in sequence consistency compared to models like CogVideoX.

Mask²DiT (Dual Mask-based Diffusion Transformer) is a Transformer-based framework that addresses multi-scene long video generation by integrating a dual-mask mechanism into the Diffusion Transformer (DiT) backbone. Mask²DiT enables fine-grained segment-level text-to-video alignment, auto-regressive scene extension, and improved temporal and visual coherence across complex multi-segment video sequences (Qi et al., 25 Mar 2025).

1. DiT Backbone and Motivation

The Diffusion Transformer (DiT) architecture replaces the traditional U-Net backbone of diffusion models with a pure Transformer, enabling global attention across visual and textual tokens. In text-to-video variants of DiT, a video $x$ is first encoded by a causal 3D-VAE into a token sequence $z_V \in \mathbb{R}^{L_V \times d}$, while a text prompt is mapped to $z_T \in \mathbb{R}^{L_T \times d}$. At each diffusion timestep $t$, noisy visual tokens $z_t$ are generated:

$$z_t = \sqrt{\bar\alpha_t}\,z_V + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

These are concatenated with $z_T$ and input to an encoder-only Transformer $v_\theta$. The typical $v$-prediction loss is:

$$L_{\rm pre} = \mathbb{E}_{z_V, z_T, \epsilon, t}\left\| v - v_\theta(z_t, t, z_T) \right\|_2^2, \quad v = \sqrt{\bar\alpha_t}\,\epsilon - \sqrt{1-\bar\alpha_t}\,z_V.$$

While DiT variants such as CogVideoX excel at generating high-quality single-scene videos, they lack architectural components for segment-level prompt alignment and coherent multi-scene transitions.
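Under the standard DDPM convention assumed here ($z_t = \sqrt{\bar\alpha_t}\,z_V + \sqrt{1-\bar\alpha_t}\,\epsilon$), the noising step and the $v$-prediction target can be sanity-checked in a few lines of NumPy. Shapes and the schedule value are toy choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): L_V visual tokens, embedding dim d.
L_V, d = 8, 4
z_V = rng.standard_normal((L_V, d))      # clean visual tokens from the 3D-VAE
eps = rng.standard_normal((L_V, d))      # Gaussian noise
alpha_bar_t = 0.7                        # cumulative schedule value at step t

# Forward noising: z_t = sqrt(a)*z_V + sqrt(1-a)*eps
z_t = np.sqrt(alpha_bar_t) * z_V + np.sqrt(1 - alpha_bar_t) * eps

# v-prediction target: v = sqrt(a)*eps - sqrt(1-a)*z_V
v = np.sqrt(alpha_bar_t) * eps - np.sqrt(1 - alpha_bar_t) * z_V

# Consistency check: z_V is exactly recoverable from (z_t, v), i.e.
# sqrt(a)*z_t - sqrt(1-a)*v == z_V, so the two definitions agree.
recovered = np.sqrt(alpha_bar_t) * z_t - np.sqrt(1 - alpha_bar_t) * v
assert np.allclose(recovered, z_V)
```

The final identity confirms that the noising equation and the $v$ target are mutually consistent parameterizations of the same clean signal.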

2. Dual-Mask Mechanism: Symmetric Binary and Segment-Level Conditional Masks

Mask²DiT extends the DiT backbone to handle multi-scene video by employing two complementary binary masks within every Transformer attention layer:

Symmetric Binary Mask.

For a video composed of $n$ scenes, each with $L_V^{(s)}$ visual tokens and $L_T^{(s)}$ text tokens, the concatenated input is

$$[z_{T_1}, z_{T_2}, \ldots, z_{T_n};\; z_{t,1}, z_{t,2}, \ldots, z_{t,n}]$$

of length $L = n L_T^{(s)} + n L_V^{(s)}$. The attention mask $M \in \{0,1\}^{L \times L}$ is constructed such that:

  • Each text prompt $z_{T_k}$ only attends to its corresponding scene's text and visual tokens (one-to-one alignment).
  • All visual tokens (from all segments) are fully self-connected, preserving temporal coherence and cross-segment continuity. Formally,

$$M_{i,j} = \begin{cases} 1, & (i, j \in \mathcal{I}^V_1 \cup \cdots \cup \mathcal{I}^V_n) \lor (\exists k : i \in \mathcal{I}^T_k,\; j \in \mathcal{I}^T_k \cup \mathcal{I}^V_k) \\ 0, & \text{otherwise} \end{cases}$$

where $\mathcal{I}_k^T$ and $\mathcal{I}_k^V$ denote the indices of text and visual tokens for scene $k$.
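The index rule can be sketched as a small mask builder. This is a sketch, not the authors' code; in particular, we additionally enforce $M = M^\top$ (scene $k$'s visual tokens attend back to their own prompt), which is one reading of the mask's "symmetric" naming:

```python
import numpy as np

def symmetric_mask(n, lt, lv):
    """Symmetric binary attention mask for n scenes with lt text and lv
    visual tokens each; token layout [T_1..T_n, V_1..V_n] as in the text.
    Visual tokens are fully self-connected; each prompt is tied to its own
    scene only. (Enforcing M = M^T is our interpretation of the mask's
    name, not a verbatim transcription of the formula.)"""
    L = n * (lt + lv)
    M = np.zeros((L, L), dtype=int)
    tv = n * lt                                     # start of the visual block
    M[tv:, tv:] = 1                                 # all visual tokens see each other
    for k in range(n):
        t = slice(k * lt, (k + 1) * lt)             # text indices I_k^T
        v = slice(tv + k * lv, tv + (k + 1) * lv)   # visual indices I_k^V
        M[t, t] = 1                                 # prompt k attends to itself
        M[t, v] = 1                                 # prompt k attends to scene k's visuals
        M[v, t] = 1                                 # and symmetrically back
    return M
```

For example, with three scenes of 2 text and 4 visual tokens each, prompt 1 can see its own visual block but neither prompt 2 nor scene 2's visual tokens.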

Segment-Level Conditional Mask.

To enable auto-regressive generation beyond a fixed number of scenes, Mask²DiT incorporates a segment-level conditional mask $m_c = (m_{c,1}, \ldots, m_{c,n}) \in \{0,1\}^n$. During training, with probability $p$, all segments except the last are set clean ($t = 0$), serving as fixed context, and only the $n$-th segment is denoised. The loss is:

$$L_{\rm cond} = \mathbb{E}_{z_V, z_T, \epsilon, t} \sum_{i=1}^n m_{c,i} \left\| v - v_\theta(z_{t,i}, t, z_{T_i}) \right\|_2^2$$

At inference, new video segments can be appended iteratively by treating existing segments as clean context and generating the next with appropriate masking.
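A toy NumPy illustration of the conditional mask's effect: context segments keep $\bar\alpha = 1$ and stay clean, while the masked final segment is noised, and only masked segments contribute to the loss. Shapes and the zero-error "prediction" placeholder are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (illustrative): n scenes, each a (L_V, d) block of visual tokens.
n, L_V, d = 3, 8, 4
z_V = rng.standard_normal((n, L_V, d))       # clean tokens for all scenes
eps = rng.standard_normal((n, L_V, d))
alpha_bar_t = 0.5

# Segment-level conditional mask: only the last scene is denoised.
m_c = np.array([0, 0, 1])

# Context scenes stay clean (t = 0, i.e. alpha_bar = 1); masked scene is noised.
a = np.where(m_c == 1, alpha_bar_t, 1.0)[:, None, None]
z_t = np.sqrt(a) * z_V + np.sqrt(1 - a) * eps

# Conditional loss counts only masked segments. The model call is omitted:
# the "prediction" placeholder equals the true target, so the loss is zero
# by construction -- this only demonstrates the masking of the sum.
v_target = np.sqrt(a) * eps - np.sqrt(1 - a) * z_V
v_pred = v_target.copy()
loss = sum(m_c[i] * np.mean((v_target[i] - v_pred[i]) ** 2) for i in range(n))
```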

3. Masked Self-Attention Implementation

The standard Transformer computes attention as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

To enforce dual masking, Mask²DiT adds a log-zero bias:

$$A_{i,j} = \begin{cases} 0, & M_{i,j} = 1 \\ -\infty, & M_{i,j} = 0 \end{cases}$$

Thus,

$$\mathrm{MaskedAttn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d}} + A \right) V$$

This ensures each token only aggregates information as permitted by the symmetric segment alignment and cross-segment continuity constraints.
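A minimal NumPy sketch of this masked softmax (not the authors' implementation). Note that under the dual mask every token attends at least to itself, so no row of $A$ is entirely $-\infty$:

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """Scaled dot-product attention with a binary mask M: disallowed
    pairs (M == 0) receive a -inf additive bias before the softmax, so
    their weights become exactly zero."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    A = np.where(M == 1, 0.0, -np.inf)            # log-zero bias
    w = scores + A
    w = np.exp(w - w.max(axis=-1, keepdims=True))  # stable softmax
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# With the identity mask, each token attends only to itself,
# so the output reproduces V exactly.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 3)) for _ in range(3))
out = masked_attention(Q, K, V, np.eye(4))
```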

4. Training and Inference Protocol

Mask²DiT adopts a two-stage training regime:

  • Pre-training: On randomly concatenated single-scene clips, using only the symmetric attention mask and optimizing

$$L_{\rm pre}^{(n)} = \mathbb{E} \sum_{i=1}^n \| v - v_\theta(z_{t,i}, t, z_{T_i}) \|_2^2$$

  • Supervised fine-tuning: On real multi-scene videos, interleaving the conditional loss $L_{\rm cond}$ (with probability $p$) and the non-conditional loss (with probability $1-p$), teaching both fixed-length and auto-regressive generation.
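The interleaving rule during fine-tuning can be sketched as a per-step objective selector; the function name and structure are ours, not the authors':

```python
import numpy as np

def pick_segment_mask(n, p, rng):
    """Fine-tuning objective selector (a sketch of the interleaving rule
    above): with probability p, only the last of n segments is denoised
    (conditional loss L_cond, clean context); otherwise all n segments
    are denoised jointly (non-conditional loss)."""
    if rng.random() < p:
        m_c = np.zeros(n, dtype=int)
        m_c[-1] = 1                  # auto-regressive: denoise final segment only
    else:
        m_c = np.ones(n, dtype=int)  # non-conditional: denoise every segment
    return m_c

rng = np.random.default_rng(2)
mask = pick_segment_mask(3, p=0.3, rng=rng)  # p is an illustrative value
```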

Inference (Auto-Regressive Extension):

  1. Encode existing segments, setting $t = 0$ (clean context).
  2. Sample noise and form noisy tokens for the new segment.
  3. Construct masks (both symmetric and conditional).
  4. Predict visual tokens for the new segment via masked-attention.
  5. Decode to video frames and iterate for further extension.
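The five steps above can be sketched end-to-end with a placeholder sampler. All names here, including `toy_denoise`, are hypothetical stand-ins, not the authors' API; real inference would run the masked-attention DiT at steps 2-4:

```python
import numpy as np

def toy_denoise(context, prompt, L_V, d, rng):
    """Placeholder for the masked-attention DiT sampler (steps 2-4).
    Here it just returns tokens correlated with the last clean context
    segment, so the loop has deterministic shape behavior."""
    base = context[-1] if context else np.zeros((L_V, d))
    return 0.9 * base + 0.1 * rng.standard_normal((L_V, d))

def extend(segments, prompts, L_V=8, d=4, seed=0):
    """Auto-regressive extension loop (steps 1-5): each pass treats all
    existing segments as clean context (t = 0) and generates one new
    segment of visual tokens for the next prompt."""
    rng = np.random.default_rng(seed)
    segments = list(segments)
    for prompt in prompts:
        new = toy_denoise(segments, prompt, L_V, d, rng)  # steps 1-4
        segments.append(new)                              # step 5: iterate
    return segments
```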

5. Empirical Evaluation

On a dataset of 5,371 multi-scene videos (three prompts each), Mask²DiT yields the following performance:

Metric                    Mask²DiT    CogVideoX
Visual Consistency        70.95%      55.01%
Semantic Consistency      23.94%      22.64%
Sequence Consistency      47.45%      38.82%
Fréchet Video Distance    720.01      835.35

Mask²DiT achieves a 15.94 pp gain in visual consistency and an 8.63 pp increase in sequence consistency relative to the CogVideoX baseline. Qualitatively, actors, backgrounds, and visual styles exhibit high stability across segments, and each segment accurately matches its dedicated text prompt.

6. Relation to Prior Masking Approaches

Mask²DiT's dual-mask paradigm is distinct in its segment-specific alignment and temporal continuity objectives for video. Prior works such as MDTv2 (Gao et al., 2023) and MaskDiT (Zheng et al., 2023) use masking strategies (typically on spatial tokens or class conditioning) for accelerated training and improved context modeling in image diffusion, but do not address the segment-level textual grounding or auto-regressive scene compositionality required for multi-scene video. This highlights Mask²DiT's role in establishing prompt-segment alignments within long temporal video contexts.

7. Implications and Prospects

The dual-mask design in Mask²DiT demonstrates that precise segment-level prompt conditioning and temporal continuity can be achieved within a global self-attention framework by simple yet functionally powerful masking. The architecture enables both fixed-length and flexible, compositional video generation in an auto-regressive regime. This approach extends the capabilities of DiT-based models, opening avenues for research into large-scale, high-coherence multi-scene video synthesis and providing a strong blueprint for further extensions in conditional and sequential generative modeling (Qi et al., 25 Mar 2025).
