Mask2DiT: Dual Mask Diffusion Transformer
- The paper introduces a dual mask mechanism integrated into the DiT backbone to achieve fine-grained segment-level text-video alignment and auto-regressive scene extension.
- It employs symmetric binary and segment-level conditional masks to enforce one-to-one prompt alignment while preserving cross-segment continuity.
- Empirical evaluations show a 15.94 pp gain in visual consistency and an 8.63 pp improvement in sequence consistency compared to models like CogVideoX.
Mask2DiT (Dual Mask-based Diffusion Transformer) is a Transformer-based framework that addresses multi-scene long video generation by integrating a dual-mask mechanism into the Diffusion Transformer (DiT) backbone. Mask2DiT enables fine-grained segment-level text-to-video alignment, auto-regressive scene extension, and improved temporal and visual coherence across complex multi-segment video sequences (Qi et al., 25 Mar 2025).
1. DiT Backbone and Motivation
The Diffusion Transformer (DiT) architecture replaces the traditional U-Net backbone of diffusion models with a pure Transformer, enabling global attention across visual and textual tokens. In text-to-video variants of DiT, a video is first encoded by a causal 3D-VAE into a visual token sequence $z_0$, while a text prompt is mapped to a text token sequence $y$. At each diffusion timestep $t$, noisy visual tokens are generated:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
These are concatenated with $y$ and input to an encoder-only Transformer $\epsilon_\theta$. The typical $\epsilon$-prediction loss is:

$$\mathcal{L} = \mathbb{E}_{z_0, y, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, y, t) \right\|_2^2 \right].$$
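As an illustrative sketch (not the paper's implementation), the forward noising step and the $\epsilon$-prediction loss take only a few lines of NumPy; the schedule values, shapes, and the placeholder `eps_theta` predictor are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: N_v visual tokens of dimension d (a real model would obtain
# z_0 from a causal 3D-VAE encoder, not from random numbers).
N_v, d = 16, 8
z0 = rng.standard_normal((N_v, d))           # clean latent tokens z_0

# Illustrative noise schedule: alpha_bar decreases with t.
alpha_bar = np.linspace(0.999, 0.01, 1000)

def add_noise(z0, t, eps):
    """Forward process: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

eps = rng.standard_normal(z0.shape)
zt = add_noise(z0, t=500, eps=eps)

def eps_theta(zt, t):
    """Placeholder for the DiT network's noise prediction."""
    return np.zeros_like(zt)

# epsilon-prediction loss (text conditioning omitted in this toy sketch).
loss = np.mean((eps - eps_theta(zt, 500)) ** 2)
```

Since the placeholder predictor returns zeros, the loss here reduces to the mean squared noise; a trained DiT would drive this term toward zero.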
While DiT variants such as CogVideoX excel at generating high-quality single-scene videos, they lack architectural components for segment-level prompt alignment and coherent multi-scene transitions.
2. Dual-Mask Mechanism: Symmetric Binary and Segment-Level Conditional Masks
Mask2DiT extends the DiT backbone to handle multi-scene video by employing two complementary binary masks within every Transformer attention layer:
Symmetric Binary Mask.
For a video composed of $n$ scenes, each with $N_v$ visual tokens and $N_t$ text tokens, the concatenated input is

$$x = \left[\, y^{(1)}, z^{(1)}, y^{(2)}, z^{(2)}, \dots, y^{(n)}, z^{(n)} \,\right]$$

of length $L = n(N_t + N_v)$. The attention mask $M \in \{0,1\}^{L \times L}$ is constructed such that:
- Each text prompt only attends to its corresponding scene's text and visual tokens (one-to-one alignment).
- All visual tokens (from all segments) are fully self-connected, preserving temporal coherence and cross-segment continuity. Formally,

$$M_{ij} = \begin{cases} 1, & i, j \in \mathcal{V} \\ 1, & i \in \mathcal{T}_k \text{ and } j \in \mathcal{T}_k \cup \mathcal{V}_k \\ 1, & i \in \mathcal{V}_k \text{ and } j \in \mathcal{T}_k \\ 0, & \text{otherwise,} \end{cases}$$

where $\mathcal{T}_k$ and $\mathcal{V}_k$ denote the indices of text and visual tokens for scene $k$, and $\mathcal{V} = \bigcup_{k=1}^{n} \mathcal{V}_k$ is the set of all visual-token indices.
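The symmetric binary mask can be sketched in NumPy as follows; the token ordering, helper name, and shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def symmetric_mask(n_scenes, n_text, n_vis):
    """Build the symmetric binary attention mask for token order
    [T_1, V_1, T_2, V_2, ..., T_n, V_n]:
      * text tokens of scene k attend only to T_k and V_k;
      * visual tokens of scene k attend to T_k and to ALL visual tokens,
        preserving cross-segment continuity.
    """
    L = n_scenes * (n_text + n_vis)
    M = np.zeros((L, L), dtype=bool)
    vis_blocks = []
    for k in range(n_scenes):
        start = k * (n_text + n_vis)
        T = np.arange(start, start + n_text)
        V = np.arange(start + n_text, start + n_text + n_vis)
        vis_blocks.append(V)
        M[np.ix_(T, np.concatenate([T, V]))] = True  # text -> own scene only
        M[np.ix_(V, T)] = True                       # visual -> own scene's text
    all_vis = np.concatenate(vis_blocks)
    M[np.ix_(all_vis, all_vis)] = True               # visual <-> visual, globally
    return M

M = symmetric_mask(n_scenes=2, n_text=2, n_vis=3)
```

By construction the mask is symmetric: every text-to-visual connection within a scene is mirrored, while cross-scene text-visual entries stay zero.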
Segment-Level Conditional Mask.
To enable auto-regressive generation beyond a fixed number of scenes, Mask2DiT incorporates a segment-level conditional mask over the $n$ segments. During training, with probability $p$, all segments except the last are kept clean ($t = 0$), serving as fixed context, and only the $n$-th segment is denoised. The loss is:

$$\mathcal{L}_{\text{cond}} = \mathbb{E}_{z_0, y, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t^{(n)}, z_0^{(1:n-1)}, y, t\right) \right\|_2^2 \right],$$

computed only over the tokens of the final segment.
At inference, new video segments can be appended iteratively by treating existing segments as clean context and generating the next with appropriate masking.
3. Masked Self-Attention Implementation
The standard Transformer computes attention as:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V.$$

To enforce dual masking, Mask2DiT adds a log-zero bias $B$, where $B_{ij} = 0$ if $M_{ij} = 1$ and $B_{ij} = -\infty$ if $M_{ij} = 0$. Thus,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + B\right) V.$$
This ensures each token only aggregates information as permitted by the symmetric segment alignment and cross-segment continuity constraints.
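A minimal NumPy sketch of masked attention with the log-zero bias (the function name and shapes are assumptions for illustration, not the paper's code):

```python
import numpy as np

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d) + B) V, with B = 0 where M is True and
    B = -inf where M is False (each row must allow at least one key)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(M, scores, -np.inf)            # log-zero bias
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
# With an identity mask each token attends only to itself, so output == V.
out = masked_attention(Q, K, V, np.eye(4, dtype=bool))
```

The identity-mask case is a quick sanity check: masked-out entries receive zero weight after the softmax, so each token's output is exactly its own value vector.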
4. Training and Inference Protocol
Mask2DiT adopts a two-stage training regime:
- Pre-training: On randomly concatenated single-scene clips, using only the symmetric attention mask and optimizing the standard $\epsilon$-prediction loss jointly over all segments.
- Supervised fine-tuning: On real multi-scene videos, interleaving the conditional loss (with probability $p$) and the non-conditional loss (with probability $1-p$), teaching both fixed-length and auto-regressive generation.
Inference (Auto-Regressive Extension):
- Encode existing segments and set their diffusion timestep to $t = 0$ (clean context).
- Sample noise and form noisy tokens for the new segment.
- Construct masks (both symmetric and conditional).
- Predict visual tokens for the new segment via masked-attention.
- Decode to video frames and iterate for further extension.
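The steps above can be sketched as a loop; `denoise_segment` is a hypothetical stand-in for the full masked-attention sampling procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
n_vis, d = 4, 8

def denoise_segment(clean_context, prompt):
    """Hypothetical stand-in: a real implementation would run the full
    diffusion sampling loop with both masks applied in every layer."""
    return rng.standard_normal((n_vis, d))   # pretend-sampled visual tokens

video = [rng.standard_normal((n_vis, d))]    # segment 1 already generated
for prompt in ["scene 2 prompt", "scene 3 prompt"]:
    context = np.concatenate(video)          # existing segments held at t = 0
    video.append(denoise_segment(context, prompt))
```

Each iteration freezes everything generated so far as context, so the video can be extended scene by scene without regenerating earlier segments.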
5. Empirical Evaluation
On a dataset of 5,371 multi-scene videos (three prompts each), Mask2DiT yields the following performance:
| Metric | Mask2DiT | CogVideoX |
|---|---|---|
| Visual Consistency | 70.95% | 55.01% |
| Semantic Consistency | 23.94% | 22.64% |
| Sequence Consistency | 47.45% | 38.82% |
| Fréchet Video Distance (lower is better) | 720.01 | 835.35 |
Mask2DiT achieves a 15.94 pp gain in visual consistency and an 8.63 pp increase in sequence consistency relative to the CogVideoX baseline. Qualitatively, actors, backgrounds, and visual styles exhibit high stability across segments, and each segment accurately matches its dedicated text prompt.
6. Architectural Context and Related Dual-Mask Approaches
Mask2DiT's dual-mask paradigm is distinct in its segment-specific alignment and temporal continuity objectives for video. Prior works such as MDTv2 (Gao et al., 2023) and the image-synthesis MaskDiT (Zheng et al., 2023) use masking strategies (typically on spatial tokens or class-conditioning) for accelerated training and improved context modeling in image diffusion, but do not address the segment-level textual grounding or auto-regressive scene compositionality required for multi-scene video. This highlights Mask2DiT's role in establishing prompt-segment alignments within long temporal video contexts.
7. Implications and Prospects
The dual-mask design in Mask2DiT demonstrates that precise segment-level prompt conditioning and temporal continuity can be achieved within a global self-attention framework through simple yet functionally powerful masking. The architecture enables both fixed-length and flexible, compositional video generation in an auto-regressive regime. This approach extends the capabilities of DiT-based models, opening avenues for research into large-scale, high-coherence multi-scene video synthesis and providing a strong blueprint for further extensions in conditional and sequential generative modeling (Qi et al., 25 Mar 2025).