- The paper demonstrates that incorporating Test-Time Training layers into pre-trained Diffusion Transformers enables coherent one-minute video generation from complex text inputs.
- It combines local self-attention within short segments with globally applied TTT-MLP layers, integrated via gating and bidirectional processing, to capture long-range dependencies at linear cost.
- Empirical results on Tom and Jerry cartoons show significant gains, with +38 Elo in temporal consistency and +39 Elo in motion naturalness over baseline models.
This paper introduces a method for generating one-minute videos with complex, multi-scene stories by incorporating Test-Time Training (TTT) layers into a pre-trained Diffusion Transformer. The core problem is twofold: standard Transformers struggle with long video sequences because self-attention scales quadratically with sequence length, while existing linear-complexity RNN alternatives (such as Mamba) fail to capture complex long-range dependencies because their fixed-size hidden states are less expressive.
The proposed solution leverages TTT layers (arXiv:2407.04620), a type of RNN where the hidden state itself is a neural network (specifically, a two-layer MLP, termed TTT-MLP). This hidden-state network is updated via gradient steps on a self-supervised reconstruction loss during the processing of the input sequence, even at test time. This allows the TTT layer to dynamically adapt and compress long historical context more effectively than traditional RNNs with fixed-size matrix hidden states.
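To make this mechanism concrete, here is a minimal PyTorch sketch of a TTT-MLP layer: the hidden state is the weight pair of a two-layer MLP, and each incoming token triggers one gradient step on a reconstruction loss between two learned views of that token. The projection names (`proj_k`, `proj_v`, `proj_q`), the single-step update, the token-by-token loop, and all sizes are illustrative assumptions; the paper's optimized kernel updates the state over mini-batches of tokens rather than one token at a time.

```python
import torch
import torch.nn.functional as F


class TTTMLP(torch.nn.Module):
    """Hidden state = weights (W1, W2) of a two-layer MLP, updated at test time."""

    def __init__(self, dim: int, hidden: int, inner_lr: float = 1.0):
        super().__init__()
        # Outer-loop parameters, trained by ordinary back-propagation:
        # projections that produce the self-supervised views of each token.
        self.proj_k = torch.nn.Linear(dim, dim, bias=False)  # "training" view
        self.proj_v = torch.nn.Linear(dim, dim, bias=False)  # reconstruction target
        self.proj_q = torch.nn.Linear(dim, dim, bias=False)  # "test" view
        # Initial hidden state, also learned in the outer loop.
        self.W1_init = torch.nn.Parameter(0.02 * torch.randn(dim, hidden))
        self.W2_init = torch.nn.Parameter(0.02 * torch.randn(hidden, dim))
        self.inner_lr = inner_lr

    @staticmethod
    def f(x, W1, W2):
        return F.gelu(x @ W1) @ W2

    def forward(self, tokens):                      # tokens: (seq_len, dim)
        W1, W2 = self.W1_init.clone(), self.W2_init.clone()
        outputs = []
        for x_t in tokens:                          # token-by-token for clarity
            k, v, q = self.proj_k(x_t), self.proj_v(x_t), self.proj_q(x_t)
            # Inner loop: one gradient step on the reconstruction loss.
            # create_graph=True lets the outer loop back-propagate through it.
            loss = F.mse_loss(self.f(k, W1, W2), v)
            g1, g2 = torch.autograd.grad(loss, (W1, W2), create_graph=True)
            W1, W2 = W1 - self.inner_lr * g1, W2 - self.inner_lr * g2
            # The output token comes from the freshly updated hidden state.
            outputs.append(self.f(q, W1, W2))
        return torch.stack(outputs)


# Example: a 128-token sequence of 64-dimensional tokens.
layer = TTTMLP(dim=64, hidden=256)
y = layer(torch.randn(128, 64))                     # y: (128, 64)
```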
Implementation:
- Architecture Modification: The authors start with a pre-trained CogVideo-X 5B Diffusion Transformer [hong2023cogvideo], originally designed for short clips (3 seconds). They insert TTT-MLP layers into the sequence modeling blocks, specifically after the self-attention layer.
- Gating: A learned gating mechanism smoothly integrates the randomly initialized TTT layers during fine-tuning, preventing an initial performance drop; the gate is initialized close to zero so the pre-trained model's behavior is preserved at the start.
- Bi-directionality: Since diffusion models are non-causal, TTT layers are applied bidirectionally (processing the sequence forward and then backward) so that each position can draw on both past and future tokens (a sketch of the gating and bidirectional wiring appears after this list).
- Processing Pipeline:
- Videos are divided into 3-second segments, aligning with the pre-trained model's capacity.
- Text prompts can be provided in three formats: a short summary, a sentence-level plot, or a detailed storyboard. Detailed storyboards (Format 3) are the format actually used for both training and inference; shorter prompts are first converted into storyboards using Claude 3.7 Sonnet.
- Input sequences are formed by concatenating text tokens and (noisy) video tokens for each 3-second segment, resulting in an interleaved sequence for the entire video.
- Hybrid Attention/TTT: Self-attention is applied locally within each 3-second segment to keep its quadratic cost bounded, while TTT layers are applied globally across the concatenated sequence of all segments, providing long-range temporal modeling at linear cost (sketched after this list).
- Dataset and Fine-tuning:
- A dataset was curated using ~7 hours of Tom and Jerry cartoons (1940-1948). Videos were super-resolved, segmented into 3-second clips, and annotated with detailed paragraph descriptions (storyboards) for each segment.
- A multi-stage fine-tuning strategy was employed: the full model is first adapted to the Tom and Jerry domain on 3-second clips, after which video length is progressively increased (9, 18, 30, then 63 seconds) while fine-tuning only the TTT layers, gates, and attention parameters at a lower learning rate (see the parameter-selection sketch after this list).
- GPU Optimization:
- To handle the large hidden state (the MLP weights) of TTT-MLP layers efficiently, an "On-Chip Tensor Parallel" approach was implemented: the MLP weights are sharded across the fast on-chip memory (SMEM) of multiple Streaming Multiprocessors (SMs) on NVIDIA Hopper GPUs, with DSMEM used for inter-SM communication (AllReduce), minimizing costly transfers to and from main GPU memory (HBM). A conceptual sketch of this decomposition follows the list.
- Parallelization across the sequence length is achieved by updating the TTT hidden state in mini-batches of tokens.
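As referenced in the gating and bi-directionality items above, the sketch below shows one plausible way to wire a randomly initialized TTT layer into a pre-trained block: a learned per-channel tanh gate initialized at zero, plus a second TTT pass over the time-reversed sequence. The exact gate parameterization and the additive combination of the two directions are assumptions for illustration, not the paper's confirmed wiring.

```python
import torch


class GatedBiTTT(torch.nn.Module):
    """Residual TTT branch, gated and applied in both temporal directions."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.ttt_fwd = TTTMLP(dim, hidden)          # TTTMLP from the earlier sketch
        self.ttt_bwd = TTTMLP(dim, hidden)
        # tanh(0) = 0, so at the start of fine-tuning the block reduces to the
        # pre-trained behavior and the TTT path is blended in gradually.
        self.alpha_fwd = torch.nn.Parameter(torch.zeros(dim))
        self.alpha_bwd = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x):                           # x: (seq_len, dim)
        fwd = self.ttt_fwd(x)
        bwd = torch.flip(self.ttt_bwd(torch.flip(x, dims=[0])), dims=[0])
        return (x
                + torch.tanh(self.alpha_fwd) * fwd
                + torch.tanh(self.alpha_bwd) * bwd)
```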
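The hybrid local/global strategy referenced above can be sketched as follows: self-attention only sees the tokens of its own 3-second segment, while the TTT branch sweeps the entire concatenated sequence. The segment length, the single attention module, and the omission of text/video token bookkeeping are simplifications.

```python
import torch


class HybridBlock(torch.nn.Module):
    """Local self-attention per segment + global (gated, bidirectional) TTT."""

    def __init__(self, dim: int, hidden: int, heads: int, seg_len: int):
        super().__init__()
        self.seg_len = seg_len                      # tokens per 3-second segment
        self.local_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_ttt = GatedBiTTT(dim, hidden)   # from the gating sketch

    def forward(self, x):                           # x: (seq_len, dim), seq_len divisible by seg_len
        dim = x.shape[-1]
        # Local: attention is block-diagonal over segments, so its quadratic
        # cost grows with segment length, not with total video length.
        segs = x.view(-1, self.seg_len, dim)        # (num_segments, seg_len, dim)
        attn_out, _ = self.local_attn(segs, segs, segs, need_weights=False)
        x = (segs + attn_out).reshape(-1, dim)
        # Global: the linear-cost TTT branch spans the whole minute-long sequence.
        return self.global_ttt(x)
```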
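The later fine-tuning stages referenced above amount to a parameter-selection step: freeze the backbone and train only the TTT, gate, and attention parameters at a reduced learning rate. The parameter-name patterns and learning rate below are hypothetical, chosen only to illustrate the recipe.

```python
import torch


def configure_long_video_stage(model: torch.nn.Module, lr: float = 1e-5):
    """Freeze the backbone; train only TTT layers, gates, and attention."""
    trainable_patterns = ("ttt", "alpha", "attn")   # hypothetical parameter names
    trainable = []
    for name, p in model.named_parameters():
        keep = any(pat in name for pat in trainable_patterns)
        p.requires_grad_(keep)
        if keep:
            trainable.append(p)
    return torch.optim.AdamW(trainable, lr=lr)


# Stage schedule (per the paper): adapt the full model on 3-second clips first,
# then extend to 9s, 18s, 30s, and 63s videos with only the parameters above.
```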
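The on-chip tensor-parallel idea referenced above reduces to a standard column/row split of the MLP's hidden dimension: each shard (in the real kernel, an SM holding its slice in SMEM) computes a partial output, and an AllReduce over shards recovers the exact result. The plain-PyTorch sketch below illustrates only that decomposition; it is not the Hopper SMEM/DSMEM kernel.

```python
import torch
import torch.nn.functional as F


def sharded_mlp(x, W1, W2, num_shards: int):
    # W1: (dim, hidden), W2: (hidden, dim); split along the hidden dimension.
    W1_shards = W1.chunk(num_shards, dim=1)         # column-parallel first layer
    W2_shards = W2.chunk(num_shards, dim=0)         # row-parallel second layer
    partial = [F.gelu(x @ a) @ b for a, b in zip(W1_shards, W2_shards)]
    return torch.stack(partial).sum(dim=0)          # "AllReduce" over shards


# Sanity check: the sharded computation matches the unsharded two-layer MLP.
x = torch.randn(4, 64)
W1, W2 = torch.randn(64, 256), torch.randn(256, 64)
assert torch.allclose(sharded_mlp(x, W1, W2, 4), F.gelu(x @ W1) @ W2, atol=1e-4)
```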
Evaluation:
- TTT-MLP was compared against baselines including local attention (no long-range mechanism), TTT-Linear (TTT with a linear model as hidden state), Mamba 2 [dao2024mamba2], Gated DeltaNet [yang2025gateddeltanetworksimproving], and sliding-window attention on generating 63-second videos.
- Human evaluation using pairwise comparisons focused on four axes: Text following, Motion naturalness, Aesthetics, and Temporal consistency.
- On one-minute videos, TTT-MLP significantly outperformed all baselines, leading the second-best method by 34 Elo points on average, with notable gains in temporal consistency (+38 Elo) and motion naturalness (+39 Elo).
- An elimination round on 18-second videos showed Gated DeltaNet performing best, suggesting that at shorter contexts (~100k tokens) simpler RNN layers can be more effective and TTT-MLP's advantages are less pronounced.
Limitations and Future Work:
- The generated videos, while demonstrating complex stories, still contain artifacts like object morphing at segment boundaries, unnatural physics, and inconsistent lighting/camera work, possibly inherited from the base CogVideo-X model.
- TTT-MLP is computationally more expensive (slower inference and training) than Mamba 2 or Gated DeltaNet, despite optimizations.
- Future directions include further optimizing the TTT-MLP kernel, exploring better ways to integrate TTT layers, using different backbones, and scaling to longer videos potentially with even larger neural networks as hidden states within TTT layers.
In summary, the paper demonstrates that using neural networks as expressive hidden states within RNN-like TTT layers allows Diffusion Transformers to generate coherent, minute-long videos with complex narratives, outperforming existing methods reliant on standard self-attention or simpler RNN state representations for long-context modeling. The work provides practical implementation details, including architectural modifications, a hybrid local-global processing strategy, multi-stage training, and specialized GPU kernels.