Seaweed-7B: Cost-Effective Training of a Video Generation Foundation Model
This technical report introduces Seaweed-7B, a diffusion transformer-based foundation model for video generation. Comprising approximately 7 billion parameters, Seaweed-7B is trained from scratch using 665,000 H100 GPU hours and is distinguished by its resource-efficient design and competitive performance against much larger models. The report describes the model's architecture, the key design decisions made during development, and the empirical evaluations underscoring its effectiveness.
Architectural Design and Empirical Evaluations
At the core of Seaweed-7B's design is a diffusion transformer (DiT) architecture optimized for resource-constrained training without compromising output quality. Design choices such as a hybrid-stream structure for improved parameter efficiency and multimodal rotary position embedding (MM-RoPE) markedly improve convergence and model scalability. The model employs a variational autoencoder (VAE) to encode image and video data into a compact latent space; the VAE uses a causal convolutional architecture that enables effective compression of both videos and images while maintaining reconstruction fidelity.
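To make the position-encoding idea concrete, the sketch below illustrates a factorized 3D rotary embedding of the kind MM-RoPE denotes: the channels of each attention head are partitioned across the temporal, height, and width axes of the video latent, and each partition is rotated by position-dependent angles. The channel split, base frequency, and function names here are illustrative assumptions rather than the report's exact formulation.

```python
import torch

def rope_angles(pos, dim, base=10000.0):
    # pos: (N,) positions along one axis; dim: even number of channels for that axis.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return pos.float()[:, None] * inv_freq[None, :]          # (N, dim // 2)

def rotate(x, angles):
    # Rotate consecutive channel pairs of x (N, dim) by the given angles.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def mm_rope_3d(q, t_pos, h_pos, w_pos, split=(16, 24, 24)):
    # q: (N, head_dim) query or key vectors for N video patch tokens.
    # split: channels allocated to the time, height, and width axes (assumed split).
    dt, dh, dw = split
    qt, qh, qw = torch.split(q, [dt, dh, dw], dim=-1)
    qt = rotate(qt, rope_angles(t_pos, dt))
    qh = rotate(qh, rope_angles(h_pos, dh))
    qw = rotate(qw, rope_angles(w_pos, dw))
    return torch.cat([qt, qh, qw], dim=-1)

# Example: 8 tokens from a 2x2x2 patch grid, head_dim = 64.
t, h, w = torch.meshgrid(torch.arange(2), torch.arange(2), torch.arange(2), indexing="ij")
q = torch.randn(8, 64)
q_rot = mm_rope_3d(q, t.reshape(-1), h.reshape(-1), w.reshape(-1))
```

Because positions along each axis are encoded independently, the same embedding applies to a single frame treated as an image and to videos of varying duration and resolution.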
Empirical evaluations on standard benchmarks show that Seaweed-7B delivers strong performance, often surpassing larger contemporary models trained with substantially more compute. In particular, Seaweed-7B generalizes well across a range of downstream tasks, such as text-to-video and image-to-video generation. Its win ratio in human evaluations further supports this competitiveness, with the model ranking highly despite its moderate size.
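For reference, a pairwise win ratio of the kind reported in such human evaluations can be computed as follows. The tallies below are purely illustrative placeholders, not numbers from the report, and the half-credit treatment of ties is an assumption.

```python
# Hypothetical tally of pairwise human preferences between Seaweed-7B and one baseline.
results = {"wins": 312, "losses": 238, "ties": 50}

def win_ratio(wins, losses, ties, half_credit_ties=True):
    """Fraction of comparisons won; ties optionally count as half a win."""
    total = wins + losses + ties
    credit = wins + (0.5 * ties if half_credit_ties else 0.0)
    return credit / total

print(f"win ratio: {win_ratio(results['wins'], results['losses'], results['ties']):.3f}")
```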
Training Strategies and Practical Implications
A multi-stage training strategy is adopted: pre-training begins with low-resolution image and video data and progresses to higher-resolution content, followed by supervised fine-tuning and reinforcement learning from human feedback (RLHF). This staged process helps Seaweed-7B align its visual output with textual prompts, improving aesthetic and motion quality while keeping computational demands manageable.
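The listing below sketches what such a coarse-to-fine schedule might look like when written out as configuration; the stage names, resolutions, and data mixes are illustrative assumptions, not values taken from the report.

```python
# Hypothetical multi-stage schedule mirroring the coarse-to-fine progression described above.
STAGES = [
    {"stage": "pretrain_low_res",  "data": "images + video",              "resolution": "256p", "alignment": None},
    {"stage": "pretrain_high_res", "data": "images + video",              "resolution": "480p", "alignment": None},
    {"stage": "post_train_sft",    "data": "curated high-quality video",  "resolution": "720p", "alignment": "supervised fine-tuning"},
    {"stage": "post_train_rlhf",   "data": "human preference pairs",      "resolution": "720p", "alignment": "RLHF"},
]

for s in STAGES:
    print(f"{s['stage']:<18} data={s['data']:<30} res={s['resolution']:<6} alignment={s['alignment']}")
```

The rationale for this ordering is that most of the compute is spent on cheaper low-resolution tokens, with the expensive high-resolution and alignment stages reserved for the end of training.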
The practical implications of such an approach are substantial. Seaweed-7B can be effectively adapted to various real-world applications, including human video generation, long-video storytelling, and real-time generation, either zero-shot or through lightweight fine-tuning. Its cost-efficiency in both training and inference marks a significant step toward democratizing access to high-quality video generation technology.
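As one concrete example of what lightweight fine-tuning can look like in practice, a low-rank (LoRA-style) adapter can be attached to selected projections of a pretrained transformer while the original weights stay frozen. The sketch below is a generic illustration of that technique; the report does not prescribe this particular mechanism, and the layer sizes and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (W x + scale * B A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Usage: adapt a single projection layer (stand-in for a pretrained DiT projection).
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, rank=8)
out = adapted(torch.randn(2, 16, 1024))   # (batch, tokens, dim)
```

Only the low-rank factors are trained, so the number of updated parameters, and hence the adaptation cost, stays small relative to the 7-billion-parameter base model.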
Theoretical Contributions and Future Directions
Conceptually, Seaweed-7B challenges prevailing paradigms that emphasize model size over efficiency. The findings suggest that, through careful architectural and training-procedure optimizations, mid-sized models can achieve performance parity with larger counterparts, prompting a reevaluation of current video generation research trajectories.
Looking ahead, future research may focus on refining these design principles to further exploit the latent potential of medium-scale models. There is also scope for investigating the broader application of hybrid models integrating autoregressive and deterministic components across multimodal AI frameworks.
Overall, Seaweed-7B represents a significant stride in video generation research, demonstrating that robust performance can be achieved under constrained compute budgets. Its development not only contributes to current methodologies but also encourages further exploration of cost-effective AI models.