ContentV: Efficient Training of Video Generation Models with Limited Compute (2506.05343v2)

Published 5 Jun 2025 in cs.CV

Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: https://contentv.github.io.

Summary

  • The paper presents ContentV, a novel text-to-video model that adapts pre-trained image generation architectures with a 3D-VAE for enhanced efficiency.
  • It employs a multi-stage curriculum strategy and reinforcement learning with human feedback to improve spatial semantics and temporal coherence.
  • Achieving an 85.14 VBench score, ContentV demonstrates competitive video quality while operating with limited compute resources.

Overview of ContentV Video Generation Model

The paper "ContentV: Efficient Training of Video Generation Models with Limited Compute" presents ContentV, a novel text-to-video generation model developed to address the formidable computational demands typically associated with video generation. The model's architecture is built with efficiency in mind, demonstrated by its competitive performance achieved within resource-constrained settings using Neural Processing Units (NPUs) rather than GPUs.

ContentV is characterized by three key innovations:

  1. Minimalist Architecture: The model extends a pre-trained image generation model, Stable Diffusion 3.5 Large, to video generation with minimal architectural changes: the 2D Variational Autoencoder (VAE) is replaced with a 3D-VAE capable of handling both image and video inputs, and 3D positional encoding is incorporated to index the longer video token sequences (see the positional-encoding sketch after this list).
  2. Multi-stage Training Strategy: A progressive curriculum accommodates varying video resolutions and durations under a flow matching objective (sketched below). The model initially focuses exclusively on video data, followed by joint video-image optimization to balance spatial semantics and temporal coherence; training progresses from low resolution and short clips to high resolution and longer sequences.
  3. Cost-effective RLHF: Reinforcement learning with human feedback (RLHF) is applied without collecting additional human annotations, refining the model's alignment with human aesthetic preferences to enhance video quality (see the final sketch after this list).

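To make innovation (1) concrete, here is a minimal sketch of a factorized 3D positional encoding that lets an image backbone index video tokens along time as well as height and width. The function names, the sinusoidal formulation, and the equal per-axis split are illustrative assumptions, not details taken from the paper.

```python
import torch

def sincos_1d(length: int, dim: int) -> torch.Tensor:
    """Standard 1D sinusoidal embedding of shape (length, dim); dim must be even."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)          # (L, 1)
    freqs = torch.exp(
        -torch.log(torch.tensor(10000.0))
        * torch.arange(0, dim, 2, dtype=torch.float32) / dim
    )                                                                     # (dim/2,)
    emb = torch.zeros(length, dim)
    emb[:, 0::2] = torch.sin(pos * freqs)
    emb[:, 1::2] = torch.cos(pos * freqs)
    return emb

def pos_embed_3d(t: int, h: int, w: int, dim: int) -> torch.Tensor:
    """Factorized 3D encoding: concatenate per-axis 1D embeddings.

    Returns shape (t * h * w, dim), one vector per video latent token.
    """
    assert dim % 6 == 0, "dim must split into three even per-axis parts"
    d = dim // 3
    et = sincos_1d(t, d).view(t, 1, 1, d).expand(t, h, w, d)  # time axis
    eh = sincos_1d(h, d).view(1, h, 1, d).expand(t, h, w, d)  # height axis
    ew = sincos_1d(w, d).view(1, w, 1, d).expand(t, h, w, d)  # width axis
    return torch.cat([et, eh, ew], dim=-1).reshape(t * h * w, dim)

# Example: 16 latent frames of 32x32 latent patches, 384-dim tokens.
pe = pos_embed_3d(16, 32, 32, 384)
print(pe.shape)  # torch.Size([16384, 384])
```

A learned or rotary variant would slot into the same (t, h, w) factorization; the point is only that the image model's 2D token indexing gains a time axis.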
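Innovation (2) trains this architecture with flow matching. A minimal rectified-flow training step might look like the following, assuming a `model(x_t, t, cond)` that predicts the velocity field on latents produced by the 3D-VAE; the interface is hypothetical.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """One rectified-flow training step on clean video latents.

    x0:   clean latents from the 3D-VAE, shape (B, C, T, H, W)
    cond: text-conditioning embeddings (opaque to this sketch)
    """
    noise = torch.randn_like(x0)                        # endpoint x1 ~ N(0, I)
    t = torch.rand(x0.shape[0], device=x0.device)       # one timestep per sample
    t_ = t.view(-1, 1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * noise                  # linear path from x0 to noise
    target_v = noise - x0                               # constant velocity of that path
    pred_v = model(x_t, t, cond)                        # model predicts velocity
    return F.mse_loss(pred_v, target_v)
```

Under the multi-stage curriculum, the same objective is simply fed batches of increasing resolution and clip length as training progresses.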
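For innovation (3), the paper's exact RLHF objective is not reproduced here; the sketch below instead shows one generic annotation-free recipe, reward-weighted fine-tuning against a pretrained video reward model (e.g., a VideoAlign-style scorer), to illustrate the overall shape of such a loop. `reward_model`, `sample_fn`, and `beta` are all assumed names.

```python
import torch

def reward_weighted_step(model, reward_model, prompts, sample_fn, beta=1.0):
    """One annotation-free preference-tuning step (illustrative, not the paper's).

    sample_fn rolls out the current model and returns generated latents plus the
    text conditioning; reward_model is a pretrained scorer, so no human labels
    are collected.
    """
    x0, cond = sample_fn(model, prompts)                 # generated video latents
    with torch.no_grad():
        r = reward_model(x0, prompts)                    # (B,) scalar rewards
        w = torch.softmax(r / beta, dim=0) * r.numel()   # per-sample weights, mean 1.0
    # Re-apply the flow-matching objective, weighted per sample by reward.
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    x_t = (1.0 - t.view(-1, 1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1, 1) * noise
    pred_v = model(x_t, t, cond)
    per_sample = ((pred_v - (noise - x0)) ** 2).flatten(1).mean(dim=1)  # (B,)
    return (w * per_sample).mean()
```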
Despite its modest size of 8 billion parameters, ContentV achieves a score of 85.14 on VBench, a state-of-the-art result spanning multiple dimensions, including semantic quality and human preference alignment.

Numerical Results and Claims

ContentV achieves competitive performance in a field traditionally dominated by computationally expensive models. Its VBench score of 85.14 ranks among the best reported, outperforming models such as CogVideoX-5B and HunyuanVideo-13B in generating high-quality, diverse video content.

Moreover, evaluations with VBench and VideoAlign measure the model on visual quality (VQ), motion quality (MQ), and text alignment (TA). ContentV shows marked improvements, particularly in TA, which the authors attribute to a refined captioning approach that tightens semantic text-video alignment.

Implications and Future Developments

The innovations introduced with ContentV suggest a viable path toward scalable video generation in environments with limited computational resources. Reduced hardware requirements through NPU utilization could help democratize text-to-video synthesis across various settings, lowering entry barriers for industries focused on content creation.

The minimalist architecture adaptation, parallel training techniques, and RLHF framework collectively suggest potential for broader application to other generative tasks. Future work could pursue further architectural efficiencies and refinements to the reinforcement learning feedback mechanism to continue improving generative quality across modalities.

ContentV contributes to academic and commercial fields by demonstrating it is possible to achieve high-quality video generation without resorting to sprawling infrastructures typically required by models like MovieGen. As interest in video generation grows, the methodologies explored in this paper could catalyze accelerated progress in generative AI, making it more accessible to developers and researchers with computational constraints.
