The paper introduces Goku, a family of joint image-and-video generative models based on rectified flow Transformers, targeting industry-grade performance. The paper emphasizes four main components: data curation, model architecture design, flow formulation, and training infrastructure.
The data curation pipeline integrates video and image filtering using aesthetic scores, OCR-driven content analysis, and subjective evaluations. MLLMs (Multimodal LLMs) are used to generate dense captions, refined by an LLM to improve accuracy and fluency. This process resulted in a training dataset of approximately 36M video-text pairs and 160M image-text pairs.
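To make the filtering-and-captioning flow concrete, here is a minimal sketch of one way such a per-clip pass could be organized; the threshold values and the `aesthetic_score`, `ocr_text_ratio`, `mllm_caption`, and `llm_rewrite` callables are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Clip:
    frames: list   # decoded frames (e.g., numpy arrays)
    meta: dict     # source metadata

def curate_clip(
    clip: Clip,
    aesthetic_score: Callable[[Clip], float],   # e.g., a CLIP-based aesthetic predictor
    ocr_text_ratio: Callable[[Clip], float],    # fraction of frame area covered by text
    mllm_caption: Callable[[Clip], str],        # dense caption from a multimodal LLM
    llm_rewrite: Callable[[str], str],          # LLM pass to improve accuracy and fluency
    min_aesthetic: float = 4.5,                 # illustrative threshold
    max_text_ratio: float = 0.3,                # illustrative threshold
) -> Optional[dict]:
    """Return a (clip, caption) training record, or None if the clip is filtered out."""
    if aesthetic_score(clip) < min_aesthetic:
        return None                             # drop visually low-quality clips
    if ocr_text_ratio(clip) > max_text_ratio:
        return None                             # drop text-heavy clips (OCR-driven filter)
    raw_caption = mllm_caption(clip)            # dense caption from the MLLM
    caption = llm_rewrite(raw_caption)          # refine for accuracy and fluency
    return {"clip": clip, "caption": caption, "meta": clip.meta}
```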
The Goku model family uses Transformer architectures with 2B and 8B parameters and employs a 3D joint image-video VAE (Variational Autoencoder) to compress image and video inputs into a shared latent space, over which full attention is applied.
To support large-scale training, the authors developed a robust infrastructure incorporating parallelism strategies to manage memory during long-context training. ByteCheckpoint is used for high-performance checkpointing, and fault-tolerant mechanisms from MegaScale ensure stability across large GPU clusters.
Key aspects of the Goku model are:
- Image-Video Joint VAE: A jointly trained image-video VAE handles both image and video data, encoding a raw video input $x \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$ is the temporal dimension, $H$ the height, and $W$ the width, into the shared latent space. Videos are compressed along the height, width, and temporal dimensions, while images are compressed in the spatial dimensions only (see the shape sketch after this list).
- Transformer Architectures: The Goku Transformer block builds upon GenTron, with a self-attention module, a cross-attention layer that integrates textual conditional embeddings (extracted via the Flan-T5 LLM), an FFN (Feed-Forward Network) for feature projection, and a layer-wise adaLN-Zero block. The model additionally adopts plain full attention, Patch n’ Pack, 3D RoPE (Rotary Position Embedding), and Q-K Normalization (a block-level sketch follows this list).
- Flow-based Training: The formulation is based on the rectified flow (RF) algorithm, in which a sample is progressively transported from a prior distribution to the target data distribution. Given a real data sample $x_1$ from the target distribution and a noise sample $x_0$ from the prior, a training example $x_t$ is constructed by linear interpolation with coefficient $t \in [0, 1]$, $x_t = t\,x_1 + (1 - t)\,x_0$, and the model is trained to predict the velocity $v_t = \mathrm{d}x_t/\mathrm{d}t = x_1 - x_0$ (a training-step sketch follows this list).
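To illustrate the joint VAE bullet above, the following sketch computes latent shapes; the 8×8×4 video stride, the 8×8 image stride, and the 16 latent channels are assumed values for illustration only.

```python
def latent_shape(frames: int, height: int, width: int,
                 stride_t: int = 4, stride_hw: int = 8, latent_ch: int = 16) -> tuple:
    """Latent shape produced by the joint image-video VAE for a clip or image.

    The 8x8x4 video stride (height x width x temporal), the 8x8 image stride,
    and the 16 latent channels are illustrative assumptions of this sketch.
    """
    t = max(1, frames // stride_t)   # temporal compression (a single image keeps t = 1)
    return (t, height // stride_hw, width // stride_hw, latent_ch)

# e.g. a 32-frame 720x1280 clip -> (8, 90, 160, 16); a single 720x1280 image -> (1, 90, 160, 16)
```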
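The Transformer block described above can be sketched roughly as follows; the hidden size, head count, and placement of the adaLN-Zero modulation are assumptions, 3D RoPE and Patch n’ Pack are omitted, and this is not the authors' implementation.

```python
import torch.nn as nn

class GokuStyleBlock(nn.Module):
    """Sketch of a GenTron-style block: self-attention with Q-K normalization,
    cross-attention to Flan-T5 text embeddings, an FFN, and layer-wise adaLN-Zero."""

    def __init__(self, dim: int = 1024, heads: int = 16, text_dim: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_norm = nn.LayerNorm(dim)   # Q-K Normalization stabilizes attention logits
        self.k_norm = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-Zero: per-block scale/shift/gate from the timestep embedding,
        # zero-initialized so each block starts as an identity mapping.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x, text, t_emb):
        # x: (B, N, dim) visual tokens; text: (B, M, text_dim); t_emb: (B, dim)
        shift_sa, scale_sa, gate_sa, shift_ff, scale_ff, gate_ff = \
            self.adaLN(t_emb).chunk(6, dim=-1)

        h = self.norm1(x) * (1 + scale_sa.unsqueeze(1)) + shift_sa.unsqueeze(1)
        q, k = self.q_norm(h), self.k_norm(h)   # 3D RoPE would be applied to q/k here
        x = x + gate_sa.unsqueeze(1) * self.self_attn(q, k, h, need_weights=False)[0]

        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]

        h = self.norm3(x) * (1 + scale_ff.unsqueeze(1)) + shift_ff.unsqueeze(1)
        return x + gate_ff.unsqueeze(1) * self.ffn(h)
```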
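A minimal training-step sketch of the rectified-flow objective above; the `model(x_t, t, text)` velocity-prediction interface is an assumption of this sketch, while the interpolation and velocity target follow the standard RF formulation.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, x1, text, t_sampler=torch.rand):
    """One rectified-flow training step on latent samples (sketch)."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # noise sample from the prior
    t = t_sampler(b, device=x1.device)         # interpolation coefficients in [0, 1)
    t_ = t.view(b, *([1] * (x1.dim() - 1)))    # broadcast t over the latent dimensions
    x_t = t_ * x1 + (1.0 - t_) * x0            # linear interpolation between noise and data
    v_target = x1 - x0                         # velocity: d x_t / d t
    v_pred = model(x_t, t, text)               # assumed velocity-prediction interface
    return F.mse_loss(v_pred, v_target)
```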
Training proceeds in multiple stages: text-semantic pairing, joint image-and-video learning, and modality-specific fine-tuning, with cascaded resolution training adopted in the second stage.
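As a purely illustrative way to express this schedule, the sketch below lists the three stages as configuration; the stage names follow the description above, but every resolution and data-mix value is a placeholder assumption.

```python
# Illustrative staged-training schedule. Stage names follow the description above;
# resolutions and data mixes are placeholder assumptions, not the paper's settings.
TRAINING_STAGES = [
    {"stage": "text-semantic pairing",            # text-to-image pretraining
     "data": "image-text pairs",
     "resolution": (256, 256)},                   # placeholder
    {"stage": "image-and-video joint learning",   # shared latent space, full attention
     "data": "image-text + video-text pairs",
     "resolution_cascade": [(288, 512), (480, 864), (720, 1280)]},  # placeholder cascade
    {"stage": "modality-specific fine-tuning",    # specialize for T2I or T2V
     "data": "modality-specific high-quality subsets",
     "resolution": (720, 1280)},                  # placeholder
]
```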
The infrastructure optimizations include parallelism strategies such as Sequence Parallelism (SP) and Fully Sharded Data Parallel (FSDP) training, along with fine-grained Activation Checkpointing (AC) and the fault-tolerance mechanisms from MegaScale. ByteCheckpoint is adopted as the checkpointing solution.
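A minimal PyTorch sketch of combining FSDP with fine-grained activation checkpointing at the block level; Sequence Parallelism, MegaScale fault tolerance, and ByteCheckpoint are not shown, and the wrapping policy and checkpointing ratio are assumptions rather than the paper's configuration.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

def shard_and_checkpoint(model: nn.Module, block_cls: type) -> nn.Module:
    """Shard parameters per Transformer block with FSDP and recompute activations
    for a subset of blocks (fine-grained activation checkpointing).

    Assumes torch.distributed has already been initialized (e.g., via torchrun).
    """
    model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({block_cls}),  # one FSDP unit per block
        use_orig_params=True,
    )
    blocks = [m for m in model.modules() if isinstance(m, block_cls)]
    selected = set(blocks[::2])   # checkpoint half the blocks: memory vs. recompute trade-off
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: m in selected,
    )
    return model
```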
The data curation pipeline consists of image and video collection, video extraction and clipping, image and video filtering, captioning, and data distribution balancing. The training dataset includes 100M public samples from LAION and 60M high-quality, internal samples for text-to-image, and 11M public clips and 25M in-house clips for text-to-video. The video classification model assigns a semantic tag to each video based on four evenly sampled keyframes, categorizing videos into 9 primary classes and 86 subcategories.
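A small sketch of the keyframe-based tagging step; only the "four evenly sampled keyframes" logic comes from the description above, and the `classify` interface is an assumption.

```python
from typing import Callable, Sequence

def tag_video(frames: Sequence, classify: Callable[[list], str]) -> str:
    """Assign a semantic tag (one of the 9 primary classes / 86 subcategories)
    from four evenly sampled keyframes. `classify` is an assumed interface."""
    n = len(frames)
    keyframes = [frames[i * (n - 1) // 3] for i in range(4)]  # indices 0, ~n/3, ~2n/3, n-1
    return classify(keyframes)
```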
In text-to-image generation, Goku-T2I demonstrates strong performance across benchmarks like T2I-CompBench, GenEval, and DPG-Bench. In text-to-video benchmarks, Goku-T2V achieves state-of-the-art performance on the UCF-101 zero-shot generation task, and attains a score of 84.85 on VBench. For image-to-video adaptation, the model uses the first frame of each clip as the reference image, broadcasting and concatenating corresponding image tokens with paired noised video tokens, and introducing a single MLP (Multilayer Perceptron) layer for channel alignment.
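A sketch of the image-to-video conditioning described above; the channel-wise concatenation, token layout, and dimensions are assumptions of this sketch rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class I2VConditioner(nn.Module):
    """Reference-image latent tokens are broadcast over time, concatenated with the
    noised video tokens, and projected back to the model width by a single MLP layer."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.align = nn.Linear(2 * dim, dim)   # single MLP layer for channel alignment

    def forward(self, noised_video: torch.Tensor, ref_image: torch.Tensor) -> torch.Tensor:
        # noised_video: (B, T, N, dim) latent tokens; ref_image: (B, N, dim) first-frame tokens
        ref = ref_image.unsqueeze(1).expand_as(noised_video)   # broadcast over the T frames
        return self.align(torch.cat([noised_video, ref], dim=-1))
```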
Ablation studies show that model scaling helps mitigate distorted object structures, and joint image-and-video training enhances the generation of photorealistic frames.