The paper introduces Goku, a family of joint image-and-video generative models based on rectified flow Transformers, targeting industry-grade performance. The paper emphasizes four main components: data curation, model architecture design, flow formulation, and training infrastructure.
The data curation pipeline integrates video and image filtering using aesthetic scores, OCR-driven content analysis, and subjective evaluations. MLLMs (Multimodal LLMs) are used to generate dense captions, refined by an LLM to improve accuracy and fluency. This process resulted in a training dataset of approximately 36M video-text pairs and 160M image-text pairs.
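To make the filtering-and-captioning flow concrete, here is a minimal sketch of one way such a per-clip pass could be organized; the threshold values and the `aesthetic_score`, `ocr_text_ratio`, `mllm_caption`, and `llm_rewrite` callables are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Clip:
    frames: list   # decoded frames (e.g., numpy arrays)
    meta: dict     # source metadata

def curate_clip(
    clip: Clip,
    aesthetic_score: Callable[[Clip], float],   # e.g., a CLIP-based aesthetic predictor
    ocr_text_ratio: Callable[[Clip], float],    # fraction of frame area covered by text
    mllm_caption: Callable[[Clip], str],        # dense caption from a multimodal LLM
    llm_rewrite: Callable[[str], str],          # LLM pass to improve accuracy and fluency
    min_aesthetic: float = 4.5,                 # illustrative threshold
    max_text_ratio: float = 0.3,                # illustrative threshold
) -> Optional[dict]:
    """Return a (clip, caption) training record, or None if the clip is filtered out."""
    if aesthetic_score(clip) < min_aesthetic:
        return None                             # drop visually low-quality clips
    if ocr_text_ratio(clip) > max_text_ratio:
        return None                             # drop text-heavy clips (OCR-driven filter)
    raw_caption = mllm_caption(clip)            # dense caption from the MLLM
    caption = llm_rewrite(raw_caption)          # refine for accuracy and fluency
    return {"clip": clip, "caption": caption, "meta": clip.meta}
```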
The Goku model family uses Transformer architectures with 2B and 8B parameters and employs a 3D joint image-video VAE (Variational Autoencoder) to compress image and video inputs into a shared latent space, over which full attention is applied.
To support large-scale training, the authors developed a robust infrastructure incorporating parallelism strategies to manage memory during long-context training. ByteCheckpoint is used for high-performance checkpointing, and fault-tolerant mechanisms from MegaScale ensure stability across large GPU clusters.
Key aspects of the Goku model are:
- Image-Video Joint VAE: A jointly trained image-video VAE handles both image and video data, encoding a raw video input $x \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$ is the temporal dimension, $H$ the height, and $W$ the width, into the shared latent space. Videos are compressed along the height, width, and temporal dimensions, while images are compressed in the spatial dimensions only (see the shape sketch after this list).
- Transformer Architectures: The Goku Transformer block builds upon GenTron, with a self-attention module, a cross-attention layer that integrates textual conditional embeddings (extracted via the Flan-T5 LLM), an FFN (Feed-Forward Network) for feature projection, and a layer-wise adaLN-Zero block. The model additionally adopts plain full attention, Patch n’ Pack, 3D RoPE (Rotary Position Embedding), and Q-K Normalization (a block-level sketch follows this list).
- Flow-based Training: The formulation is based on the rectified flow (RF) algorithm, in which a sample is progressively transported from a prior distribution to the target data distribution. Given a real data sample $x_1$ from the target distribution and a noise sample $x_0$ from the prior, a training example $x_t$ is constructed by linear interpolation with coefficient $t \in [0, 1]$, $x_t = t\,x_1 + (1 - t)\,x_0$, and the model is trained to predict the velocity $v_t = \mathrm{d}x_t/\mathrm{d}t = x_1 - x_0$ (a training-step sketch follows this list).
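To illustrate the joint VAE bullet above, the following sketch computes latent shapes; the 8×8×4 video stride, the 8×8 image stride, and the 16 latent channels are assumed values for illustration only.

```python
def latent_shape(frames: int, height: int, width: int,
                 stride_t: int = 4, stride_hw: int = 8, latent_ch: int = 16) -> tuple:
    """Latent shape produced by the joint image-video VAE for a clip or image.

    The 8x8x4 video stride (height x width x temporal), the 8x8 image stride,
    and the 16 latent channels are illustrative assumptions of this sketch.
    """
    t = max(1, frames // stride_t)   # temporal compression (a single image keeps t = 1)
    return (t, height // stride_hw, width // stride_hw, latent_ch)

# e.g. a 32-frame 720x1280 clip -> (8, 90, 160, 16); a single 720x1280 image -> (1, 90, 160, 16)
```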
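The Transformer block described above can be sketched roughly as follows; the hidden size, head count, and placement of the adaLN-Zero modulation are assumptions, 3D RoPE and Patch n’ Pack are omitted, and this is not the authors' implementation.

```python
import torch.nn as nn

class GokuStyleBlock(nn.Module):
    """Sketch of a GenTron-style block: self-attention with Q-K normalization,
    cross-attention to Flan-T5 text embeddings, an FFN, and layer-wise adaLN-Zero."""

    def __init__(self, dim: int = 1024, heads: int = 16, text_dim: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q_norm = nn.LayerNorm(dim)   # Q-K Normalization stabilizes attention logits
        self.k_norm = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-Zero: per-block scale/shift/gate from the timestep embedding,
        # zero-initialized so each block starts as an identity mapping.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)
        nn.init.zeros_(self.adaLN[-1].bias)

    def forward(self, x, text, t_emb):
        # x: (B, N, dim) visual tokens; text: (B, M, text_dim); t_emb: (B, dim)
        shift_sa, scale_sa, gate_sa, shift_ff, scale_ff, gate_ff = \
            self.adaLN(t_emb).chunk(6, dim=-1)

        h = self.norm1(x) * (1 + scale_sa.unsqueeze(1)) + shift_sa.unsqueeze(1)
        q, k = self.q_norm(h), self.k_norm(h)   # 3D RoPE would be applied to q/k here
        x = x + gate_sa.unsqueeze(1) * self.self_attn(q, k, h, need_weights=False)[0]

        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]

        h = self.norm3(x) * (1 + scale_ff.unsqueeze(1)) + shift_ff.unsqueeze(1)
        return x + gate_ff.unsqueeze(1) * self.ffn(h)
```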
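A minimal training-step sketch of the rectified-flow objective above; the `model(x_t, t, text)` velocity-prediction interface is an assumption of this sketch, while the interpolation and velocity target follow the standard RF formulation.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, x1, text, t_sampler=torch.rand):
    """One rectified-flow training step on latent samples (sketch)."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                  # noise sample from the prior
    t = t_sampler(b, device=x1.device)         # interpolation coefficients in [0, 1)
    t_ = t.view(b, *([1] * (x1.dim() - 1)))    # broadcast t over the latent dimensions
    x_t = t_ * x1 + (1.0 - t_) * x0            # linear interpolation between noise and data
    v_target = x1 - x0                         # velocity: d x_t / d t
    v_pred = model(x_t, t, text)               # assumed velocity-prediction interface
    return F.mse_loss(v_pred, v_target)
```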
Training proceeds in multiple stages: text-semantic pairing, joint image-and-video learning, and modality-specific fine-tuning, with cascaded resolution training adopted in the second stage.
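As a purely illustrative way to express this schedule, the sketch below lists the three stages as configuration; the stage names follow the description above, but every resolution and data-mix value is a placeholder assumption.

```python
# Illustrative staged-training schedule. Stage names follow the description above;
# resolutions and data mixes are placeholder assumptions, not the paper's settings.
TRAINING_STAGES = [
    {"stage": "text-semantic pairing",            # text-to-image pretraining
     "data": "image-text pairs",
     "resolution": (256, 256)},                   # placeholder
    {"stage": "image-and-video joint learning",   # shared latent space, full attention
     "data": "image-text + video-text pairs",
     "resolution_cascade": [(288, 512), (480, 864), (720, 1280)]},  # placeholder cascade
    {"stage": "modality-specific fine-tuning",    # specialize for T2I or T2V
     "data": "modality-specific high-quality subsets",
     "resolution": (720, 1280)},                  # placeholder
]
```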
The infrastructure optimizations include parallelism strategies such as Sequence Parallelism (SP) and Fully Sharded Data Parallel (FSDP) training, along with fine-grained Activation Checkpointing (AC) and the fault-tolerance mechanisms from MegaScale. ByteCheckpoint is adopted as the checkpointing solution.
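A minimal PyTorch sketch of combining FSDP with fine-grained activation checkpointing at the block level; Sequence Parallelism, MegaScale fault tolerance, and ByteCheckpoint are not shown, and the wrapping policy and checkpointing ratio are assumptions rather than the paper's configuration.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

def shard_and_checkpoint(model: nn.Module, block_cls: type) -> nn.Module:
    """Shard parameters per Transformer block with FSDP and recompute activations
    for a subset of blocks (fine-grained activation checkpointing).

    Assumes torch.distributed has already been initialized (e.g., via torchrun).
    """
    model = FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({block_cls}),  # one FSDP unit per block
        use_orig_params=True,
    )
    blocks = [m for m in model.modules() if isinstance(m, block_cls)]
    selected = set(blocks[::2])   # checkpoint half the blocks: memory vs. recompute trade-off
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=checkpoint_wrapper,
        check_fn=lambda m: m in selected,
    )
    return model
```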
The data curation pipeline consists of image and video collection, video extraction and clipping, image and video filtering, captioning, and data distribution balancing. The training dataset includes 100M public samples from LAION and 60M high-quality, internal samples for text-to-image, and 11M public clips and 25M in-house clips for text-to-video. The video classification model assigns a semantic tag to each video based on four evenly sampled keyframes, categorizing videos into 9 primary classes and 86 subcategories.
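A small sketch of the keyframe-based tagging step; only the "four evenly sampled keyframes" logic comes from the description above, and the `classify` interface is an assumption.

```python
from typing import Callable, Sequence

def tag_video(frames: Sequence, classify: Callable[[list], str]) -> str:
    """Assign a semantic tag (one of the 9 primary classes / 86 subcategories)
    from four evenly sampled keyframes. `classify` is an assumed interface."""
    n = len(frames)
    keyframes = [frames[i * (n - 1) // 3] for i in range(4)]  # indices 0, ~n/3, ~2n/3, n-1
    return classify(keyframes)
```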
In text-to-image generation, Goku-T2I demonstrates strong performance across benchmarks like T2I-CompBench, GenEval, and DPG-Bench. In text-to-video benchmarks, Goku-T2V achieves state-of-the-art performance on the UCF-101 zero-shot generation task, and attains a score of 84.85 on VBench. For image-to-video adaptation, the model uses the first frame of each clip as the reference image, broadcasting and concatenating corresponding image tokens with paired noised video tokens, and introducing a single MLP (Multilayer Perceptron) layer for channel alignment.
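A sketch of the image-to-video conditioning described above; the channel-wise concatenation, token layout, and dimensions are assumptions of this sketch rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class I2VConditioner(nn.Module):
    """Reference-image latent tokens are broadcast over time, concatenated with the
    noised video tokens, and projected back to the model width by a single MLP layer."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.align = nn.Linear(2 * dim, dim)   # single MLP layer for channel alignment

    def forward(self, noised_video: torch.Tensor, ref_image: torch.Tensor) -> torch.Tensor:
        # noised_video: (B, T, N, dim) latent tokens; ref_image: (B, N, dim) first-frame tokens
        ref = ref_image.unsqueeze(1).expand_as(noised_video)   # broadcast over the T frames
        return self.align(torch.cat([noised_video, ref], dim=-1))
```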
Ablation studies show that model scaling helps mitigate distorted object structures, and joint image-and-video training enhances the generation of photorealistic frames.