Cosmos Tokenizer: Scalable Video Encoding

Updated 1 January 2026

Cosmos Tokenizer is a scalable, spatio-temporal encoding system that converts raw RGB video into compressed latent tokens for efficient downstream processing.
It employs a causal autoencoder with Haar wavelet transforms, factorized convolutions, and attention mechanisms to preserve spatio-temporal context.
Its dual continuous and discrete token streams enable efficient training, inference, and compression in world foundation models and token-native pipelines.

The Cosmos Tokenizer is a scalable, spatio-temporal video encoding system designed to convert raw RGB video frames into compressed sequences of latent tokens usable by downstream world foundation models. Positioned as the entry point within both the Cosmos World Foundation Model Platform and token-native video compression pipelines, the tokenizer enables efficient model training and inference by trading minimal loss in reconstruction quality for dramatic gains in computational throughput and model scalability. The system’s core employs a causal autoencoder backbone with factorized convolutions and attention, Haar wavelet transforms, adaptive normalization, and quantization schemes to yield both continuous and discrete token streams. These representations are foundational for large-scale world modeling and semantics-aware video compression (NVIDIA et al., 7 Jan 2025, Zhou et al., 22 Apr 2025).

1. Encoder-Decoder Architecture and Causal Tokenization

The Cosmos Tokenizer utilizes an autoencoder backbone to compress video into token grids. Input frames $x_{0:T}\in\mathbb{R}^{(T+1)\times H\times W\times 3}$ first undergo a two-level 3D Haar wavelet decomposition, followed by cascaded, causal 3D convolutions—2D spatial $(1\times k\times k)$ and temporal $(k\times 1\times 1)$ with left padding—alongside causal, factorized self-attention blocks alternating over space and time. Adaptive LayerNorm + LoRA (AdaLN-LoRA) residuals are employed for normalization and efficient parameter adaptation. Downsampling factors $(s_{HW}, s_T)$ reduce spatial and temporal grid sizes, so that $H'=H/s_{HW}$ , $W'=W/s_{HW}$ , and $T'=T/s_T$ . The decoder mirrors the encoder, integrating upsampling blocks to reconstruct video from the compressed tokens.

Causality is enforced in attention masks, restricting tokens to attend only to past or contextually “visible” neighbors. The architecture is compatible with both single-frame images and multi-frame video, permitting flexible deployment in image and video modeling workflows (NVIDIA et al., 7 Jan 2025, Zhou et al., 22 Apr 2025).

2. Continuous and Discrete Token Streams

Token outputs are bifurcated into continuous and discrete representations.

Continuous tokens, used by diffusion world foundation models (WFMs) and TVC pipelines, are direct encoder outputs $z_{0:T'}\in\mathbb{R}^{(T'+1)\times H'\times W'\times C}$ , with $C=16$ (Cosmos WFMs) or $d_C\approx512$ (TVC). For transmission, these are further quantized: $\hat y^C = \mathrm{round}(y^C / s) \times s$ , where $s$ is the quantization step.
Discrete tokens, as consumed by autoregressive WFMs, are obtained via Finite-Scalar-Quantization (FSQ) over a 6-dimensional latent. The FSQ scheme, with partitioning levels $(8,8,8,5,5,5)$ , yields a vocabulary $|V|=64{,}000$ codes. Quantization is defined:

$q_{i,j,k,\tau} = \argmin_{c\in V} \|z_{\tau,i,j} - e_c\|_2,$

using a learnable codebook $\{e_c\}_{c\in V}$ (NVIDIA et al., 7 Jan 2025). In TVC, FSQ codebooks have typical sizes of $K=4096\text{–}8192$ , with exponential-moving-average (EMA) updates per token assignment (Zhou et al., 22 Apr 2025).

These tokenizations enable downstream models to process video not in raw pixel space but in compact token grids, dramatically reducing memory and computational requirements.

3. Tokenizer Losses and Optimization

Training involves multi-stage objectives. The first stage employs reconstruction loss $\mathcal{L}_1 = \sum_{t=0}^T \|\hat x_t - x_t\|_1$ and perceptual loss computed over VGG features:

$\mathcal{L}_\mathrm{perc} = \frac{1}{L} \sum_{l=1}^L \sum_t \alpha_l \|\mathrm{VGG}_l(\hat x_t) - \mathrm{VGG}_l(x_t)\|_1$

Temporal smoothness is enforced via optical flow matching:

$\mathcal{L}_\mathrm{flow} = \frac{1}{T} \sum_{t=1}^T \|\mathrm{OF}(\hat x_t, \hat x_{t-1}) - \mathrm{OF}(x_t, x_{t-1})\|_1$

and sharpness via Gram matrix losses:

$\mathcal{L}_\mathrm{gram} = \frac{1}{L} \sum_{l=1}^L \sum_{t=0}^T \alpha_l \|GM_l(\hat x_t) - GM_l(x_t)\|_1$

FSQ quantizer commitment loss and codebook update are present:

$\mathcal{L}_\mathrm{FSQ} = \|\mathrm{sg}[f^D] - e\|_2^2 + \beta \|f^D - \mathrm{sg}[e]\|_2^2$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operator, and $\beta \approx 1$ (Zhou et al., 22 Apr 2025).

4. Context Modeling for Token Compression

Contextual dependencies are preserved by checkerboard context models for both discrete and continuous streams.

Discrete CCM-D: The $3$D token grid is split into interleaved subsets (even/odd by $(t+h+w)\%2$ ), with visible tokens $A$ encoded first, then masked tokens $B$ conditioned on $A$ and prior codes. Probability modeling uses small $3$D CNNs under a $3\times3\times3$ causal/checkerboard mask.

$p(\mathrm{idx}) = \prod_{i\in A} p(\mathrm{idx}_i \mid \mathrm{idx}_{<i}) \times \prod_{j\in B} p(\mathrm{idx}_j \mid \mathrm{idx}_{<j}, \mathrm{idx}_A)$

Continuous CCM-C: Latents are entropy-coded as a factorized product of Gaussians with hyperprior $z$ :

$p(\hat y^C \mid z) = \prod_i \mathcal{N}(\hat y^C_i; \mu_i(z), \sigma_i^2(z))$

and the expected bit-cost reflects the negative log-likelihood under this coding. Bitrate targets range from $0.008$ to $0.02$ bpp in TVC (Zhou et al., 22 Apr 2025).

5. Integration with World Foundation Models and Applications

Cosmos tokenization underpins several model classes:

Diffusion WFMs: Continuous tokens are patchified and input to DiT-style U-Transformers, which denoise latents for reconstruction. Applications include camera control scenarios, where tokens are generated conditioned on navigation trajectories (NVIDIA et al., 7 Jan 2025).
Autoregressive WFMs: Discrete tokens are embedded with rotary positional encoding, then decoded by GPT-style architectures for predictive or generative tasks. Robotic manipulation models utilize these for “what-if” video rollouts conditioned on text or action vectors (NVIDIA et al., 7 Jan 2025).
Tokenized Video Compression (TVC): TVC leverages dual token streams for ultra-low bitrate video coding. The Cosmos tokenizer’s outputs enable strategic masking and prediction (with up to $75\%$ masked discrete tokens), lossless entropy coding, and multi-scale fusion through ControlNet for perceptual fidelity. Ablation studies reveal sensitivity to context models, cross-stream attention, and temporal modeling (Zhou et al., 22 Apr 2025).

Downstream deployments encompass autonomous driving model training (using multi-view tokens), large-scale simulation, and Physical AI research. Tokenizer throughput on an A100 80 GB GPU ranges from $\sim35$ ms per frame (CV configuration) to $\sim64$ ms per image (DI configuration), with vocabularies up to $64{,}000$ tokens (NVIDIA et al., 7 Jan 2025).

6. Positional Encoding and Spatio-Temporal Representations

Cosmos integrates 3D-factorized Rotary Positional Embedding (RoPE) for precise modeling of tokens' spatial and temporal locations, supplemented by learnable absolute position embeddings per block. In TVC, positional embeddings are segregated for space and time segments, ensuring causal attention flows exclusively to “known” context. This structure facilitates fusion of multi-view observations and supports both video and high-resolution image compression (NVIDIA et al., 7 Jan 2025, Zhou et al., 22 Apr 2025).

7. Empirical Outcomes and Impact

The compression provided by Cosmos Tokenizer makes training and inference on datasets exceeding $10^8$ hours of video feasible for world foundation model pipelines. Quantitative benchmarks include:

CV4×8×8: PSNR $\approx32.80$ dB, SSIM $\approx0.900$ on DAVIS, rFVD $\approx15.9$ for $49$ frames.
DV4×8×8: PSNR $\approx28.81$ dB, SSIM $\approx0.818$ , rFVD $\approx37.4$ for $17$ frames (NVIDIA et al., 7 Jan 2025).

In autonomous driving, Cosmos tokenization yields competitive FID and FVD metrics, and multi-view Sampson error reductions to $\sim0.68$ pixel units from $1.24$ baseline (NVIDIA et al., 7 Jan 2025). TVC demonstrates perceptual robustness at ultra-low bitrates, with LPIPS ablation showing increased artifact sensitivity when disabling context or fusion modules (Zhou et al., 22 Apr 2025).

A plausible implication is that token-native architectures, as enabled by Cosmos, will continue to facilitate scalable, semantics-aware modeling and compression, supporting large-scale Physical AI research and deployment.

Table: Token Stream Characteristics (Cosmos and TVC)

Token Stream	Dimensionality/Config	Compression Role
Continuous (Cosmos, TVC)	$z_{0:T'}\in\mathbb{R}^{(T'+1)\times H'\times W'\times C}$	Captures fine spatial detail, denoised via diffusion U-Transformers
Discrete (FSQ)	Code indices; $K=4096\text{–}64,000$	Encodes semantic/structural info; used in AR models, checkerboard context coding
Masked Discrete (TVC)	Up to $75\%$ masked	Enables context prediction and compression efficiency

Continuous and discrete token streams serve complementary purposes—precision in semantic structure versus detail in appearance—thereby supporting model scale and fidelity in world modeling and tokenized video compression contexts (NVIDIA et al., 7 Jan 2025, Zhou et al., 22 Apr 2025).

PDF Markdown Chat (Pro)

References (2)

Cosmos World Foundation Model Platform for Physical AI (2025)

TVC: Tokenized Video Compression with Ultra-Low Bitrate (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to Cosmos Tokenizer.