Cosmos-Predict2-2B Video2World Modeling

Updated 4 June 2026

Cosmos-Predict2-2B-Video2World is a framework that enables large-scale, self-supervised video prediction using V-JEPA 2, focusing on robust spatiotemporal representation learning.
It employs a masked latent-prediction objective combined with frozen encoder protocols and lightweight downstream heads to tackle tasks in recognition, anticipation, and dense prediction.
Empirical results demonstrate state-of-the-art performance in action recognition, physical prediction, and robustness metrics, highlighting its strong domain transfer capabilities.

Cosmos-Predict2-2B-Video2World is a research term used to denote large-scale self-supervised video models that leverage the V-JEPA 2 architecture—especially in its 2B-parameter regime—for data-centric predictive world modeling in video domains. Typical instantiations involve a frozen, foundation-scale V-JEPA 2.1 or V-JEPA 2 backbone pretrained on internet-scale or domain-specific video, with lightweight attentive probes or classification heads trained for recognition, anticipation, or dense prediction tasks. The resulting systems exhibit state-of-the-art spatiotemporal representation learning, robust generalization, and world modeling capabilities across action recognition, physical prediction, and foundation model evaluation protocols.

1. Architectural Foundation: V-JEPA 2/2.1 Video Modeling at Scale

V-JEPA 2 and V-JEPA 2.1 are Vision Transformer (ViT)–based architectures with explicit joint embedding predictive objectives in latent space. The input is tokenized into spatiotemporal tubelets (e.g., 2×16×16 in frames × height × width) and linearly embedded to dimension $D$ (e.g., 1408 for ViT-g, 1024 for ViT-L). Each input token receives either 3D rotary positional encoding (RoPE) or sine–cosine positional embeddings, and, in V-JEPA 2.1, learnable modality tokens for multi-modal functionality (Mur-Labadia et al., 15 Mar 2026).
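To make the tubelet tokenization concrete, the following minimal sketch counts the tokens produced for one clip. The function names and the 16-frame, 256×256 clip size are illustrative assumptions, not values taken from the cited papers; only the 2×16×16 tubelet shape comes from the text above.

```python
def tubelet_grid(frames, height, width, tubelet=(2, 16, 16)):
    """Return the (time, rows, cols) token grid for a clip split into tubelets."""
    t, h, w = tubelet
    if frames % t or height % h or width % w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return frames // t, height // h, width // w


def token_count(frames, height, width, tubelet=(2, 16, 16)):
    """Total number of tokens the encoder would see before any masking."""
    grid_t, grid_h, grid_w = tubelet_grid(frames, height, width, tubelet)
    return grid_t * grid_h * grid_w


# Hypothetical clip: 16 frames at 256x256 resolution (assumed for illustration).
print(tubelet_grid(16, 256, 256))  # (8, 16, 16)
print(token_count(16, 256, 256))   # 2048
```

Each of these tokens is then projected to a $D$-dimensional embedding, so the sequence length grows linearly with clip duration and quadratically with spatial resolution.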

The encoder $\mathrm{E}_\theta$ is a stack of transformer layers ( $L=$ 24–48) yielding latent patch representations. During pretraining, a high-ratio spatiotemporal block mask $M$ hides large fractions of input tokens. Only visible context tokens are processed by the "x-encoder"; the "y-encoder" (EMA copy) encodes the ground-truth full input.
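A high-ratio spatiotemporal block mask can be sketched as follows. This is a simplified illustration under stated assumptions: the block extent, the 0.75 target ratio, and the sampling loop are hypothetical choices, and the actual masking strategy in the cited papers may differ.

```python
import random


def sample_block_mask(grid, target_ratio=0.75, block=(2, 4, 4), seed=0):
    """Mask random spatiotemporal blocks until at least target_ratio of tokens are hidden.

    grid  : (time, rows, cols) token grid
    block : (time, rows, cols) extent of each masked block (illustrative assumption)
    Returns the set of masked (t, r, c) token coordinates.
    """
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must lie strictly between 0 and 1")
    rng = random.Random(seed)
    grid_t, grid_h, grid_w = grid
    # Clip the block so it always fits inside the grid.
    blk_t, blk_h, blk_w = (min(b, g) for b, g in zip(block, grid))
    total = grid_t * grid_h * grid_w
    masked = set()
    while len(masked) / total < target_ratio:
        t0 = rng.randrange(grid_t - blk_t + 1)
        r0 = rng.randrange(grid_h - blk_h + 1)
        c0 = rng.randrange(grid_w - blk_w + 1)
        for t in range(t0, t0 + blk_t):
            for r in range(r0, r0 + blk_h):
                for c in range(c0, c0 + blk_w):
                    masked.add((t, r, c))
    return masked


grid = (8, 16, 16)
masked = sample_block_mask(grid)
all_tokens = {(t, r, c) for t in range(8) for r in range(16) for c in range(16)}
context = all_tokens - masked  # only these tokens would reach the context encoder
print(len(masked) / len(all_tokens) >= 0.75)  # True
```

Because only the `context` set is passed through the context encoder, a higher mask ratio directly reduces the encoder's sequence length during pretraining.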

A transformer-based mask denoiser $P_\phi$ or multi-layer MLP receives (i) unmasked context token embeddings and (ii) learned mask tokens for spatially/temporally occluded locations and predicts the EMA encoder's output at masked positions. Notably, the predictive loss operates in encoder representation space (not pixels), of the form:

$\mathcal{L}_{\text{JEPA}} = \left\| P_\phi(\Delta,\,E_\theta(\text{masked}\;x)) - \mathrm{sg}\left[E_{\bar{\theta}}(x)\right] \right\|_1$

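where $\Delta$ denotes the learned mask tokens that mark the occluded positions, $E_{\bar{\theta}}$ is the EMA copy of the encoder, and $\mathrm{sg}[\cdot]$ is stop-gradient (Assran et al., 11 Jun 2025).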
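As a small numerical illustration of a latent-space L1 objective, the sketch below averages the L1 distance over masked tokens. The function name and the toy feature values are assumptions; in a real training loop the targets would come from the EMA encoder and an autodiff framework would handle stop-gradient.

```python
def l1_latent_loss(predicted, target, masked_positions):
    """Mean L1 distance between predictor outputs and target latents at masked tokens.

    predicted, target : lists of per-token feature vectors (lists of floats)
    masked_positions  : indices of tokens hidden from the context encoder
    The targets are plain constants here, which stands in for stop-gradient.
    """
    if not masked_positions:
        raise ValueError("need at least one masked position")
    total = 0.0
    for i in masked_positions:
        total += sum(abs(p - t) for p, t in zip(predicted[i], target[i]))
    return total / len(masked_positions)


# Toy example with three tokens of dimension 2; token 0 is visible and ignored.
predicted = [[0.0, 0.0], [1.0, 2.0], [0.5, 0.5]]
target = [[9.0, 9.0], [1.5, 1.0], [0.5, 1.5]]
print(l1_latent_loss(predicted, target, masked_positions=[1, 2]))  # 1.25
```

The loss never touches pixel values: both arguments live in the encoder's representation space, which is the defining property of the objective above.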

2. Pretraining Objectives and Scaling Principles

Cosmos-Predict2-2B-Video2World–style systems universally adopt a masked latent-prediction objective. V-JEPA 2.1 generalizes this to a "dense predictive loss" acting on both masked and visible tokens, with hierarchical deep supervision by applying losses at multiple intermediate encoder layers. The prediction head is trained to reconstruct encoder targets at every position, weighted to focus supervision near masked tokens:

$\mathcal{L}_\text{predict} = \frac{1}{|M|}\sum_{i\in M} \| P_\phi(E_\theta(x),\Delta_y)_i - \mathrm{sg}(E_{\bar{\theta}}(y)_i) \|_1$

$\mathcal{L}_\text{ctx} = \frac{1}{|C|}\sum_{i\in C} \lambda_i \| P_\phi(E_\theta(x),\Delta_y)_i - \mathrm{sg}(E_{\bar{\theta}}(y)_i) \|_1$

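where $M$ is the set of masked positions, $C$ is the set of visible context positions, and $\lambda_i$ is set adaptively from the spatial–temporal distance between token $i$ and the nearest masked token (Mur-Labadia et al., 15 Mar 2026).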
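The two terms can be sketched numerically as below. The exponential weighting $\lambda_i = \exp(-d_i/\tau)$, the Euclidean distance on the token grid, and the temperature `tau` are assumptions made for illustration only; the source states that $\lambda_i$ depends on distance to the nearest mask but does not give its functional form here.

```python
import math


def context_weights(context, masked, tau=2.0):
    """Assumed weighting: lambda_i = exp(-d_i / tau), d_i = distance to nearest masked token."""
    return {
        i: math.exp(-min(math.dist(i, m) for m in masked) / tau)
        for i in context
    }


def dense_losses(pred, target, masked, context, tau=2.0):
    """Return (L_predict, L_ctx) for features keyed by (t, r, c) token coordinates."""
    def l1(i):
        return sum(abs(p - q) for p, q in zip(pred[i], target[i]))

    lam = context_weights(context, masked, tau)
    l_predict = sum(l1(i) for i in masked) / len(masked)
    l_ctx = sum(lam[i] * l1(i) for i in context) / len(context)
    return l_predict, l_ctx


# Toy 1x1x4 token row: the last token is masked, the first three are visible.
masked = [(0, 0, 3)]
context = [(0, 0, 0), (0, 0, 1), (0, 0, 2)]
pred = {i: [0.0] for i in masked + context}
target = {i: [1.0] for i in masked + context}

weights = context_weights(context, masked)
l_predict, l_ctx = dense_losses(pred, target, masked, context)
print(l_predict)  # 1.0
print(weights[(0, 0, 2)] > weights[(0, 0, 0)])  # True: tokens nearer the mask weigh more
```

The two terms are returned separately because how they are combined, and whether they are repeated at intermediate layers, is a design decision of the cited method that this sketch does not attempt to reproduce.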

Pretraining datasets often exceed 100 million images and millions of video clips, as in "VisionMix163M" (142M images, 19M videos, curated by relevance filters and zero-shot detection) (Mur-Labadia et al., 15 Mar 2026), or are highly domain-specific (e.g., PriVi’s primate corpus with 424 h of video (Mueller et al., 12 Nov 2025)). Models scale up to ViT-G (2B parameters), and pretraining uses batch sizes of up to 3k with AdamW and cosine learning rate schedules (Mur-Labadia et al., 15 Mar 2026, Assran et al., 11 Jun 2025).

3. Frozen Encoder Probing and Downstream Classification Heads

Cosmos-Predict2-2B-Video2World pipelines universally employ frozen encoder protocols. After self-supervised pretraining, the backbone weights are locked, and only a lightweight classification head is trained for downstream tasks.

Head designs include:

Attentive probe: stacks of transformer blocks (typically 2–4) with a final cross-attention or learnable query pooling token, followed by a linear classifier (Assran et al., 11 Jun 2025, Mur-Labadia et al., 15 Mar 2026, Alrasheed et al., 15 May 2026).
Class tokens: learned per-class tokens that attend to all patch tokens in several self-attention layers, as used in domain-specific classification (e.g., primate action recognition in PriVi) (Mueller et al., 12 Nov 2025).
Non-parametric protocols: direct use of encoder-derived [CLS] tokens as feature vectors for k-NN classification or clustering, especially in ablation and representation analyses (Kodathala et al., 25 Sep 2025).
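The core of an attentive probe, pooling frozen patch tokens with a single query and then applying a linear classifier, can be sketched as follows. This is a deliberately reduced illustration: it omits the key/value projections, multi-head structure, and stacked transformer blocks of a full probe, and all names and toy values are assumptions.

```python
import math


def attentive_pool(tokens, query):
    """Pool patch tokens with one query vector via scaled dot-product attention."""
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    peak = max(scores)
    exp_scores = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    norm = sum(exp_scores)
    weights = [e / norm for e in exp_scores]
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


def linear_classifier(features, weight, bias):
    """Map the pooled feature vector to one logit per class."""
    return [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weight, bias)]


# Frozen encoder outputs for three tokens (toy values); only the query,
# weight, and bias would be trained in a frozen-backbone protocol.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives uniform attention weights of 1/3
pooled, weights = attentive_pool(tokens, query)
logits = linear_classifier(pooled, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
print(weights, logits)
```

During probe training the encoder outputs are treated as fixed inputs, so gradients flow only into the small set of probe parameters.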

Losses are either softmax cross-entropy (single label) or sigmoid + per-class binary cross-entropy (multi-label).
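The difference between the two loss choices can be seen in a minimal, framework-free sketch. The function names are illustrative; both functions use standard numerically stable formulations.

```python
import math


def softmax_cross_entropy(logits, label):
    """Single-label loss: negative log-probability of the true class."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def sigmoid_bce(logits, targets):
    """Multi-label loss: mean per-class binary cross-entropy on sigmoid outputs.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Single-label: exactly one class index is correct.
print(softmax_cross_entropy([2.0, 0.5, -1.0], label=0))
# Multi-label: each class has its own independent 0/1 target.
print(sigmoid_bce([2.0, 0.5, -1.0], targets=[1.0, 1.0, 0.0]))
```

Softmax couples the classes through a shared normalizer, whereas the sigmoid form scores each class independently, which is why it suits clips with several simultaneous labels.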

4. Implementation, Training Pipelines, and Hyperparameters

Typical end-to-end workflow:

Data curation: Build a large-scale video corpus relevant to the deployment domain. Use CLIP-based filtering, shot detection, and zero-shot object detectors to select in-domain frames (Mueller et al., 12 Nov 2025).
Self-supervised pretraining: Apply masked latent-prediction objectives for 8–100 epochs over the video corpus with large batch sizes. Standard hyperparameters: AdamW, a learning rate in the range spanned by the PriVi and V-JEPA 2.1 settings, consistent weight decay, and EMA teacher momentum 0.99925. Mask ratios are typically 0.6–0.9 (Mur-Labadia et al., 15 Mar 2026, Mueller et al., 12 Nov 2025, Li et al., 29 Sep 2025).
Probe/Head training: With all encoder parameters frozen, train classification heads using AdamW with learning-rate warmup followed by cosine decay. Data augmentation follows recent best practices: random resized crop, color jitter, horizontal flip, and temporal jitter (Assran et al., 11 Jun 2025, Mueller et al., 12 Nov 2025).
Inference: Average predictions over multiple spatial crops or temporal views (Mueller et al., 12 Nov 2025).
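Two recurring ingredients of the workflow above, a warmup-plus-cosine learning rate schedule and the EMA teacher update, can be sketched as follows. The linear warmup shape, the step counts, and the peak learning rate of 1.0 are placeholders chosen for illustration and are not values from the cited papers; only the momentum 0.99925 comes from the text.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def ema_update(teacher, student, momentum=0.99925):
    """Move each teacher parameter a small step toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Placeholder schedule: 100 steps, 10 warmup steps, peak learning rate 1.0.
schedule = [warmup_cosine_lr(s, 100, 10, peak_lr=1.0) for s in range(101)]
print(schedule[0], schedule[9], schedule[100])

# The teacher drifts slowly toward the student when momentum is close to 1.
teacher = ema_update([1.0, 1.0], [0.0, 2.0])
print(teacher)
```

With momentum this close to 1, the teacher changes by less than 0.1% of the teacher-student gap per step, which keeps the prediction targets stable while the student encoder is still moving quickly.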

5. Empirical Results and Benchmark Comparisons

Cosmos-Predict2-2B-Video2World–style V-JEPA 2.1/2-based frozen backbone systems achieve state-of-the-art results across diverse domains:

Something-Something v2: V-JEPA 2.1 ViT-G reaches 77.7% top-1 with a frozen-backbone probe (Mur-Labadia et al., 15 Mar 2026) and V-JEPA 2 ViT-g reaches 77.3%, both surpassing InternVideo2s-1B (69.7% with a frozen backbone) (Assran et al., 11 Jun 2025).
Kinetics-400: 87.7% top-1 (V-JEPA 2.1 ViT-G), close to but below InternVideo2s-1B (89.4%), while outperforming VideoMAEv2 and VideoPrism under equivalent protocols.
Fine-grained/robustness: V-JEPA 2.1 outperforms all sub-2B models in depth estimation (0.307 RMSE on NYUv2) and achieves high class-retention (45.7%) under severe ImageNet-C corruptions. Under occlusion and corruption, frozen V-JEPA 2 models can outperform fully fine-tuned VideoMAE or supervised TimeSformer models in robustness metrics, even if clean top-1 is modestly lower (Alrasheed et al., 15 May 2026).
Domain transfer: Domain-specific pretraining on corpora like PriVi substantially improves balanced accuracy and mAP on low-label, high-variance datasets, outperforming transfer from human-centric or generic backbones (Mueller et al., 12 Nov 2025).

6. Practical Extensions and Design Variants

The foundation of Cosmos-Predict2-2B-Video2World is readily adapted to multiple specialized tasks:

Domain adaptation: Custom pretraining on filtered, unlabeled in-domain videos, followed by probe training, can yield strong low-shot performance with rapid scaling as more data become available (Mueller et al., 12 Nov 2025).
Attentive pooling variants: Probes can use single-query transformers, per-class tokens, or non-parametric [CLS]-pooling, according to the required trade-off between localization and global context (Assran et al., 11 Jun 2025, Mueller et al., 12 Nov 2025).
Compute-efficient variants: The SALT protocol (frozen teacher, decoupled pixel- and latent-space objectives) improves compute–accuracy scaling and maintains representation quality (Li et al., 29 Sep 2025).
Robustness and temporal directionality: Latent-prediction frameworks (V-JEPA 2.1/2) uniquely encode temporal order and, when subjected to corruption or occlusion, degrade more gracefully while preserving class structure (Alrasheed et al., 15 May 2026).

7. Context, Impact, and Comparative Analysis

The Cosmos-Predict2-2B-Video2World paradigm marks a pronounced shift to large-scale, data-centric, self-supervised learning for spatiotemporal representation and prediction. Comparative studies demonstrate distinct architectural trade-offs: V-JEPA-derived features are more robust and less variable across static and dynamic actions than frame-based (spatial) models such as DINOv3 (Kodathala et al., 25 Sep 2025). Systems built on this foundation are increasingly viewed as "world models," with efficacy not just in recognition but in prediction, anticipation, planning, and downstream physical manipulation in robotics (Assran et al., 11 Jun 2025).

A plausible implication is the continued consolidation of large, robust, multimodal pretraining backbones as universal adapters for specialized downstream video understanding and world modeling tasks. Prospective research avenues include the integration of language-conditioned behavior specification, zero-shot adaptation, and efficient distillation for mobile deployment, as well as continued analysis of scaling and robustness trends in the latent prediction regime (Mueller et al., 12 Nov 2025, Mur-Labadia et al., 15 Mar 2026).