InfinityStar: Unified Spacetime Autoregressive Model
- InfinityStar is a unified spacetime autoregressive framework that generates high-fidelity images and dynamic videos using discrete temporal-spatial tokenization.
- It employs a joint spacetime pyramid and sparse attention to decouple spatial detail from temporal motion, enabling efficient industrial-grade 720p video synthesis.
- The framework accelerates inference significantly while supporting versatile tasks like text-to-image, text-to-video, and image-to-video synthesis.
InfinityStar is a unified spacetime autoregressive framework designed for high-resolution image and dynamic video synthesis. Built on purely discrete modeling, InfinityStar jointly captures spatial and temporal dependencies in a single architecture, supporting diverse generation tasks, including text-to-image (T2I), text-to-video (T2V), image-to-video (I2V), and long interactive video synthesis, through direct temporal autoregression. InfinityStar achieves industrial-grade 720p video synthesis with significantly accelerated inference, reaching visual quality competitive with, and efficiency well beyond, both autoregressive and diffusion-based video generators.
1. Motivation and Conceptual Foundation
Visual generation tasks have traditionally been dominated by two paradigms: diffusion models, which offer high quality at the expense of slow, fixed-length inference, and naïve autoregressive models, which can stream content but suffer from lower fidelity and are impractically slow at high resolutions. InfinityStar pursues a unified approach that accurately models both static spatial appearance and dynamic temporal transitions within a discrete autoregressive pipeline. The framework is designed to:
- Decouple spatial appearance and temporal dynamics through a joint spacetime pyramidal tokenization.
- Support a comprehensive suite of generation functions—text→image, text→video, image→video, and interactive long-form video synthesis.
- Achieve visual fidelity matching or exceeding contemporary state-of-the-art methods.
- Realize efficient inference, producing 5 s, 720p videos in under a minute, more than an order of magnitude faster than comparable diffusion models.
Key innovations anchor InfinityStar: spacetime pyramid modeling, efficient causal attention in spacetime via sparse mechanisms and rotary positional encodings, and bitwise self-correction with reduced-memory classifiers.
2. Discrete Spacetime Autoregressive Formulation
2.1 Joint Distribution Modeling
A video is segmented into clips $c_1, \dots, c_T$, and each clip $c_t$ is tokenized into $K$ scales of residual token blocks $R_{t,1}, \dots, R_{t,K}$, where $n_t$ is the clip frame length and scale $k$ has spatial resolution $h_k \times w_k$ with $h_1 \le \dots \le h_K$. The total autoregressive likelihood is expressed as:

$$p(R) \;=\; \prod_{t=1}^{T} \prod_{k=1}^{K} p\!\left(R_{t,k} \,\middle|\, R_{1,1:K}, \dots, R_{t-1,1:K},\; R_{t,1:k-1},\; \text{cond}\right)$$

For $t = 1$ the clip reduces to a single frame ($n_1 = 1$), comprising the “image pyramid” specialization for T2I; clips with $t \ge 2$ represent dynamic content.
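A minimal sketch of this nested factorization as a generation loop is given below; the `model` call signature, the greedy decoding step, and the clip/scale counts are illustrative assumptions, not the released implementation.

```python
import torch

# Illustrative spacetime-pyramid generation loop (a sketch, not the released code).
# Clip-level (temporal) autoregression on the outside, scale-level (spatial) autoregression
# inside; every token block is conditioned on all previously generated blocks plus the text.

def generate_video(model, text_cond, num_clips=3, num_scales=6):
    history, clips = [], []
    for t in range(num_clips):                    # temporal autoregression over clips
        clip_scales = []
        for k in range(num_scales):               # spatial autoregression over scales
            # p(R_{t,k} | R_{<t,1:K}, R_{t,<k}, cond): one forward pass per scale
            bit_logits = model(history, text_cond, clip_idx=t, scale_idx=k)
            r_tk = (bit_logits > 0).float()       # greedy bitwise decoding
            clip_scales.append(r_tk)
            history.append(r_tk)
        clips.append(clip_scales)
    return clips
```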
2.2 Tokenization and Positional Encoding
A pre-trained continuous video VAE provides initial representations. Between the encoder and decoder, InfinityStar inserts a Binary Spherical Quantizer (BSQ), yielding multi-scale residual quantization: each scale $k$ is assigned spatial dimensions $h_k \times w_k$ and a $d$-bit code per token, i.e., an effective vocabulary of size $2^d$. During training, Stochastic Quantizer Depth (SQD) randomly drops each of the last few scales with a fixed probability, compressing information into earlier scales and balancing the representational load.
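The sketch below illustrates multi-scale residual quantization with a BSQ-style binarizer and SQD-style scale dropping; the scale schedule, the drop rule, and the drop probability are assumptions for demonstration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def bsq(z):
    """BSQ-style binarization: project to the unit sphere, keep the sign of each channel."""
    z = F.normalize(z, dim=-1)
    return torch.sign(z) / z.shape[-1] ** 0.5     # unit-norm, effectively d-bit codes

def multi_scale_quantize(feat, scales=((4, 4), (8, 8), (16, 16)), p_drop=0.1, training=True):
    """Quantize a (C, H, W) feature map into residual codes at increasing resolutions."""
    C, H, W = feat.shape
    residual, codes = feat, []
    for i, (h, w) in enumerate(scales):
        # SQD: occasionally skip a later scale so earlier scales must carry more information
        if training and i >= len(scales) // 2 and torch.rand(()) < p_drop:
            continue
        low = F.interpolate(residual[None], size=(h, w), mode="area")[0]
        q = bsq(low.permute(1, 2, 0)).permute(2, 0, 1)
        residual = residual - F.interpolate(q[None], size=(H, W), mode="bilinear")[0]
        codes.append(q)
    return codes
```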
Positional encoding employs “Spacetime RoPE,” which decomposes standard rotary embeddings into a scale ID, time ID, height ID, and width ID, thus encoding each token’s full 4D position $(s, t, h, w)$.
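A minimal sketch of such a decomposed rotary embedding follows; splitting the head dimension into four equal groups and the frequency base are assumptions, not necessarily the paper's parameterization.

```python
import torch

def rope_1d(pos, dim, base=10000.0):
    """Standard 1D RoPE angles for integer positions `pos` and an even sub-dimension `dim`."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]            # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def rotate(x, cos, sin):
    """Rotate channel pairs of x (N, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def spacetime_rope(x, scale_id, time_id, h_id, w_id):
    """x: (N, head_dim) with head_dim divisible by 8; one quarter of the channels per axis."""
    d4 = x.shape[-1] // 4
    out = []
    for ids, chunk in zip((scale_id, time_id, h_id, w_id), x.split(d4, dim=-1)):
        cos, sin = rope_1d(ids, d4)
        out.append(rotate(chunk, cos, sin))
    return torch.cat(out, dim=-1)
```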
2.3 Loss Function and Training Objective
InfinityStar adopts autoregressive maximum likelihood (cross-entropy) over bitwise outputs:

$$\mathcal{L} \;=\; -\sum_{t=1}^{T} \sum_{k=1}^{K} \log p_\theta\!\left(R_{t,k} \mid R_{<},\, \text{cond}\right), \qquad p_\theta\!\left(R_{t,k} \mid \cdot\right) \;=\; \prod_{i=1}^{d} p_\theta\!\left(b_{t,k,i} \mid \cdot\right),$$

where $R_{<}$ denotes all token blocks preceding $R_{t,k}$ in the spacetime order of Section 2.1 and $b_{t,k,i}$ are the $d$ bits of each token.
Bitwise self-correction is applied by randomly flipping teacher-forced input bits with a small probability during training, re-computing the residual targets accordingly, and predicting $d$ bits per token block (as opposed to $2^d$ classes), which improves memory efficiency and optimization.
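A compact sketch of these two ingredients, assuming $d$-bit targets in {0, 1}; the flip probability is a placeholder, and recomputing the downstream residual targets after flipping is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def self_correct_inputs(bits, flip_prob=0.05):
    """Randomly flip a fraction of the teacher-forced input bits (bitwise self-correction)."""
    flip = (torch.rand_like(bits) < flip_prob).float()
    return bits * (1.0 - flip) + (1.0 - bits) * flip

def bitwise_loss(bit_logits, target_bits):
    """d binary cross-entropies per token instead of a single 2^d-way softmax."""
    return F.binary_cross_entropy_with_logits(bit_logits, target_bits)
```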
Tokenizer fine-tuning replaces the usual VAE KL loss with a combination of a commitment loss and an entropy penalty at each BSQ quantization step.
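The sketch below shows one plausible form of these two terms; the commitment formulation and the particular entropy estimator are assumptions rather than the paper's exact losses.

```python
import torch

def commitment_loss(z, z_q):
    # pull encoder features toward their (stop-gradient) quantized codes
    return ((z - z_q.detach()) ** 2).mean()

def entropy_penalty(bit_probs, eps=1e-6):
    # encourage confident per-sample bits (low conditional entropy) while keeping
    # average code usage balanced across the batch (high marginal entropy)
    p = bit_probs.clamp(eps, 1 - eps)
    sample_entropy = -(p * p.log() + (1 - p) * (1 - p).log()).mean()
    p_mean = bit_probs.mean(dim=0).clamp(eps, 1 - eps)
    batch_entropy = -(p_mean * p_mean.log() + (1 - p_mean) * (1 - p_mean).log()).mean()
    return sample_entropy - batch_entropy
```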
3. Model Architecture and Computational Mechanisms
3.1 Autoregressive Transformer Backbone
InfinityStar’s architecture leverages a standard Transformer backbone with approximately 8 billion parameters, trained in four stages:
- T2I pre-training on still images
- T2V fine-tuning successively at 192p, 480p, and finally 720p
Self-attention blocks employ causal masking:
- Intra-clip: block-wise causal masks restrict attention to the current and preceding scales within a clip.
- Inter-clip: Spacetime Sparse Attention (SSA), in which each clip attends exclusively to the last scale of its predecessor, maintaining inter-clip consistency while minimizing memory overhead (a mask-construction sketch follows this list).
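Below is a minimal sketch of such a mask, built from each token's clip and scale indices as described above; the tensor layout and index conventions are assumptions for illustration.

```python
import torch

def build_spacetime_mask(clip_ids, scale_ids, num_scales):
    """clip_ids, scale_ids: (N,) integer tensors giving each token's clip and scale index.
    Returns an (N, N) boolean mask where True means attention is allowed."""
    q_clip, k_clip = clip_ids[:, None], clip_ids[None, :]
    q_scale, k_scale = scale_ids[:, None], scale_ids[None, :]
    # intra-clip, block-wise causal: same clip, same or earlier scale
    intra = (k_clip == q_clip) & (k_scale <= q_scale)
    # inter-clip SSA: only the last scale of the immediately preceding clip is visible
    inter = (k_clip == q_clip - 1) & (k_scale == num_scales - 1)
    return intra | inter
```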
3.2 Specialized Architectural Modules
- Semantic Scale Repetition (SSR): the first few semantic scales are repeated to refine global structure and dynamics; these early scales are low-resolution but crucial for semantic integrity (an illustrative schedule follows this list).
- Spacetime RoPE: Implements decomposed rotary position encoding for efficient cross-scale and cross-time embedding.
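A toy scale schedule with SSR might look as follows; the resolutions, the number of semantic scales, and the repetition count are hypothetical values, not the paper's configuration.

```python
# Hypothetical per-clip scale schedule: repeat the earliest "semantic" scales before
# proceeding to the finer, detail-carrying scales.
base_scales = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16), (32, 32)]
num_semantic, repeats = 2, 2

ssr_schedule = base_scales[:num_semantic] * repeats + base_scales[num_semantic:]
print(ssr_schedule)
# [(1, 1), (2, 2), (1, 1), (2, 2), (4, 4), (8, 8), (16, 16), (32, 32)]
```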
3.3 Sampling and Inference
Sampling follows greedy or temperature-controlled procedures over bitwise outputs. Diversity can be increased with top-$k$ or top-$p$ sampling, but greedy bitwise decoding already yields high-fidelity results without additional heuristics.
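A minimal sketch of bitwise decoding for a single token block, assuming per-bit logits; the temperature handling is illustrative.

```python
import torch

def decode_bits(bit_logits, temperature=0.0):
    """bit_logits: (..., d). Returns d bits per token in {0, 1}."""
    if temperature == 0.0:                        # greedy: pick each bit's most likely value
        return (bit_logits > 0).float()
    probs = torch.sigmoid(bit_logits / temperature)
    return torch.bernoulli(probs)                 # temperature-controlled stochastic sampling
```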
4. Supported Generation Tasks
InfinityStar’s unified design naturally accommodates multiple modalities:
| Task | Conditioning Inputs | Output Structure |
|---|---|---|
| Text-to-Image (T2I) | Text embedding | Single static clip $c_1$ (one frame) |
| Text-to-Video (T2V) | Text embedding (inherited from T2I) | Full clip sequence $c_1, \dots, c_T$ |
| Image-to-Video (I2V) | Input image as the first frame, plus a “continue” prompt | Clips $c_2, \dots, c_T$ conditioned on $c_1$ |
| Long Interactive Video | First frame, previous clip’s semantic details, user instruction | Indefinitely extended sliding windows |
For long interactive video synthesis, a sliding window mechanism operates across 10 s segments with a 5 s stride, conditioning each new cycle on the initial frame, the last clip’s semantic details, and new user instructions. This supports continuous, memory-efficient generation over extended durations.
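The driver loop below sketches this sliding-window procedure; `generate_window`, its arguments, and its return fields are hypothetical names, and the half-window overlap handling is an assumption about how the 5 s stride is applied.

```python
def interactive_generation(model, first_frame, instructions, window_sec=10, stride_sec=5):
    """Hypothetical long-video driver: one 10 s window per user instruction, 5 s stride."""
    frames, semantic_context = [], None
    for instruction in instructions:
        window = model.generate_window(           # hypothetical API
            first_frame=first_frame,              # anchors identity and appearance
            semantic_context=semantic_context,    # coarse-scale tokens of the last clip
            prompt=instruction,
            duration_sec=window_sec,
        )
        # keep only the non-overlapping tail of each window (assumed 5 s of new content)
        new_frames = window.frames if not frames else window.frames[len(window.frames) // 2:]
        frames.extend(new_frames)
        semantic_context = window.last_clip_semantic_tokens
    return frames
```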
5. Empirical Performance and Ablation Analysis
InfinityStar demonstrates strong quantitative and qualitative results:
- VBench Score: 83.74 overall, outperforming autoregressive models (Nova, 80.12; Emu3, 80.96) and several diffusion competitors (HunyuanVideo, 83.24; OpenSora, 79.23) at a lower parameter count; the diffusion model Wan-2.1 reaches 84.70.
- Generation Speed: synthesizing a 5 s, 81-frame 720p video takes 58 s end-to-end on a single high-end GPU, a roughly 32× speedup over Wan-2.1 (diffusion) and 6× over Nova (autoregressive).
- Ablation Studies:
- Removing SSR drops the VBench score from 81.28 to 75.72 (192p ablation).
- Pseudo-spacetime pyramid: 80.30 vs. 81.28 for full spacetime.
- Disabling SQD: 81.07 vs. 81.28.
- Full attention (≈57 GB VRAM) vs. SSA (≈40 GB VRAM): little difference in score (81.28 vs. 80.77), with a substantial efficiency gain.
Notably, InfinityStar is the first discrete autoregressive video generator verified to produce industrial-grade 720p outputs.
6. Design Impact, Limitations, and Open Directions
InfinityStar signifies a shift toward unifying image and video synthesis within a single discrete autoregressive framework. The spacetime pyramid architecture effectively decouples static appearance from dynamic motion, enabling versatile task handling. Bitwise self-correction and sparse attention contribute substantially to efficiency, while building on a continuous VAE with SQD yields fast, compact, and high-fidelity token representations.
Open research questions include narrowing fidelity gaps in high-motion scenes, scaling model size and compute to parity with advanced diffusion models, further inference acceleration (e.g., kernel fusion, early-exit strategies), and expanding semantic-detail conditioning for richer multi-agent or interactive scenarios.
The release of InfinityStar’s code and models is intended to support reproducibility and ongoing innovation in efficient, high-quality visual generation. See https://github.com/FoundationVision/InfinityStar for resources.