InfinityStar: Unified Spacetime Autoregressive Model
- InfinityStar is a unified spacetime autoregressive framework that generates high-fidelity images and dynamic videos using discrete temporal-spatial tokenization.
- It employs a joint spacetime pyramid and sparse attention to decouple spatial detail from temporal motion, enabling efficient industrial-grade 720p video synthesis.
- The framework accelerates inference significantly while supporting versatile tasks like text-to-image, text-to-video, and image-to-video synthesis.
InfinityStar is a unified spacetime autoregressive framework designed for high-resolution image and dynamic video synthesis. Built on purely discrete modeling, InfinityStar jointly captures spatial and temporal dependencies in a single architecture, supporting diverse generation tasks, including text-to-image (T2I), text-to-video (T2V), image-to-video (I2V), and long interactive video synthesis, through direct temporal autoregression. InfinityStar achieves industrial-grade 720p video synthesis with significantly accelerated inference, reaching visual quality competitive with, and efficiency well beyond, both autoregressive and diffusion-based video generators.
1. Motivation and Conceptual Foundation
Visual generation tasks have traditionally been dominated by two paradigms: diffusion models, which offer high quality at the expense of slow, fixed-length inference, and naïve autoregressive models, which can stream content but suffer from lower fidelity and are impractically slow at high resolutions. InfinityStar pursues a unified approach that accurately models both static spatial appearance and dynamic temporal transitions within a discrete autoregressive pipeline. The framework is designed to:
- Decouple spatial appearance and temporal dynamics through a joint spacetime pyramidal tokenization.
- Support a comprehensive suite of generation functions—text→image, text→video, image→video, and interactive long-form video synthesis.
- Achieve visual fidelity matching or exceeding contemporary state-of-the-art methods.
- Realize efficient inference, producing 5 s, 720p videos in under a minute, more than an order of magnitude faster than comparable diffusion models.
Key innovations anchor InfinityStar: spacetime pyramid modeling, efficient causal attention in spacetime via sparse mechanisms and rotary positional encodings, and bitwise self-correction with reduced-memory classifiers.
2. Discrete Spacetime Autoregressive Formulation
2.1 Joint Distribution Modeling
A video is segmented into clips $c_1, \dots, c_T$, and each clip $c_t$ is tokenized into $K$ scales of residual token blocks $R_{t,1}, \dots, R_{t,K}$, where $n_t$ is the clip frame length and scale $k$ has spatial resolution $h_k \times w_k$ with $h_1 \le \dots \le h_K$. The total autoregressive likelihood is expressed as:

$$p(R) \;=\; \prod_{t=1}^{T} \prod_{k=1}^{K} p\!\left(R_{t,k} \,\middle|\, R_{1,1:K}, \dots, R_{t-1,1:K},\; R_{t,1:k-1},\; \text{cond}\right)$$

For $t = 1$ the clip reduces to a single frame ($n_1 = 1$), comprising the “image pyramid” specialization for T2I; clips with $t \ge 2$ represent dynamic content.
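A minimal sketch of this nested factorization as a generation loop is given below; the `model` call signature, the greedy decoding step, and the clip/scale counts are illustrative assumptions, not the released implementation.

```python
import torch

# Illustrative spacetime-pyramid generation loop (a sketch, not the released code).
# Clip-level (temporal) autoregression on the outside, scale-level (spatial) autoregression
# inside; every token block is conditioned on all previously generated blocks plus the text.

def generate_video(model, text_cond, num_clips=3, num_scales=6):
    history, clips = [], []
    for t in range(num_clips):                    # temporal autoregression over clips
        clip_scales = []
        for k in range(num_scales):               # spatial autoregression over scales
            # p(R_{t,k} | R_{<t,1:K}, R_{t,<k}, cond): one forward pass per scale
            bit_logits = model(history, text_cond, clip_idx=t, scale_idx=k)
            r_tk = (bit_logits > 0).float()       # greedy bitwise decoding
            clip_scales.append(r_tk)
            history.append(r_tk)
        clips.append(clip_scales)
    return clips
```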
2.2 Tokenization and Positional Encoding
A pre-trained continuous video VAE provides initial representations. Between the encoder and decoder, InfinityStar inserts a Binary Spherical Quantizer (BSQ), yielding multi-scale residual quantization: each scale $k$ is assigned spatial dimensions $h_k \times w_k$ and a $d$-bit code per token, i.e., an effective vocabulary of size $2^d$. During training, Stochastic Quantizer Depth (SQD) randomly drops each of the last few scales with a fixed probability, compressing information into earlier scales and balancing the representational load.
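The sketch below illustrates multi-scale residual quantization with a BSQ-style binarizer and SQD-style scale dropping; the scale schedule, the drop rule, and the drop probability are assumptions for demonstration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def bsq(z):
    """BSQ-style binarization: project to the unit sphere, keep the sign of each channel."""
    z = F.normalize(z, dim=-1)
    return torch.sign(z) / z.shape[-1] ** 0.5     # unit-norm, effectively d-bit codes

def multi_scale_quantize(feat, scales=((4, 4), (8, 8), (16, 16)), p_drop=0.1, training=True):
    """Quantize a (C, H, W) feature map into residual codes at increasing resolutions."""
    C, H, W = feat.shape
    residual, codes = feat, []
    for i, (h, w) in enumerate(scales):
        # SQD: occasionally skip a later scale so earlier scales must carry more information
        if training and i >= len(scales) // 2 and torch.rand(()) < p_drop:
            continue
        low = F.interpolate(residual[None], size=(h, w), mode="area")[0]
        q = bsq(low.permute(1, 2, 0)).permute(2, 0, 1)
        residual = residual - F.interpolate(q[None], size=(H, W), mode="bilinear")[0]
        codes.append(q)
    return codes
```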
Positional encoding employs “Spacetime RoPE,” which decomposes standard rotary embeddings into a scale ID, time ID, height ID, and width ID, thus encoding each token’s full 4D position $(s, t, h, w)$.
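A minimal sketch of such a decomposed rotary embedding follows; splitting the head dimension into four equal groups and the frequency base are assumptions, not necessarily the paper's parameterization.

```python
import torch

def rope_1d(pos, dim, base=10000.0):
    """Standard 1D RoPE angles for integer positions `pos` and an even sub-dimension `dim`."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos.float()[:, None] * freqs[None, :]            # (N, dim/2)
    return torch.cos(angles), torch.sin(angles)

def rotate(x, cos, sin):
    """Rotate channel pairs of x (N, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def spacetime_rope(x, scale_id, time_id, h_id, w_id):
    """x: (N, head_dim) with head_dim divisible by 8; one quarter of the channels per axis."""
    d4 = x.shape[-1] // 4
    out = []
    for ids, chunk in zip((scale_id, time_id, h_id, w_id), x.split(d4, dim=-1)):
        cos, sin = rope_1d(ids, d4)
        out.append(rotate(chunk, cos, sin))
    return torch.cat(out, dim=-1)
```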
2.3 Loss Function and Training Objective
InfinityStar adopts autoregressive maximum likelihood (cross-entropy) over bitwise outputs:

$$\mathcal{L} \;=\; -\sum_{t=1}^{T} \sum_{k=1}^{K} \log p_\theta\!\left(R_{t,k} \mid R_{<},\, \text{cond}\right), \qquad p_\theta\!\left(R_{t,k} \mid \cdot\right) \;=\; \prod_{i=1}^{d} p_\theta\!\left(b_{t,k,i} \mid \cdot\right),$$

where $R_{<}$ denotes all token blocks preceding $R_{t,k}$ in the spacetime order of Section 2.1 and $b_{t,k,i}$ are the $d$ bits of each token.
Bitwise self-correction is applied by randomly flipping teacher-forced input bits with a small probability during training, re-computing the residual targets accordingly, and predicting $d$ bits per token block (as opposed to $2^d$ classes), which improves memory efficiency and optimization.
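A compact sketch of these two ingredients, assuming $d$-bit targets in {0, 1}; the flip probability is a placeholder, and recomputing the downstream residual targets after flipping is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def self_correct_inputs(bits, flip_prob=0.05):
    """Randomly flip a fraction of the teacher-forced input bits (bitwise self-correction)."""
    flip = (torch.rand_like(bits) < flip_prob).float()
    return bits * (1.0 - flip) + (1.0 - bits) * flip

def bitwise_loss(bit_logits, target_bits):
    """d binary cross-entropies per token instead of a single 2^d-way softmax."""
    return F.binary_cross_entropy_with_logits(bit_logits, target_bits)
```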
Tokenizer fine-tuning replaces the usual VAE KL loss with a combination of a commitment loss and an entropy penalty at each BSQ quantization step.
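The sketch below shows one plausible form of these two terms; the commitment formulation and the particular entropy estimator are assumptions rather than the paper's exact losses.

```python
import torch

def commitment_loss(z, z_q):
    # pull encoder features toward their (stop-gradient) quantized codes
    return ((z - z_q.detach()) ** 2).mean()

def entropy_penalty(bit_probs, eps=1e-6):
    # encourage confident per-sample bits (low conditional entropy) while keeping
    # average code usage balanced across the batch (high marginal entropy)
    p = bit_probs.clamp(eps, 1 - eps)
    sample_entropy = -(p * p.log() + (1 - p) * (1 - p).log()).mean()
    p_mean = bit_probs.mean(dim=0).clamp(eps, 1 - eps)
    batch_entropy = -(p_mean * p_mean.log() + (1 - p_mean) * (1 - p_mean).log()).mean()
    return sample_entropy - batch_entropy
```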
3. Model Architecture and Computational Mechanisms
3.1 Autoregressive Transformer Backbone
InfinityStar’s architecture leverages a standard Transformer backbone with approximately 8 billion parameters, trained in four stages:
- T2I pre-training on still images
- T2V fine-tuning successively at 192p, 480p, and finally 720p
Self-attention blocks employ causal masking:
- Intra-clip: block-wise causal masks restrict attention to the current and preceding scales within a clip.
- Inter-clip: Spacetime Sparse Attention (SSA), in which each clip attends exclusively to the last scale of its predecessor, maintaining inter-clip consistency while minimizing memory overhead (a mask-construction sketch follows this list).
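Below is a minimal sketch of such a mask, built from each token's clip and scale indices as described above; the tensor layout and index conventions are assumptions for illustration.

```python
import torch

def build_spacetime_mask(clip_ids, scale_ids, num_scales):
    """clip_ids, scale_ids: (N,) integer tensors giving each token's clip and scale index.
    Returns an (N, N) boolean mask where True means attention is allowed."""
    q_clip, k_clip = clip_ids[:, None], clip_ids[None, :]
    q_scale, k_scale = scale_ids[:, None], scale_ids[None, :]
    # intra-clip, block-wise causal: same clip, same or earlier scale
    intra = (k_clip == q_clip) & (k_scale <= q_scale)
    # inter-clip SSA: only the last scale of the immediately preceding clip is visible
    inter = (k_clip == q_clip - 1) & (k_scale == num_scales - 1)
    return intra | inter
```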
3.2 Specialized Architectural Modules
- Semantic Scale Repetition (SSR): the first few semantic scales are repeated to refine global structure and dynamics; these early scales are low-resolution but crucial for semantic integrity (an illustrative schedule follows this list).
- Spacetime RoPE: Implements decomposed rotary position encoding for efficient cross-scale and cross-time embedding.
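A toy scale schedule with SSR might look as follows; the resolutions, the number of semantic scales, and the repetition count are hypothetical values, not the paper's configuration.

```python
# Hypothetical per-clip scale schedule: repeat the earliest "semantic" scales before
# proceeding to the finer, detail-carrying scales.
base_scales = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16), (32, 32)]
num_semantic, repeats = 2, 2

ssr_schedule = base_scales[:num_semantic] * repeats + base_scales[num_semantic:]
print(ssr_schedule)
# [(1, 1), (2, 2), (1, 1), (2, 2), (4, 4), (8, 8), (16, 16), (32, 32)]
```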
3.3 Sampling and Inference
Sampling follows greedy or temperature-controlled procedures over bitwise outputs. Diversity can be increased with top-$k$ or top-$p$ sampling, but greedy bitwise decoding already yields high-fidelity results without additional heuristics.
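A minimal sketch of bitwise decoding for a single token block, assuming per-bit logits; the temperature handling is illustrative.

```python
import torch

def decode_bits(bit_logits, temperature=0.0):
    """bit_logits: (..., d). Returns d bits per token in {0, 1}."""
    if temperature == 0.0:                        # greedy: pick each bit's most likely value
        return (bit_logits > 0).float()
    probs = torch.sigmoid(bit_logits / temperature)
    return torch.bernoulli(probs)                 # temperature-controlled stochastic sampling
```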
4. Supported Generation Tasks
InfinityStar’s unified design naturally accommodates multiple modalities:
| Task | Conditioning Inputs | Output Structure |
|---|---|---|
| Text-to-Image (T2I) | Text embedding | Single static clip $c_1$ (one frame) |
| Text-to-Video (T2V) | Text embedding (inherited from T2I) | Full clip sequence $c_1, \dots, c_T$ |
| Image-to-Video (I2V) | Input image as the first frame, plus a “continue” prompt | Clips $c_2, \dots, c_T$ conditioned on $c_1$ |
| Long Interactive Video | First frame, previous clip’s semantic details, user instruction | Indefinitely extended sliding windows |
For long interactive video synthesis, a sliding window mechanism operates across 10 s segments with a 5 s stride, conditioning each new cycle on the initial frame, the last clip’s semantic details, and new user instructions. This supports continuous, memory-efficient generation over extended durations.
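The driver loop below sketches this sliding-window procedure; `generate_window`, its arguments, and its return fields are hypothetical names, and the half-window overlap handling is an assumption about how the 5 s stride is applied.

```python
def interactive_generation(model, first_frame, instructions, window_sec=10, stride_sec=5):
    """Hypothetical long-video driver: one 10 s window per user instruction, 5 s stride."""
    frames, semantic_context = [], None
    for instruction in instructions:
        window = model.generate_window(           # hypothetical API
            first_frame=first_frame,              # anchors identity and appearance
            semantic_context=semantic_context,    # coarse-scale tokens of the last clip
            prompt=instruction,
            duration_sec=window_sec,
        )
        # keep only the non-overlapping tail of each window (assumed 5 s of new content)
        new_frames = window.frames if not frames else window.frames[len(window.frames) // 2:]
        frames.extend(new_frames)
        semantic_context = window.last_clip_semantic_tokens
    return frames
```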
5. Empirical Performance and Ablation Analysis
InfinityStar demonstrates strong quantitative and qualitative results:
- VBench Score: 83.74 overall, outperforming autoregressive models (Nova, 80.12; Emu3, 80.96) and several diffusion competitors (HunyuanVideo, 83.24; OpenSora, 79.23) at a lower parameter count; the diffusion model Wan-2.1 reaches 84.70.
- Generation Speed: synthesizing a 5 s, 81-frame 720p video takes 58 s end-to-end on a single high-end GPU, a roughly 32× speedup over Wan-2.1 (diffusion) and 6× over Nova (autoregressive).
- Ablation Studies:
- Removing SSR drops the VBench score from 81.28 to 75.72 (192p ablation).
- Pseudo-spacetime pyramid: 80.30 vs. 81.28 for full spacetime.
- Disabling SQD: 81.07 vs. 81.28.
- Full attention (≈57 GB VRAM) vs. SSA (≈40 GB VRAM): little difference in score (81.28 vs. 80.77), with a substantial efficiency gain.
Notably, InfinityStar is the first discrete autoregressive video generator verified to produce industrial-grade 720p outputs.
6. Design Impact, Limitations, and Open Directions
InfinityStar signifies a shift toward unifying image and video synthesis within a single discrete autoregressive framework. The spacetime pyramid architecture effectively decouples static appearance from dynamic motion, enabling versatile task handling. Bitwise self-correction and sparse attention contribute substantially to efficiency, while building on a continuous VAE with SQD yields fast, compact, and high-fidelity token representations.
Open research questions include narrowing fidelity gaps in high-motion scenes, scaling model size and compute to parity with advanced diffusion models, further inference acceleration (e.g., kernel fusion, early-exit strategies), and expanding semantic-detail conditioning for richer multi-agent or interactive scenarios.
The release of InfinityStar’s code and models is intended to support reproducibility and ongoing innovation in efficient, high-quality visual generation. See https://github.com/FoundationVision/InfinityStar for resources.