
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation (2502.20388v2)

Published 27 Feb 2025 in cs.CV

Abstract: Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.

Summary

  • The paper introduces xAR, a generalized autoregressive framework for visual generation extending next-token to next-X prediction and using noisy context learning to mitigate exposure bias.
  • xAR redefines prediction units (X) beyond single tokens to flexible entities like cells or scales, leveraging continuous entity regression via flow-matching.
  • Evaluations show xAR-H achieves a state-of-the-art FID of 1.24 on ImageNet-256 and xAR-L an FID of 1.70 on ImageNet-512, outperforming prior methods.

The paper introduces xAR, a generalized autoregressive (AR) framework for visual generation that extends the conventional next-token prediction paradigm to next-X prediction. This framework addresses limitations in traditional AR models, such as the lack of a universally agreed-upon token definition for 2D image structures and the issue of error accumulation due to exposure bias.

The core idea of xAR is to redefine the prediction unit from a single token to a flexible entity X. X can represent an individual patch token, a cell (a $k \times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even an entire image. The paper also reformulates discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning (NCL), which mitigates exposure bias.
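The entity groupings above can be sketched with simple array reshaping. The following is an illustrative sketch, not the paper's implementation: the function names, the patch-grid layout, and the choice of non-local grouping are assumptions.

```python
import numpy as np

def as_cells(tokens, k):
    """Group an (H, W, D) patch-token grid into non-overlapping k x k cells.

    Illustrative only: xAR's actual tokenizer layout may differ.
    """
    H, W, D = tokens.shape
    assert H % k == 0 and W % k == 0
    # Split rows and columns into blocks, then bring the two block axes together.
    cells = tokens.reshape(H // k, k, W // k, k, D).transpose(0, 2, 1, 3, 4)
    return cells.reshape(-1, k * k, D)  # (num_cells, k*k patches, D)

def as_subsamples(tokens, stride):
    """Non-local grouping: each subsample collects every `stride`-th patch."""
    H, W, D = tokens.shape
    subs = [tokens[i::stride, j::stride].reshape(-1, D)
            for i in range(stride) for j in range(stride)]
    return np.stack(subs)  # (stride**2 subsamples, H*W/stride**2, D)

tokens = np.arange(16 * 16 * 4, dtype=np.float32).reshape(16, 16, 4)
print(as_cells(tokens, 2).shape)       # (64, 4, 4)
print(as_subsamples(tokens, 4).shape)  # (16, 16, 4)
```

With `k = 1` this reduces to ordinary per-token prediction, and with `k` equal to the full grid size a single "cell" is the whole image, matching the spectrum of entities the paper describes.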

The authors argue that xAR offers two key advantages: flexible prediction units that capture different contextual granularities and spatial structures, and reduced exposure bias by avoiding reliance on teacher forcing.

Key aspects of the proposed method:

  • Next-X Prediction: xAR generalizes next-token prediction to next-X prediction, where X can be a token, cell, subsample, scale, or entire image. This flexibility allows the model to capture different contextual granularities and spatial structures.
  • Noisy Context Learning (NCL): xAR introduces NCL, where the model is trained on noisy entities instead of ground truth inputs. This reduces over-reliance on pristine contexts, improving robustness and mitigating exposure bias. During training, noise is added to the input entities using randomly sampled noise time steps $t_n$ and noise samples $\epsilon_n$ from a Gaussian distribution. The interpolated input $F_n^{t_n}$ is constructed as $F_n^{t_n} = (1 - t_n) X_n + t_n \epsilon_n$, where $X_n$ is the $n$-th entity. The model is trained to predict the velocity $V_n^{t_n} = \epsilon_n - X_n$ using all preceding and current noisy entities. The loss function is defined as:

    $$\mathcal{L} = \sum_{n=1}^N \Bigl\| \mathrm{xAR}\bigl(\{F_1^{t_1}, \dots, F_n^{t_n}\}, t_n; \theta\bigr) - V_n^{t_n} \Bigr\|^2$$

    where:

    • $\mathcal{L}$ is the loss function
    • $N$ is the number of entities
    • $F_i^{t_i}$ is the noisy entity at the $i$-th step
    • $t_i$ is the noise time step for the $i$-th entity
    • $\theta$ represents the parameters of the xAR model
    • $V_n^{t_n}$ is the velocity target at step $n$
  • Inference Scheme: At inference, xAR performs autoregressive prediction at the entity level. The model begins by predicting an initial cell $\hat{X}_1$ from a Gaussian noise sample $\epsilon_1$ via flow matching. Conditioned on the clean estimate $\hat{X}_1$, xAR generates the next cell $\hat{X}_2$ from another Gaussian noise sample $\epsilon_2$. This process continues autoregressively, refining the image at the cell level.
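The training objective and entity-level sampling described above can be sketched as follows. This is a minimal numpy sketch under stated assumptions: `model` is a toy stand-in for the xAR transformer (not the paper's architecture), entities are flat vectors, and the Euler step count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                          # entity dimensionality (illustrative)
W = rng.normal(size=(D, D)) * 0.1

def model(noisy_entities, t):
    """Toy velocity predictor: stands in for the xAR transformer."""
    return noisy_entities[-1] @ W + t

def ncl_loss(entities):
    """Noisy Context Learning: interpolate each entity with Gaussian noise
    and regress the flow-matching velocity, conditioning on the noisy prefix."""
    loss, noisy_prefix = 0.0, []
    for X_n in entities:
        t_n = rng.uniform()                  # random noise time step t_n
        eps_n = rng.normal(size=X_n.shape)   # Gaussian noise sample eps_n
        F_n = (1 - t_n) * X_n + t_n * eps_n  # F_n^{t_n} = (1-t_n) X_n + t_n eps_n
        noisy_prefix.append(F_n)
        V_n = eps_n - X_n                    # velocity target V_n^{t_n}
        pred = model(noisy_prefix, t_n)
        loss += np.sum((pred - V_n) ** 2)
    return loss

def sample_entity(context, steps=10):
    """Inference: integrate the flow from noise (t=1) toward data (t=0)
    with Euler steps, conditioned on previously generated clean entities."""
    F = rng.normal(size=D)                   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        V = model(context + [F], t)
        F = F - dt * V                       # dF/dt = V, stepping toward t=0
    return F
```

Note the direction of integration: since $F^t = (1-t)X + t\epsilon$ runs from data at $t=0$ to pure noise at $t=1$, sampling starts at $t=1$ and subtracts the predicted velocity at each step.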

The paper includes experiments on ImageNet at $256 \times 256$ and $512 \times 512$ resolutions. The models are evaluated using Fréchet Inception Distance (FID), Inception Score (IS), Precision, and Recall. The results show that xAR-H achieves a state-of-the-art FID of 1.24 on ImageNet-256 without relying on vision foundation models or guidance interval sampling. The base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. On ImageNet-512, xAR-L sets a new state-of-the-art FID of 1.70.

Ablation studies demonstrate that cell-based xAR achieves the best performance, with an FID of 2.48, outperforming token-based xAR by 1.03 FID. A cell size of $8 \times 8$ tokens performs best. The ablation study also analyzes the impact of NCL, showing that conditioning on all clean entities results in suboptimal performance; the best results come from the "random noise" setting, where no constraints are imposed on the noise time steps.
