
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation (2502.20388v2)

Published 27 Feb 2025 in cs.CV

Abstract: Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.

Summary

  • The paper introduces xAR, a generalized autoregressive framework for visual generation extending next-token to next-X prediction and using noisy context learning to mitigate exposure bias.
  • xAR redefines prediction units (X) beyond single tokens to flexible entities like cells or scales, leveraging continuous entity regression via flow-matching.
  • Evaluations show xAR-H achieves a state-of-the-art FID of 1.24 on ImageNet-256 and xAR-L an FID of 1.70 on ImageNet-512, outperforming prior methods.

The paper introduces xAR, a generalized autoregressive (AR) framework for visual generation that extends the conventional next-token prediction paradigm to next-X prediction. This framework addresses limitations in traditional AR models, such as the lack of a universally agreed-upon token definition for 2D image structures and the issue of error accumulation due to exposure bias.

The core idea of xAR is to redefine the prediction unit from a single token to a flexible entity X. X can represent an individual patch token, a cell (a $k \times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even an entire image. The paper also reformulates discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning (NCL), which mitigates exposure bias.
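The entity groupings above can be sketched with simple array reshaping. The following is an illustrative sketch, not the paper's implementation: the function names, the patch-grid layout, and the choice of non-local grouping are assumptions.

```python
import numpy as np

def as_cells(tokens, k):
    """Group an (H, W, D) patch-token grid into non-overlapping k x k cells.

    Illustrative only: xAR's actual tokenizer layout may differ.
    """
    H, W, D = tokens.shape
    assert H % k == 0 and W % k == 0
    # Split rows and columns into blocks, then bring the two block axes together.
    cells = tokens.reshape(H // k, k, W // k, k, D).transpose(0, 2, 1, 3, 4)
    return cells.reshape(-1, k * k, D)  # (num_cells, k*k patches, D)

def as_subsamples(tokens, stride):
    """Non-local grouping: each subsample collects every `stride`-th patch."""
    H, W, D = tokens.shape
    subs = [tokens[i::stride, j::stride].reshape(-1, D)
            for i in range(stride) for j in range(stride)]
    return np.stack(subs)  # (stride**2 subsamples, H*W/stride**2, D)

tokens = np.arange(16 * 16 * 4, dtype=np.float32).reshape(16, 16, 4)
print(as_cells(tokens, 2).shape)       # (64, 4, 4)
print(as_subsamples(tokens, 4).shape)  # (16, 16, 4)
```

With `k = 1` this reduces to ordinary per-token prediction, and with `k` equal to the full grid size a single "cell" is the whole image, matching the spectrum of entities the paper describes.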

The authors argue that xAR offers two key advantages: flexible prediction units that capture different contextual granularities and spatial structures, and reduced exposure bias by avoiding reliance on teacher forcing.

Key aspects of the proposed method:

  • Next-X Prediction: xAR generalizes next-token prediction to next-X prediction, where X can be a token, cell, subsample, scale, or entire image. This flexibility allows the model to capture different contextual granularities and spatial structures.
  • Noisy Context Learning (NCL): xAR introduces NCL, where the model is trained on noisy entities instead of ground truth inputs. This reduces over-reliance on pristine contexts, improving robustness and mitigating exposure bias. During training, noise is added to the input entities using randomly sampled noise time steps $t_n$ and noise samples $\epsilon_n$ from a Gaussian distribution. The interpolated input $F_n^{t_n}$ is constructed as $F_n^{t_n} = (1 - t_n) X_n + t_n \epsilon_n$, where $X_n$ is the $n$-th entity. The model is trained to predict the velocity $V_n^{t_n} = \epsilon_n - X_n$ using all preceding and current noisy entities. The loss function is defined as:

    $$\mathcal{L} = \sum_{n=1}^N \Bigl\| \mathrm{xAR}\bigl(\{F_1^{t_1}, \dots, F_n^{t_n}\}, t_n; \theta\bigr) - V_n^{t_n} \Bigr\|^2$$

    where:

    • $\mathcal{L}$ is the loss function
    • $N$ is the number of entities
    • $F_i^{t_i}$ is the noisy entity at the $i$-th step
    • $t_i$ is the noise time step for the $i$-th entity
    • $\theta$ represents the parameters of the xAR model
    • $V_n^{t_n}$ is the velocity target at step $n$
  • Inference Scheme: At inference, xAR performs autoregressive prediction at the entity level. The model begins by predicting an initial cell $\hat{X}_1$ from a Gaussian noise sample $\epsilon_1$ via flow matching. Conditioned on the clean estimate $\hat{X}_1$, xAR generates the next cell $\hat{X}_2$ from another Gaussian noise sample $\epsilon_2$. This process continues autoregressively, refining the image at the cell level.
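The training objective and entity-level sampling described above can be sketched as follows. This is a minimal numpy sketch under stated assumptions: `model` is a toy stand-in for the xAR transformer (not the paper's architecture), entities are flat vectors, and the Euler step count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                          # entity dimensionality (illustrative)
W = rng.normal(size=(D, D)) * 0.1

def model(noisy_entities, t):
    """Toy velocity predictor: stands in for the xAR transformer."""
    return noisy_entities[-1] @ W + t

def ncl_loss(entities):
    """Noisy Context Learning: interpolate each entity with Gaussian noise
    and regress the flow-matching velocity, conditioning on the noisy prefix."""
    loss, noisy_prefix = 0.0, []
    for X_n in entities:
        t_n = rng.uniform()                  # random noise time step t_n
        eps_n = rng.normal(size=X_n.shape)   # Gaussian noise sample eps_n
        F_n = (1 - t_n) * X_n + t_n * eps_n  # F_n^{t_n} = (1-t_n) X_n + t_n eps_n
        noisy_prefix.append(F_n)
        V_n = eps_n - X_n                    # velocity target V_n^{t_n}
        pred = model(noisy_prefix, t_n)
        loss += np.sum((pred - V_n) ** 2)
    return loss

def sample_entity(context, steps=10):
    """Inference: integrate the flow from noise (t=1) toward data (t=0)
    with Euler steps, conditioned on previously generated clean entities."""
    F = rng.normal(size=D)                   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        V = model(context + [F], t)
        F = F - dt * V                       # dF/dt = V, stepping toward t=0
    return F
```

Note the direction of integration: since $F^t = (1-t)X + t\epsilon$ runs from data at $t=0$ to pure noise at $t=1$, sampling starts at $t=1$ and subtracts the predicted velocity at each step.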

The paper includes experiments on ImageNet at $256 \times 256$ and $512 \times 512$ resolutions. The models are evaluated using Fréchet Inception Distance (FID), Inception Score (IS), Precision, and Recall. The results show that xAR-H achieves a state-of-the-art FID of 1.24 on ImageNet-256 without relying on vision foundation models or guidance interval sampling. The base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. On ImageNet-512, xAR-L sets a new state-of-the-art FID of 1.70.

Ablation studies demonstrate that cell-based xAR achieves the best performance, with an FID of 2.48, outperforming token-based xAR by 1.03 FID. A cell size of $8 \times 8$ tokens performs best. The ablation study also analyzes the impact of NCL, showing that conditioning on all clean entities results in suboptimal performance; the best results come from the "random noise" setting, where no constraints are imposed on the noise time steps.
