NextStep-1: AR Text-to-Image Framework
- NextStep-1 is an autoregressive text-to-image generation framework that unifies text and continuous visual tokens for scalable, high-fidelity synthesis.
- It employs a 14-billion-parameter Transformer backbone and a lightweight 157M-parameter visual head using a flow matching loss for continuous token prediction.
- The model achieves state-of-the-art benchmark performance and enables robust image editing through integrated next-token prediction objectives.
NextStep-1 is an autoregressive text-to-image generation framework that utilizes a unified sequence modeling approach over mixed discrete (text) and continuous (image) tokens at large scale. Centered on a 14-billion-parameter Transformer and a lightweight flow matching visual head with 157 million parameters, NextStep-1 demonstrates state-of-the-art performance among autoregressive (AR) generation models and introduces a scalable solution for both high-fidelity image synthesis and editing using only next-token prediction objectives.
1. Architectural Overview
The NextStep-1 model is structured as an autoregressive causal Transformer operating on a multimodal token sequence $x = (x_1, \dots, x_N)$, where each $x_i$ is either a discrete text token (processed with a standard LM head) or a continuous image token (processed with a flow matching head).
Key architectural components:
- Backbone: 14B-parameter causal Transformer, initialized from Qwen2.5-14B, responsible for modeling joint token distributions.
- Text Head: Conventional LM head utilizing cross-entropy loss on discrete tokens.
- Visual Head: 157M-parameter MLP with 12 layers and 1536 hidden dimensions; implements a flow matching objective for continuous latent image tokens.
- Tokenizer: Images are projected into a continuous latent space (Flux VAE, 16-channel latents) and sequentialized for the Transformer (optionally via space-to-depth conversion).
This unified design explicitly factors the joint token likelihood as $p_\theta(x) = \prod_{i=1}^{N} p_\theta(x_i \mid x_{<i})$, with a single parameterization $\theta$ covering both modalities within the same AR sequence.
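The two-headed layout described above can be summarized in a short PyTorch sketch. This is a minimal illustration, not the released implementation: the class names, the `backbone` call signature, and the latent/hidden dimensions are assumptions; only the 12-layer, 1536-hidden-dimension MLP head and the single shared backbone follow the description above.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Lightweight MLP predicting a velocity for one continuous image token,
    conditioned on the backbone hidden state and a flow time t in [0, 1]."""
    def __init__(self, backbone_dim: int, latent_dim: int,
                 hidden_dim: int = 1536, num_layers: int = 12):
        super().__init__()
        layers = [nn.Linear(backbone_dim + latent_dim + 1, hidden_dim), nn.SiLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.SiLU()]
        layers.append(nn.Linear(hidden_dim, latent_dim))    # predicted velocity
        self.mlp = nn.Sequential(*layers)

    def forward(self, h, x_t, t):
        # h: [B, backbone_dim], x_t: [B, latent_dim] noised token, t: [B, 1]
        return self.mlp(torch.cat([h, x_t, t], dim=-1))

class TwoHeadedARModel(nn.Module):
    """Causal Transformer backbone with an LM head for discrete text tokens and
    a flow matching head for continuous image tokens (interface is illustrative)."""
    def __init__(self, backbone, backbone_dim: int, vocab_size: int, latent_dim: int):
        super().__init__()
        self.backbone = backbone                            # decoder-only Transformer
        self.lm_head = nn.Linear(backbone_dim, vocab_size)  # cross-entropy head for text
        self.visual_head = FlowMatchingHead(backbone_dim, latent_dim)

    def forward(self, token_embeds, attention_mask=None):
        # token_embeds: one interleaved sequence of text and image-latent embeddings
        h = self.backbone(token_embeds, attention_mask)     # per-position hidden states
        return h                                            # routed to lm_head or visual_head per position
```

The key point carried over from the description above is that both heads read hidden states from the same causal backbone, so a single next-token pass serves both modalities.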
2. Training Paradigm and Objectives
Training is guided by simultaneous next-token prediction objectives for both text and image modalities:
- Text Module: Cross-entropy loss over discrete vocabulary.
- Visual Module: Flow matching loss, defined as the MSE between the model-predicted velocity (flow) and the velocity that transports a noised image patch toward its clean (target) counterpart. The head is conditioned on the autoregressive context up to the current patch.
- Global Loss: Weighted sum of both objectives, $\mathcal{L} = \lambda_{\text{text}} \, \mathcal{L}_{\text{CE}} + \lambda_{\text{img}} \, \mathcal{L}_{\text{FM}}$, where the weights are tuned to balance text and visual learning (a sketch of this weighted combination follows this list).
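A minimal sketch of how the two objectives could be combined per training batch is given below. The linear noise-to-data interpolation path, the helper signature, and the single weight `lambda_img` are illustrative assumptions; only the cross-entropy plus velocity-MSE structure follows the description above.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, h_text, text_targets, h_img, img_latents,
                  lambda_img: float = 1.0):
    """Weighted sum of a cross-entropy loss on discrete text tokens and a
    flow matching (velocity MSE) loss on continuous image tokens.
    The noise->data path and the weight lambda_img are illustrative."""
    # Text: standard next-token cross-entropy over the discrete vocabulary.
    logits = model.lm_head(h_text)                    # [N_text, vocab_size]
    loss_text = F.cross_entropy(logits, text_targets)

    # Image: sample a flow time t, noise the clean latents along a linear path,
    # and regress the velocity pointing from noise toward the clean latent.
    noise = torch.randn_like(img_latents)
    t = torch.rand(img_latents.size(0), 1, device=img_latents.device)
    x_t = (1.0 - t) * noise + t * img_latents         # interpolated point at time t
    v_target = img_latents - noise                    # ground-truth velocity
    v_pred = model.visual_head(h_img, x_t, t)
    loss_img = F.mse_loss(v_pred, v_target)

    return loss_text + lambda_img * loss_img
```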
The system is trained end-to-end on sequences constructed by interleaving text and visual tokens, so the model learns to handle modality transitions and cross-modal context within the AR framework. For visual tokenization, image latents may undergo additional pre-processing such as space-to-depth reshaping to yield a 1D sequence compatible with the Transformer input (sketched below).
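The space-to-depth step can be illustrated as a pure reshape of the VAE latent grid into a 1D token sequence. The 16-channel latent follows the tokenizer description above, while the patch size of 2 and the function name are assumptions for illustration.

```python
import torch

def latents_to_token_sequence(latents: torch.Tensor, patch: int = 2) -> torch.Tensor:
    """Space-to-depth: fold a [B, C, H, W] latent grid (e.g. C=16 from the VAE)
    into a [B, (H/patch)*(W/patch), C*patch*patch] sequence of continuous tokens."""
    b, c, h, w = latents.shape
    assert h % patch == 0 and w % patch == 0, "latent grid must be divisible by patch size"
    x = latents.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)                 # [B, H/p, W/p, C, p, p]
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)

# Example: a 16-channel 32x32 latent becomes a 256-token sequence of 64-dim vectors.
tokens = latents_to_token_sequence(torch.randn(1, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 64])
```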
3. Quantitative Performance and Benchmarks
NextStep-1 achieves state-of-the-art AR model performance on a suite of established and recently proposed text-to-image and compositional understanding benchmarks:
| Benchmark | Score | Score (Self-CoT) |
|---|---|---|
| GenEval | 0.63 | 0.73 |
| GenAI-Bench (basic) | 0.88 | 0.90 |
| GenAI-Bench (advanced) | 0.67 | 0.74 |
| DPG-Bench | 85.28 | – |
| WISE (direct) | 0.54 | 0.67 |
| WISE (rewrite) | 0.79 | – |
| OneIG-Bench | higher than AR peers (precise value not reported) | – |
Performance is consistently better than prior AR models such as Emu3 and Janus-Pro, and competitive with diffusion models. The adoption of Self-Chain-of-Thought (Self-CoT) prompting further boosts scores, particularly for compositional and complex semantic queries. Each of these benchmarks tests not only image fidelity but also text alignment, compositionality, and semantic understanding.
4. High-Quality Image Editing Extension
NextStep-1 can be extended to image editing (NextStep-1-Edit) by conditioning generation on additional image editing instructions expressed in natural language. The unified AR framework enables this extension without structural changes:
- Editing instructions are prepended or interleaved as text tokens in the sequence.
- The model conditions the synthesis of new visual tokens on both the base image latents and the editing prompt (a rough sketch of this conditioning follows this list).
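One way such an editing prefix could be assembled is sketched below; `embed_text` and `encode_image` are hypothetical helpers standing in for the text embedder and the VAE-plus-space-to-depth tokenizer, and the ordering of instruction and source-image tokens is an assumption rather than the released interface.

```python
import torch

def build_edit_sequence(embed_text, encode_image, instruction: str,
                        source_image: torch.Tensor) -> torch.Tensor:
    """Assemble one AR input prefix: the editing instruction as text embeddings,
    followed by the source image's continuous latent tokens. The model then
    autoregressively generates the edited image's latent tokens after this prefix."""
    text_embeds = embed_text(instruction)            # [1, T_text, D]
    image_tokens = encode_image(source_image)        # [1, T_img, D]
    return torch.cat([text_embeds, image_tokens], dim=1)
```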
Empirical evaluation demonstrates strong results:
- GEdit-Bench (English): 6.58
- ImgEdit-Bench: 3.71
Qualitative outputs feature realistic and precise edits directed by user text, facilitated by the continuous flow matching head, which allows for smooth latent-space transformations at the patch level.
5. Comparison with Previous Methods
Prevailing AR approaches typically use either vector quantization (VQ) to discretize image features, introducing quantization loss, or apply diffusion models for continuous image prediction, entailing high computational cost. NextStep-1 diverges by:
- Avoiding discrete quantization and associated loss via direct continuous modeling.
- Bypassing full diffusion sampling by employing a relatively lightweight flow matching head (157M parameters) rather than a full diffusion UNet.
- Achieving competitive or superior benchmark results with a lower-complexity visual head while retaining rigorous sequence-level AR training.
This approach demonstrates that AR models, when scaled and paired with appropriately designed visual heads, can match or exceed the visual fidelity and compositional capabilities of more resource-intensive generative techniques.
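To make the "bypassing full diffusion sampling" point above concrete, the sketch below generates one continuous image token by integrating the head's predicted velocity field with a few Euler steps. The step count, the time convention (noise at t=0, data at t=1), and the function names are assumptions, not the released sampler.

```python
import torch

@torch.no_grad()
def sample_image_token(visual_head, h, latent_dim: int, num_steps: int = 20):
    """Generate one continuous image token by integrating the predicted velocity
    field from noise (t=0) toward data (t=1) with fixed-step Euler updates."""
    x = torch.randn(h.size(0), latent_dim, device=h.device)   # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((h.size(0), 1), i * dt, device=h.device)
        v = visual_head(h, x, t)                               # predicted velocity at (x, t)
        x = x + dt * v                                         # Euler step toward the clean latent
    return x                                                   # decoded to pixels later by the VAE
```

Because the velocity network is a small per-token MLP rather than a full image-level diffusion UNet, each of these integration steps is comparatively cheap.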
6. Open Source Release and Impact
To accelerate community research and reproducibility, the authors commit to releasing both code and trained model weights, including full recipes for the multimodal AR training setup, integration details for text and visual tokenization, and the high-capacity Transformer and flow matching head. The open-sourcing effort is positioned to lower the barrier to entry for research on sequence modeling for text-to-image generation and image editing, and is expected to facilitate development, benchmarking, and rapid iteration on unified generative multimodal architectures at scale.
7. Significance and Future Directions
By establishing a scalable, AR-based text-to-image generation paradigm with continuous tokens and demonstrating its comparable or superior efficacy to diffusion and VQ-based models, NextStep-1 introduces a powerful framework for unified text-image modeling and editing. Its flow matching mechanism also exemplifies how deterministic, autoregressive training can accomplish high-dimensional, continuous generation without the computational burden of diffusion processes.
Potential future work may include:
- Further scaling of AR backbones and flow matching heads for even higher visual fidelity or longer-range compositional modeling.
- Systematic exploration of cross-modal transfer and inpainting.
- Investigation of more sophisticated tokenization strategies and robustness to out-of-distribution prompts or edits.
In sum, NextStep-1 demonstrates that the autoregressive paradigm, freed from reliance on quantization or diffusion, is viable and effective for large-scale multimodal synthesis, opening new directions for efficient, unified generative modeling in vision-language research (Team et al., 14 Aug 2025).