NextFlow Model: Unified Multimodal AR Transformer
- NextFlow is a unified decoder-only autoregressive transformer that employs hierarchical dual-codebook tokenization for efficient multimodal generation.
- It uses a next-scale prediction mechanism to interleave text and image tokens, offering approximately 6× lower inference FLOPs compared to diffusion baselines.
- The model supports high-resolution image, video frame, and in-context editing, making it a robust architecture for diverse multimodal tasks.
NextFlow is a unified, decoder-only autoregressive (AR) transformer model enabling simultaneous multimodal understanding and generation across interleaved streams of text and images. Departing from conventional raster-scan tokenization for vision, it introduces a hierarchical, next-scale prediction mechanism and dual-codebook visual tokenization, resulting in state-of-the-art efficiency for large-scale multimodal tasks. The design allows native support for interleaved text, high-resolution image, in-context editing, and video frame generation, positioning NextFlow among the leading AR models for both capability and speed in multimodal modeling (Zhang et al., 5 Jan 2026).
1. Architectural Components
The NextFlow backbone is a large-scale, decoder-only transformer initialized from Qwen2.5-VL-7B; its structural parameters (number of layers, model dimension, number of attention heads with head dimension 128, and feedforward dimension) follow that 7B-scale configuration. All modalities (text and vision tokens) share a unified output vocabulary and a single prediction head, which maximizes cross-modal parameter sharing.
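For concreteness, a minimal configuration sketch is given below, assuming the backbone inherits the published Qwen2.5-VL-7B language-model hyperparameters; all values and field names are illustrative, not the paper's reported settings.

```python
from dataclasses import dataclass

@dataclass
class NextFlowConfig:
    # Illustrative values, assumed from the Qwen2.5-VL-7B initialization;
    # NextFlow's exact settings may differ.
    num_layers: int = 28             # decoder blocks
    d_model: int = 3584              # model (hidden) dimension
    num_heads: int = 28              # attention heads; head_dim = d_model // num_heads = 128
    d_ffn: int = 18944               # feedforward dimension
    text_vocab_size: int = 32_000    # ~32K subword text tokens (hypothetical exact size)
    visual_vocab_size: int = 32_768  # dual-codebook visual vocabulary (hypothetical)
    num_scales: int = 10             # number of spatial scales S per image (hypothetical)
```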
Sequences are constructed by interleaving standard subword text tokens (vocabulary size ~32K) with vision tokens at multiple spatial resolutions. Vision tokens arise from a dual-codebook tokenizer acting at S spatial scales per image, supporting hierarchical prediction. Multi-modal input order reflects the data’s flow (e.g., caption followed by multi-scale visual tokens, possibly interleaved further with text). Positional encoding uses “Multiscale 3D RoPE” (Rotary Positional Embedding): text tokens receive standard sequential rotary positions, while vision tokens receive positions computed as a function of their spatial coordinates and scale. A learnable scale-length embedding further disambiguates scale indices.
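The sketch below illustrates one way such multiscale 3D rotary positions could be assigned; the (sequence, row, column) coordinate convention, the per-scale sequence step, and the scale-length embedding dimensions are assumptions rather than the paper's exact scheme.

```python
import torch

def multiscale_3d_positions(num_text_tokens: int, scale_sizes: list[tuple[int, int]]) -> torch.Tensor:
    """Assign (sequence, row, col) rotary-position triples: text tokens advance only
    the sequence axis; vision tokens reuse their grid coordinates at each scale.
    Purely illustrative -- not the paper's exact coordinate convention."""
    positions = []
    t = 0
    for _ in range(num_text_tokens):
        positions.append((t, 0, 0))      # text: 1D position, spatial axes unused
        t += 1
    for h, w in scale_sizes:             # coarse -> fine scales
        for row in range(h):
            for col in range(w):
                positions.append((t, row, col))  # vision: spatial coords at this scale
        t += 1                           # one sequence step per scale (assumed)
    return torch.tensor(positions)       # [num_tokens, 3]

# Learnable scale-length embedding to disambiguate scale indices (assumed interface;
# 3584 is the illustrative model dimension from the config sketch above).
scale_length_embedding = torch.nn.Embedding(num_embeddings=16, embedding_dim=3584)
```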
2. Unified Visual Tokenization
Vision inputs are tokenized by a dual-codebook Vector Quantizer (VQ) architecture, following the TokenFlow approach. One codebook encodes high-level semantic content, distilled from a pretrained SigLIP model; the other encodes pixel-level detail. A joint quantization loss ensures that the representation captures both conceptual content and texture.
Multi-scale VQ generates a progressive token map: the lowest scale corresponds to a very coarse token grid, and subsequent scales double spatial granularity up to the full image resolution. Each patch feature is quantized to its nearest codebook entry, selecting a discrete index from the visual codebook.
This hierarchical representation yields per-scale factors that can be modeled and predicted autoregressively with computational efficiency, underpinning NextFlow’s next-scale prediction.
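A simplified sketch of such a multi-scale, dual-codebook quantizer is given below; the residual coarse-to-fine update and the nearest-neighbour lookup are assumptions in the spirit of TokenFlow-style tokenization, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def quantize(feat: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour lookup: feat [N, D], codebook [K, D] -> (indices [N], vectors [N, D])."""
    dists = torch.cdist(feat, codebook)          # pairwise L2 distances
    idx = dists.argmin(dim=-1)
    return idx, codebook[idx]

def multiscale_tokenize(feat_map, sem_codebook, pix_codebook, scales):
    """feat_map: [H*W, D] full-resolution features; scales: [(h, w), ...] coarse -> fine.
    Returns per-scale (semantic, pixel) token index grids. Illustrative residual scheme."""
    H = W = int(feat_map.shape[0] ** 0.5)
    residual = feat_map.T.reshape(1, -1, H, W)                     # [1, D, H, W]
    tokens = []
    for h, w in scales:
        pooled = F.adaptive_avg_pool2d(residual, (h, w))           # coarse features at this scale
        flat = pooled.flatten(2).squeeze(0).T                      # [h*w, D]
        sem_idx, sem_vec = quantize(flat, sem_codebook)            # semantic code
        pix_idx, pix_vec = quantize(flat - sem_vec, pix_codebook)  # pixel-detail code on the remainder
        tokens.append((sem_idx.view(h, w), pix_idx.view(h, w)))
        recon = (sem_vec + pix_vec).T.reshape(1, -1, h, w)
        residual = residual - F.interpolate(recon, size=(H, W), mode="bilinear")
    return tokens
```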
3. Prediction Mechanisms
Text is modeled via standard next-token AR prediction: for tokens $x_1, \dots, x_T$, the model maximizes $\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$ via a shared output head.
For images, NextFlow departs from flattening spatial tokens into a single sequence (raster-scan AR), instead adopting a next-scale prediction scheme. At each scale $s \in \{1, \dots, S\}$ with $h_s \times w_s$ positions, the token grid $r_s$ is generated conditioned on all coarser scales $r_{<s}$ and previously generated positions $r_s^{(<i)}$ within the scale:

$$p(r_1, \dots, r_S) = \prod_{s=1}^{S} \prod_{i=1}^{h_s w_s} p_\theta\!\left(r_s^{(i)} \,\middle|\, r_{<s},\, r_s^{(<i)}\right).$$
The joint training objective sums cross-entropy for both modalities:

$$\mathcal{L} = \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{image}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \;-\; \sum_{s=1}^{S} \sum_{i=1}^{h_s w_s} \log p_\theta\!\left(r_s^{(i)} \,\middle|\, r_{<s},\, r_s^{(<i)}\right).$$
This factorization supports linear-complexity autoregressive generation with respect to the number of tokens, compared to the quadratic growth characteristic of standard raster-scan AR for images. Empirical results demonstrate approximately 6× lower inference FLOPs compared to diffusion-transformer baselines such as MMDiT (Zhang et al., 5 Jan 2026).
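A minimal sketch of this joint objective is shown below, assuming the model exposes per-token logits over the shared vocabulary; the tensor shapes and helper name are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, scale_logits, scale_targets):
    """text_logits: [T, V]; text_targets: [T];
    scale_logits / scale_targets: lists with one entry per scale s,
    shaped [h_s * w_s, V] and [h_s * w_s]. Sums next-token cross-entropy
    for text with next-scale cross-entropy for vision, matching the
    factorization above (unweighted; see scale reweighting in Section 4)."""
    loss = F.cross_entropy(text_logits, text_targets)
    for logits_s, targets_s in zip(scale_logits, scale_targets):
        loss = loss + F.cross_entropy(logits_s, targets_s)
    return loss
```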
4. Training Methodology and Stability Strategies
Training proceeds in several curriculum phases: initialization and alignment on Qwen2.5-VL with 10M text-image pairs, followed by progressive pretraining at increasing image resolutions on a data corpus of approximately 6 trillion tokens. The data mixture spans pure text, text-image pairs (both text-to-image and image-to-text), image editing, and video-text data.
A scale-aware loss reweighting assigns each scale a coefficient to equalize contributions from coarse and fine scales. Self-correction is applied via stochastic codebook sampling during encoding, training the model to predict the deterministic (“top-1”) code from noisy (“top-k”) prefixes, which mitigates error accumulation and visual artifacts. Feature inputs to the transformer are limited to upsampled residual codebook embeddings, obviating memory and compute bottlenecks from full-scale feature maps.
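The sketch below combines the two heuristics as described: per-scale loss coefficients and noisy “top-k” prefix encoding with clean “top-1” targets. The inverse-token-count weighting and the uniform top-k sampling rule are assumptions, not the paper's exact formulas.

```python
import torch

def scale_weights(scale_sizes):
    """One coefficient per scale so coarse and fine scales contribute comparably.
    The inverse-token-count form is an assumption, not the paper's formula."""
    counts = torch.tensor([h * w for h, w in scale_sizes], dtype=torch.float)
    w = 1.0 / counts
    return w / w.sum()

def noisy_prefix_codes(dists: torch.Tensor, k: int = 4):
    """Self-correction: encode the prefix with a code sampled uniformly from the
    top-k nearest codebook entries, while the training target stays the top-1 code.
    dists: [N, K] distances of N patch features to K codebook entries."""
    top1 = dists.argmin(dim=-1)                              # clean ("top-1") targets
    topk_idx = dists.topk(k, dim=-1, largest=False).indices  # k nearest codes per patch
    choice = torch.randint(0, k, (dists.shape[0], 1))
    noisy = topk_idx.gather(-1, choice).squeeze(-1)          # noisy ("top-k") prefix codes
    return noisy, top1
```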
Supervised fine-tuning and continued training use curated, high-aesthetic, and conversational multimodal data, adapting the model for practical downstream use cases.
5. Prefix-Tuned Reinforcement Learning for Generation
NextFlow introduces a GRPO-style (Group Relative Policy Optimization) prefix-tuning strategy for reinforcement learning. Multiscale autoregressive generation is reframed as a Markov Decision Process: the action at step $s$ is the grid of tokens emitted at scale $s$, and the policy factorizes over spatial positions. After a group of trajectory rollouts per prompt, advantages are computed and used in a PPO-style clipped objective with scale-based loss reweighting and a KL penalty.
Only the first few coarse scales are fine-tuned; finer-scale heads are frozen, constraining high-variance RL updates to global structure. This enables reward-driven improvements (e.g., for text-image alignment) while minimizing instability.
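A hedged sketch of such a clipped, KL-regularized update restricted to the tuned coarse scales follows; the group-relative advantage normalization and the k3 KL estimator reflect common GRPO practice and are assumptions here, not the paper's exact objective.

```python
import torch

def grpo_scale_loss(logp_new, logp_old, logp_ref, rewards, lam, clip_eps=0.2, kl_coef=0.01):
    """logp_new / logp_old / logp_ref: lists (one entry per tuned coarse scale) of
    per-token log-probs [h_s * w_s] for a single rollout under the current, rollout,
    and reference policies; rewards: [G] rewards of the G rollouts in the group;
    lam: per-scale loss coefficients. Returns the negated clipped objective."""
    # Group-relative advantage: this rollout's reward standardized within its group (assumed).
    adv = (rewards[0] - rewards.mean()) / (rewards.std() + 1e-6)
    loss = 0.0
    for s, (lp_new, lp_old, lp_ref) in enumerate(zip(logp_new, logp_old, logp_ref)):
        ratio = (lp_new - lp_old).exp()
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        surrogate = torch.minimum(ratio * adv, clipped * adv).mean()
        # k3 KL estimator of KL(pi_new || pi_ref), penalizing drift from the reference policy.
        kl = (lp_ref - lp_new).exp() - (lp_ref - lp_new) - 1
        loss = loss - lam[s] * (surrogate - kl_coef * kl.mean())
    return loss
```

Finer-scale log-probs are simply excluded from the lists above, mirroring the frozen finer-scale heads.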
6. Comparative Evaluation
Empirical evaluation of NextFlow demonstrates:
- Generation quality on GenEval (text-image alignment): 0.83 (RL-finetuned 0.84), surpassing previous unified AR models and matching diffusion SOTA (~0.82).
- DPG (detailed prompt following): 86.0 (RL-tuned matches SOTA 88.3).
- WISE (world knowledge): 0.59 (RL: 0.62, matches Qwen-Image).
- PRISM-Bench (imagination/style): 74.7 (RL: 78.8, approaching HiDream SOTA).
- Image editing quality: on ImgEdit (GPT-4.1 scores), NextFlow 4.44, RL 4.49; EditCanvas 7.93, RL 8.04, competitive with EMU3.5.
- Qualitative attributes: chain-of-thought reasoning before drawing (WISE improves from 0.60 to 0.70 with reasoning prompts); robust in-context editing; and efficient high-resolution (1024×1024) image generation in 5 seconds on 8×A100 GPUs, substantially faster than both raster-scan AR and diffusion baselines (see table below).
A summary table of comparative results is provided below.
| Benchmark | NextFlow (Base) | NextFlow (RL-finetune) | Diffusion SOTA / AR SOTA |
|---|---|---|---|
| GenEval (alignment) | 0.83 | 0.84 | ~0.82 (diff.) |
| DPG (prompt following) | 86.0 | 88.3 | 88.3 (EMU3.5) |
| PRISM-Bench (imagination) | 74.7 | 78.8 | 75.9 (HiDream) |
| ImgEdit (editing, GPT-4.1) | 4.44 | 4.49 | 4.41 (EMU3.5) |
| EditCanvas (editing, average) | 7.93 | 8.04 | 8.37 (EMU3.5) |
| 1024×1024 speed (s) | 5 | 5 | 300–600 (AR/diffusion) |
7. Design Implications and Significance
The NextFlow model demonstrates that a single, unified decoder-only AR transformer with hierarchical, dual-codebook visual tokenization is sufficient to achieve both state-of-the-art visual quality and highly efficient inference in multimodal domains. The next-scale prediction mechanism—combined with scale-aware training heuristics and prefix-tuned RL—resolves the scalability issues endemic to prior AR baselines and enables direct interleaved generation at high resolutions and across modalities.
Theoretical analysis and benchmarks indicate a computational advantage (approximately six-fold in inference FLOPs) over raster-scan AR models, with practical speed-ups to match. NextFlow also evidences native support for interleaved modality generation (including video and caption streams), in-context editing, and compositional reasoning. These properties establish the model as a reference architecture for unified multimodal generative modeling, bridging the prior gap between unified AR models and diffusion-based visual SOTA (Zhang et al., 5 Jan 2026).