
NextFlow Model: Unified Multimodal AR Transformer

Updated 7 January 2026
  • NextFlow is a unified decoder-only autoregressive transformer that employs hierarchical dual-codebook tokenization for efficient multimodal generation.
  • It uses a next-scale prediction mechanism to interleave text and image tokens, offering approximately 6× lower inference FLOPs compared to diffusion baselines.
  • The model supports high-resolution image, video frame, and in-context editing, making it a robust architecture for diverse multimodal tasks.

NextFlow is a unified, decoder-only autoregressive (AR) transformer model enabling simultaneous multimodal understanding and generation across interleaved streams of text and images. Departing from conventional raster-scan tokenization for vision, it introduces a hierarchical, next-scale prediction mechanism and dual-codebook visual tokenization, resulting in state-of-the-art efficiency for large-scale multimodal tasks. The design allows native support for interleaved text, high-resolution image, in-context editing, and video frame generation, positioning NextFlow among the leading AR models for both capability and speed in multimodal modeling (Zhang et al., 5 Jan 2026).

1. Architectural Components

The NextFlow backbone is a large-scale, decoder-only transformer, initialized from Qwen2.5-VL-7B. The structural parameters are as follows: $L=32$ layers, model dimension $d=4096$, $H=32$ attention heads (head dimension 128), and a feedforward dimension $d_\mathrm{ff} \approx 16{,}384$. All modalities (text and vision tokens) share a unified output vocabulary and a single prediction head, which maximizes cross-modal parameter sharing.
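
For reference, the reported hyperparameters can be collected into a configuration sketch; the field names are illustrative and the visual-codebook size is a placeholder assumption, not a value from the paper.

```python
from dataclasses import dataclass

@dataclass
class NextFlowConfig:
    # Backbone hyperparameters as reported (initialized from Qwen2.5-VL-7B).
    num_layers: int = 32            # L
    d_model: int = 4096             # model dimension d
    num_heads: int = 32             # H; head dimension = d_model // num_heads = 128
    d_ff: int = 16_384              # feedforward dimension (approximate)
    text_vocab_size: int = 32_000   # ~32K subword tokens (approximate)
    # Vision tokens share the same output head; the visual codebook size is
    # not specified here, so this value is an assumption for illustration.
    visual_vocab_size: int = 32_768

cfg = NextFlowConfig()
assert cfg.d_model // cfg.num_heads == 128  # head-dimension sanity check
```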

Sequences are constructed by interleaving standard subword text tokens (vocabulary size ~32K) with vision tokens at multiple spatial resolutions. Vision tokens arise from a dual-codebook tokenizer acting at $S$ spatial scales per image, supporting hierarchical prediction. Multimodal input order reflects the data’s flow (e.g., caption followed by multi-scale visual tokens, possibly interleaved further with text). Positional encoding uses “Multiscale 3D RoPE” (Rotary Positional Embedding), yielding $p_\text{text}(t) = (t, t, t)$ for text and $p_\text{vis}(i, j, s)$ (a function of spatial coordinates and scale) for vision tokens. A learnable scale-length embedding further disambiguates scale indices.
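
A minimal sketch of how position tuples might be assigned under this scheme; the exact rotary implementation, coordinate convention, and scale-length embedding are not reproduced, and the function names are illustrative.

```python
from typing import List, Tuple

def text_position(t: int) -> Tuple[int, int, int]:
    # Text tokens reuse the same index on all three RoPE axes: p_text(t) = (t, t, t).
    return (t, t, t)

def vision_positions(height: int, width: int, scale: int) -> List[Tuple[int, int, int]]:
    # Vision tokens at a given scale are indexed by (row, column, scale);
    # this (i, j, s) layout is an assumption for illustration only.
    return [(i, j, scale) for i in range(height) for j in range(width)]

# Example: a 5-token caption followed by the 1x1 and 2x2 visual scales.
positions = [text_position(t) for t in range(5)]
positions += vision_positions(1, 1, scale=0)
positions += vision_positions(2, 2, scale=1)
```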

2. Unified Visual Tokenization

Vision inputs are tokenized by a dual-codebook Vector Quantizer (VQ) architecture, following the TokenFlow approach. One codebook encodes high-level semantic content, distilled from a pretrained SigLIP model; the other encodes pixel-level detail. A joint quantization loss $L_q = D_\text{semantic} + \lambda D_\text{pixel}$ ensures that the representation captures both conceptual content and texture.

Multi-scale VQ generates a progressive token map: the lowest scale corresponds to a $1\times1$ token, and subsequent scales double spatial granularity up to full image resolution (e.g., $64\times64$ tokens for $1024\times1024$ images). Each patch feature $f_s(i,j)$ is quantized to $z_s(i,j) = \arg\min_k \| f_s(i,j) - E(k)\|^2 + \text{semantic penalty}(k)$, selecting a discrete index from the visual codebook $V_V$.
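
A simplified sketch of this per-patch quantization step, assuming a plain Euclidean nearest-neighbour lookup with an additive per-code semantic penalty; the actual TokenFlow-style dual-codebook training is more involved.

```python
import torch

def quantize_patches(features: torch.Tensor,
                     codebook: torch.Tensor,
                     semantic_penalty: torch.Tensor) -> torch.Tensor:
    """
    features:         (N, d)  patch features f_s(i, j), flattened over positions
    codebook:         (K, d)  visual codebook entries E(k)
    semantic_penalty: (K,)    per-code penalty encouraging semantic consistency
    Returns the selected index z_s(i, j) in [0, K) for each patch.
    """
    # Squared Euclidean distance to every codebook entry: ||f - E(k)||^2
    dist = torch.cdist(features, codebook).pow(2)        # (N, K)
    # Add the semantic penalty and pick the argmin over codes.
    return torch.argmin(dist + semantic_penalty, dim=-1)

# Toy usage with random tensors (shapes only; not trained codebooks).
idx = quantize_patches(torch.randn(16, 8), torch.randn(512, 8), torch.zeros(512))
```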

This hierarchical representation yields a factorized token sequence that can be modeled and predicted autoregressively with computational efficiency, underpinning NextFlow’s next-scale prediction.

3. Prediction Mechanisms

Text is modeled via standard next-token AR prediction: for tokens $x_1, \ldots, x_T$, the model maximizes $P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$ via a shared output head.

For images, NextFlow departs from flattening spatial tokens into a single sequence (raster-scan AR), instead adopting a next-scale prediction scheme. At each scale $\ell$ with $N_\ell = H_\ell W_\ell$ positions, tokens $s_{\ell,1}, \ldots, s_{\ell,N_\ell}$ are generated conditioned on all coarser scales $s_{<\ell,\cdot}$ and previous positions $s_{\ell,<j}$ by

$$P\big(\{s_{\ell,j}\}_{\ell=1\ldots L,\; j=1\ldots N_\ell}\big) = \prod_{\ell=1}^{L} \prod_{j=1}^{N_\ell} P\big(s_{\ell,j} \mid s_{<\ell,\cdot},\, s_{\ell,<j}\big)$$

The joint training objective sums cross-entropy for both modalities:

$$\mathcal{L} = -\,\mathbb{E}_{\text{data}}\left[ \sum_{t=1}^{T} \log P(x_t \mid x_{<t}) + \sum_{\ell=1}^{L} \sum_{j=1}^{N_\ell} \log P(s_{\ell,j} \mid \text{prefix}) \right]$$

This factorization supports linear-complexity autoregressive generation with respect to the number of tokens, compared to the quadratic growth characteristic of standard raster-scan AR for images. Empirical results demonstrate approximately $6\times$ lower inference FLOPs compared to diffusion-transformer baselines such as MMDiT (Zhang et al., 5 Jan 2026).
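
Because text and visual tokens share one output head and one vocabulary, the joint objective reduces to ordinary cross-entropy over a single interleaved sequence. A minimal sketch under that reading (tensor layout and the interleaving order are assumptions):

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """
    logits:  (T, V) predictions from the single shared output head, one per position
    targets: (T,)   targets over the unified text + visual vocabulary, laid out in
                    the interleaved order (text tokens, then scale 1, scale 2, ...)
    """
    # Both sums in the objective collapse into one cross-entropy over the sequence.
    return F.cross_entropy(logits, targets)

# Toy usage: 10 interleaved positions over a unified vocabulary of 100 entries.
loss = unified_ar_loss(torch.randn(10, 100), torch.randint(0, 100, (10,)))
```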

4. Training Methodology and Stability Strategies

Training proceeds in several curriculum phases: initialization and alignment on Qwen2.5-VL with 10M text-image pairs, followed by progressive pretraining at resolutions $256^2 \rightarrow 512^2 \rightarrow 1024^2$ pixels using a data corpus of approximately 6 trillion tokens. The data mixture spans pure text ($\sim$700M), text-image pairs (T2I $\sim$1.9B, I2T $\sim$0.5B), image editing ($\sim$20M), and video-text data ($\sim$150M).

A scale-aware loss reweighting assigns each scale $\ell$ a coefficient $k_\ell = 1/(H_\ell W_\ell)^\alpha$ with $\alpha = 0.9$, to equalize contributions from coarse and fine scales. Self-correction is applied via stochastic codebook sampling during encoding, training the model to predict the deterministic (“top-1”) code from noisy (“top-k”) prefixes, which mitigates error accumulation and visual artifacts. Feature inputs to the transformer are limited to upsampled residual codebook embeddings, obviating memory and compute bottlenecks from full-scale feature maps.
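
A sketch of the reweighting coefficient, assuming square token grids per scale; the pyramid sizes used in the example are illustrative.

```python
def scale_weight(h: int, w: int, alpha: float = 0.9) -> float:
    # k_l = 1 / (H_l * W_l)^alpha: finer scales contribute many more tokens,
    # so their per-token weight is scaled down to balance against coarse scales.
    return 1.0 / float(h * w) ** alpha

# Example: weights for a 1x1 -> 2x2 -> ... -> 64x64 scale pyramid.
weights = {s: scale_weight(2 ** s, 2 ** s) for s in range(7)}
```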

Supervised fine-tuning and continued training use curated, high-aesthetic, and conversational multimodal data, adapting the model for practical downstream use cases.

5. Prefix-Tuned Reinforcement Learning for Generation

NextFlow introduces a GRPO-style (Group Relative Policy Optimization) prefix-tuning strategy for reinforcement learning. Multiscale autoregressive generation is reframed as a Markov Decision Process: action $a_t$ is the grid of tokens at scale $t$, and the policy factorizes over spatial positions. After $G$ trajectory rollouts per prompt, advantages $A_i$ are computed and used in a PPO-style clipped objective with scale-based loss reweighting and a KL penalty:

$$L_\mathrm{GRPO}(\theta) = \mathbb{E}_{c,\{s^i\}}\left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{m} k_t \cdot \min\!\big(r^i_t(\theta)\, A_i,\ \operatorname{clip}(r^i_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) - \beta\, D_\mathrm{KL}\big(\pi_\theta(\cdot) \,\Vert\, \pi_\mathrm{ref}(\cdot)\big) \right]$$

Here, only the first $m$ coarse scales (e.g., $m=8$) are fine-tuned; finer-scale heads are frozen, constraining high-variance RL updates to global structure. This enables reward-driven improvements (e.g., in text-image alignment) while minimizing instability.
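
A compact sketch of this clipped, scale-reweighted objective for one group of rollouts. Group-relative advantage normalization and a simple per-token KL estimator are assumptions; the paper's exact reward model and KL estimator are not reproduced.

```python
import torch

def grpo_prefix_loss(logp_new: torch.Tensor,   # (G, m) log pi_theta per tuned scale
                     logp_old: torch.Tensor,   # (G, m) log-probs from the rollout policy
                     logp_ref: torch.Tensor,   # (G, m) log-probs from the frozen reference
                     rewards: torch.Tensor,    # (G,)   scalar reward per rollout
                     scale_w: torch.Tensor,    # (m,)   k_t, scale-based loss weights
                     eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    # Group-relative advantages: normalize rewards within the G rollouts (assumption).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)            # (G,)
    # Importance ratios r_t^i(theta) per tuned scale.
    ratio = (logp_new - logp_old).exp()                                  # (G, m)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.minimum(ratio * adv[:, None], clipped * adv[:, None])
    # Crude KL penalty toward the reference policy (sample-based estimator).
    kl = (logp_new - logp_ref).mean()
    objective = (scale_w * surrogate).sum(dim=1).mean() - beta * kl
    # Maximize the objective => minimize its negative. Only the first m coarse
    # scales are tuned; finer-scale heads stay frozen outside this function.
    return -objective

# Toy usage: G = 4 rollouts, m = 8 tuned scales.
G, m = 4, 8
loss = grpo_prefix_loss(torch.randn(G, m), torch.randn(G, m), torch.randn(G, m),
                        torch.randn(G), torch.ones(m))
```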

6. Comparative Evaluation

Empirical evaluation of NextFlow demonstrates:

  • Generation quality on GenEval (text-image alignment): 0.83 (RL-finetuned 0.84), surpassing previous unified AR models and matching diffusion SOTA (~0.82).
  • DPG (detailed prompt following): 86.0 (RL-tuned matches SOTA 88.3).
  • WISE (world knowledge): 0.59 (RL: 0.62, matches Qwen-Image).
  • PRISM-Bench (imagination/style): 74.7 (RL: 78.8, approaching HiDream SOTA).
  • Image editing quality: on ImgEdit (GPT-4.1 scores), NextFlow 4.44, RL 4.49; EditCanvas 7.93, RL 8.04, competitive with EMU3.5.
  • Qualitative attributes: chain-of-thought reasoning before drawing (WISE improves 0.60→0.70 with reasoning prompts); robust in-context editing; and efficient high-resolution (1024×1024) image generation in 5 seconds on 8×A100 GPUs, markedly faster than both raster-scan AR and diffusion baselines.

A summary table of comparative results is provided below.

Benchmark | NextFlow (Base) | NextFlow (RL-finetuned) | Diffusion SOTA / AR SOTA
GenEval (alignment) | 0.83 | 0.84 | ~0.82 (diffusion)
DPG (prompt following) | 86.0 | 88.3 | 88.3 (EMU3.5)
PRISM-Bench (imagination) | 74.7 | 78.8 | 75.9 (HiDream)
ImgEdit (editing, GPT-4.1) | 4.44 | 4.49 | 4.41 (EMU3.5)
EditCanvas (editing, average) | 7.93 | 8.04 | 8.37 (EMU3.5)
1024×1024 generation time (s) | 5 | 5 | 300–600 (AR/diffusion)

7. Design Implications and Significance

The NextFlow model demonstrates that a single, unified decoder-only AR transformer with hierarchical, dual-codebook visual tokenization is sufficient to achieve both state-of-the-art visual quality and highly efficient inference in multimodal domains. The next-scale prediction mechanism—combined with scale-aware training heuristics and prefix-tuned RL—resolves the scalability issues endemic to prior AR baselines and enables direct interleaved generation at high resolutions and across modalities.

Theoretical analysis and benchmarks indicate a computational advantage (approximately six-fold in inference FLOPs) over raster-scan AR models, with practical speed-ups to match. NextFlow also demonstrates native support for interleaved modality generation (including video and caption streams), in-context editing, and compositional reasoning. These properties establish the model as a reference architecture for unified multimodal generative modeling, bridging the prior gap between unified AR models and diffusion-based visual SOTA (Zhang et al., 5 Jan 2026).
