NextFlow Model: Unified Multimodal AR Transformer

Updated 7 January 2026
  • NextFlow is a unified decoder-only autoregressive transformer that employs hierarchical dual-codebook tokenization for efficient multimodal generation.
  • It uses a next-scale prediction mechanism to interleave text and image tokens, offering approximately 6× lower inference FLOPs compared to diffusion baselines.
  • The model supports high-resolution image generation, video frame generation, and in-context editing, making it a robust architecture for diverse multimodal tasks.

NextFlow is a unified, decoder-only autoregressive (AR) transformer model enabling simultaneous multimodal understanding and generation across interleaved streams of text and images. Departing from conventional raster-scan tokenization for vision, it introduces a hierarchical, next-scale prediction mechanism and dual-codebook visual tokenization, resulting in state-of-the-art efficiency for large-scale multimodal tasks. The design allows native support for interleaved text, high-resolution image, in-context editing, and video frame generation, positioning NextFlow among the leading AR models for both capability and speed in multimodal modeling (Zhang et al., 5 Jan 2026).

1. Architectural Components

The NextFlow backbone is a large-scale, decoder-only transformer, initialized from Qwen2.5-VL-7B. The structural parameters are as follows: $L=32$ layers, model dimension $d=4096$, $H=32$ attention heads (head dimension 128), and a feedforward dimension $d_\mathrm{ff} \approx 16{,}384$. All modalities (text and vision tokens) share a unified output vocabulary and a single prediction head, which maximizes cross-modal parameter sharing.
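The quoted dimensions can be checked for internal consistency with a small configuration sketch. The class and field names are illustrative assumptions, not identifiers from the NextFlow codebase, and the feedforward width is the approximate figure given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneConfig:
    """Hypothetical container for the backbone sizes quoted in the text."""
    num_layers: int = 32
    d_model: int = 4096
    num_heads: int = 32
    d_ff: int = 16_384  # approximate value, as stated in the text

    @property
    def head_dim(self) -> int:
        # Each attention head receives an equal slice of the model dimension.
        assert self.d_model % self.num_heads == 0
        return self.d_model // self.num_heads


cfg = BackboneConfig()
print(cfg.head_dim)  # 128, matching the stated head dimension
```

The feedforward width is four times the model dimension, the usual expansion ratio for transformer MLP blocks.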

Sequences are constructed by interleaving standard subword text tokens (vocabulary size ~32K) with vision tokens at multiple spatial resolutions. Vision tokens arise from a dual-codebook tokenizer acting at $S$ spatial scales per image, supporting hierarchical prediction. Multimodal input order reflects the data's flow (e.g., a caption followed by multi-scale visual tokens, possibly interleaved further with text). Positional encoding uses "Multiscale 3D RoPE" (Rotary Positional Embedding), yielding $p_\text{text}(t) = (t, t, t)$ for text and $p_\text{vis}(i,j,s)$ (a function of spatial coordinates and scale) for vision tokens. A learnable scale-length embedding further disambiguates scale indices.
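A minimal sketch of how three-axis position ids might be assigned is given here. The text rule $(t, t, t)$ comes from the description above. The vision rule is an assumption made purely for illustration, since the article does not spell out the form of $p_\text{vis}$; both function names are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]


def text_positions(start: int, length: int) -> List[Position]:
    """A text token at index t gets the same value on all three axes: (t, t, t)."""
    return [(t, t, t) for t in range(start, start + length)]


def vision_positions(side: int, scale: int) -> List[Position]:
    """ASSUMPTION: the token at row i, column j of scale `scale` gets (scale, i, j).

    The real p_vis(i, j, s) is only described as a function of spatial
    coordinates and scale, so this mapping is a placeholder.
    """
    return [(scale, i, j) for i in range(side) for j in range(side)]


print(text_positions(0, 3))    # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
print(vision_positions(2, 1))  # [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
```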

2. Unified Visual Tokenization

Vision inputs are tokenized by a dual-codebook Vector Quantizer (VQ) architecture, following the TokenFlow approach. One codebook encodes high-level semantic content, distilled from a pretrained SigLIP model; the other encodes pixel-level detail. A joint quantization loss $L_q = D_\text{semantic} + \lambda D_\text{pixel}$ ensures that the representation captures both conceptual content and texture.
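The role of $\lambda$ can be illustrated with a toy index selection. This sketch assumes that both codebooks share one index and that $D_\text{semantic}$ and $D_\text{pixel}$ are squared Euclidean distances; neither detail is confirmed by the text, and the function names are hypothetical.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dual_codebook_index(
    sem_feat: Sequence[float],
    pix_feat: Sequence[float],
    sem_codebook: List[Sequence[float]],
    pix_codebook: List[Sequence[float]],
    lam: float = 1.0,
) -> int:
    """Pick the shared index k that minimises D_semantic + lam * D_pixel (assumed rule)."""
    assert len(sem_codebook) == len(pix_codebook)
    costs = [
        sq_dist(sem_feat, s) + lam * sq_dist(pix_feat, p)
        for s, p in zip(sem_codebook, pix_codebook)
    ]
    return min(range(len(costs)), key=costs.__getitem__)


sem_cb = [[0.0, 0.0], [1.0, 1.0]]
pix_cb = [[0.0], [1.0]]
# The semantic feature is close to entry 1, the pixel feature close to entry 0.
print(dual_codebook_index([0.9, 0.9], [0.2], sem_cb, pix_cb, lam=1.0))   # 1
print(dual_codebook_index([0.9, 0.9], [0.2], sem_cb, pix_cb, lam=10.0))  # 0
```

Raising `lam` shifts the choice toward the entry that better matches pixel-level detail, which is the trade-off the joint loss is meant to balance.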

Multi-scale VQ generates a progressive token map: the lowest scale corresponds to a $1\times1$ token, and subsequent scales double spatial granularity up to full image resolution (e.g., $64\times64$ tokens for $1024\times1024$ images). Each patch feature is quantized by selecting a discrete index from the visual codebook.
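As a worked example of the doubling schedule described above, the following sketch lists the grid side at each scale and the total number of vision tokens for a $64\times64$ final grid. The helper name is illustrative.

```python
from typing import List


def scale_sides(final_side: int) -> List[int]:
    """Side lengths 1, 2, 4, ..., final_side (final_side must be a power of two)."""
    assert final_side >= 1 and final_side & (final_side - 1) == 0
    sides = []
    side = 1
    while side <= final_side:
        sides.append(side)
        side *= 2
    return sides


sides = scale_sides(64)
tokens_per_scale = [s * s for s in sides]
print(sides)                  # [1, 2, 4, 8, 16, 32, 64]
print(tokens_per_scale[-1])   # 4096 tokens at the finest scale
print(sum(tokens_per_scale))  # 5461 tokens across all seven scales
```

Under this schedule the coarser scales add 1365 tokens, roughly one third more than the finest grid alone.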

This hierarchical representation yields factors that can be autoregressively modeled and predicted with computational efficiency, underpinning NextFlow’s next-scale prediction.

3. Prediction Mechanisms

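Text is modeled via standard next-token AR prediction: for tokens $x_1, \dots, x_T$, the model maximizes $\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$ via a shared output head.

For images, NextFlow departs from flattening spatial tokens into a single sequence (raster-scan AR), instead adopting a next-scale prediction scheme. At each scale $s$ with $N_s$ positions, tokens $v_{s,i}$ are generated conditioned on all coarser scales $v_{<s}$ and previous positions $v_{s,<i}$ by

$$p_\theta(v_s \mid v_{<s}) = \prod_{i=1}^{N_s} p_\theta(v_{s,i} \mid v_{<s}, v_{s,<i})$$

The joint training objective sums cross-entropy for both modalities:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) - \sum_{s=1}^{S} \sum_{i=1}^{N_s} \log p_\theta(v_{s,i} \mid v_{<s}, v_{s,<i})$$

In each term the conditioning context also includes all preceding tokens of the interleaved sequence.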

This factorization supports linear-complexity autoregressive generation with respect to the number of tokens, compared to the quadratic growth characteristic of standard raster-scan AR for images. Empirical results demonstrate approximately $6\times$ lower inference FLOPs compared to diffusion-transformer baselines such as MMDiT (Zhang et al., 5 Jan 2026).
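The coarse-to-fine generation order described in this section can be sketched as a simple loop. The `predict` callable is a hypothetical stand-in for the transformer and the toy predictor below is arbitrary; only the ordering and conditioning structure are meant to reflect the text.

```python
from typing import Callable, List


def generate_scales(
    sides: List[int],
    predict: Callable[[List[int], int, int], int],
) -> List[List[int]]:
    """Emit token grids from coarse to fine.

    predict(context, s, i) receives every token produced so far (all coarser
    scales plus earlier positions of the current scale), the scale index s and
    the flat position i, and returns one token id.
    """
    context: List[int] = []
    grids: List[List[int]] = []
    for s, side in enumerate(sides):
        grid: List[int] = []
        for i in range(side * side):
            grid.append(predict(context + grid, s, i))
        context.extend(grid)  # finished scale becomes context for finer ones
        grids.append(grid)
    return grids


# Toy predictor: a deterministic function of context length and scale index.
toy = lambda ctx, s, i: (len(ctx) * 31 + s) % 8
grids = generate_scales([1, 2, 4], toy)
print([len(g) for g in grids])  # [1, 4, 16]
```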

4. Training Methodology and Stability Strategies

Training proceeds in several curriculum phases: initialization and alignment on Qwen2.5-VL with 10M text-image pairs, followed by progressive pretraining at increasing image resolutions using a data corpus of approximately 6 trillion tokens. The data mixture spans pure text, text-image pairs (both T2I and I2T), image editing, and video-text data.

A scale-aware loss reweighting assigns each scale $s$ its own coefficient, chosen to equalize contributions from coarse and fine scales. Self-correction is applied via stochastic codebook sampling during encoding, training the model to predict the deterministic ("top-1") code from noisy ("top-k") prefixes, which mitigates error accumulation and visual artifacts. Feature inputs to the transformer are limited to upsampled residual codebook embeddings, avoiding the memory and compute bottlenecks of full-scale feature maps.
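One way to see why reweighting matters is to compare a plain token average with a per-scale average. The rule used here, a coefficient of $1/(S \cdot N_s)$ for a scale with $N_s$ tokens out of $S$ scales, is an assumption chosen for illustration; the article does not give the actual coefficients.

```python
from typing import List


def reweighted_loss(per_token_losses: List[List[float]]) -> float:
    """Average the mean loss of each scale (ASSUMED weights 1 / (S * N_s))."""
    num_scales = len(per_token_losses)
    total = 0.0
    for losses in per_token_losses:
        weight = 1.0 / (num_scales * len(losses))
        total += weight * sum(losses)
    return total


def token_mean_loss(per_token_losses: List[List[float]]) -> float:
    """Unweighted mean over all tokens; fine scales dominate by sheer count."""
    flat = [x for losses in per_token_losses for x in losses]
    return sum(flat) / len(flat)


# One coarse token with loss 2.0 and four fine tokens with loss 1.0 each.
losses = [[2.0], [1.0, 1.0, 1.0, 1.0]]
print(token_mean_loss(losses))  # 1.2
print(reweighted_loss(losses))  # 1.5
```

With equal per-scale weight, the single coarse token counts as much as the four fine tokens combined, so errors in global structure are not drowned out.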

Supervised fine-tuning and continued training use curated, high-aesthetic, and conversational multimodal data, adapting the model for practical downstream use cases.

5. Prefix-Tuned Reinforcement Learning for Generation

NextFlow introduces a GRPO-style (Group Relative Policy Optimization) prefix-tuning strategy for reinforcement learning. Multiscale autoregressive generation is reframed as a Markov Decision Process: the action $a_s$ is the grid of tokens at scale $s$, and the policy factorizes over spatial positions. After $G$ trajectory rollouts per prompt, advantages $A_g$ are computed and used in a PPO-style clipped objective with scale-based loss reweighting and KL penalty:

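$$\mathcal{J}(\theta) = \frac{1}{G} \sum_{g=1}^{G} \sum_{s=1}^{m} w_s \min\left(\rho_{g,s} A_g,\ \mathrm{clip}(\rho_{g,s}, 1-\epsilon, 1+\epsilon)\, A_g\right) - \beta\, D_\mathrm{KL}(\pi_\theta \,\|\, \pi_\mathrm{ref})$$

Here $\rho_{g,s}$ is the probability ratio between the current and rollout policies for the action at scale $s$ in rollout $g$, $A_g$ is the advantage of rollout $g$, $w_s$ is the scale weight, $\epsilon$ is the clipping range, and $\beta$ scales the KL penalty. Only the first $m$ coarse scales are fine-tuned. Finer-scale heads are frozen, constraining high-variance RL updates to global structure. This enables reward-driven improvements (e.g., for text-image alignment) while minimizing instability.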
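The ingredients named above (group-relative advantages, PPO-style clipping, scale weights, and restriction to a coarse prefix) can be sketched as follows. This follows the standard GRPO recipe of normalizing rewards within a group and omits the KL penalty for brevity; the function names, the clipping range of 0.2, and the toy inputs are assumptions rather than values from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate for one action."""
    clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped * advantage)


def prefix_objective(
    ratios: List[List[float]],   # ratios[g][s]: rollout g, scale s
    advantages: List[float],     # one advantage per rollout
    scale_weights: List[float],  # one weight per scale
    m: int,                      # number of coarse scales that are optimised
    clip: float = 0.2,
) -> float:
    """Clipped objective over the first m scales only (KL penalty omitted)."""
    total = 0.0
    for g, adv in enumerate(advantages):
        for s in range(m):
            total += scale_weights[s] * clipped_term(ratios[g][s], adv, clip)
    return total / len(advantages)


adv = group_relative_advantages([1.0, 0.0])  # approximately [1.0, -1.0]
# The second column belongs to a finer scale and is ignored when m = 1.
value = prefix_objective([[1.5, 9.0], [0.5, 9.0]], adv, [1.0, 1.0], m=1)
print(round(value, 6))  # 0.2
```

Because scales at index `m` and beyond never enter the sum, extreme ratios at fine scales (9.0 in the toy input) cannot affect the update.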

6. Comparative Evaluation

Empirical evaluation of NextFlow demonstrates:

  • Generation quality on GenEval (text-image alignment): 0.83 (RL-finetuned 0.84), surpassing previous unified AR models and matching diffusion SOTA (~0.82).
  • DPG (detailed prompt following): 86.0 (RL-tuned matches SOTA 88.3).
  • WISE (world knowledge): 0.59 (RL: 0.62, matches Qwen-Image).
  • PRISM-Bench (imagination/style): 74.7 (RL: 78.8, approaching HiDream SOTA).
  • Image editing quality: on ImgEdit (GPT-4.1 scores), NextFlow 4.44, RL 4.49; EditCanvas 7.93, RL 8.04, competitive with EMU3.5.
  • Qualitative attributes: chain-of-thought reasoning before drawing (WISE improves 0.60→0.70 with reasoning prompts); robust in-context editing; and efficient high-resolution (1024×1024) image generation in 5 seconds on 8×A100 GPUs (compared with 300–600 seconds for raster-scan AR and diffusion baselines).

A summary table of comparative results is provided below.

| Benchmark | NextFlow (Base) | NextFlow (RL-finetuned) | Diffusion SOTA / AR SOTA |
|---|---|---|---|
| GenEval (alignment) | 0.83 | 0.84 | ~0.82 (diffusion) |
| DPG (prompt following) | 86.0 | 88.3 | 88.3 (EMU3.5) |
| PRISM-Bench (imagination) | 74.7 | 78.8 | 75.9 (HiDream) |
| ImgEdit (editing, GPT-4.1 score) | 4.44 | 4.49 | 4.41 (EMU3.5) |
| EditCanvas (editing, average) | 7.93 | 8.04 | 8.37 (EMU3.5) |
| 1024×1024 generation time (s) | 5 | 5 | 300–600 (AR/diffusion) |

7. Design Implications and Significance

The NextFlow model demonstrates that a single, unified decoder-only AR transformer with hierarchical, dual-codebook visual tokenization is sufficient to achieve both state-of-the-art visual quality and highly efficient inference in multimodal domains. The next-scale prediction mechanism—combined with scale-aware training heuristics and prefix-tuned RL—resolves the scalability issues endemic to prior AR baselines and enables direct interleaved generation at high resolutions and across modalities.

Theoretical analysis and benchmarks indicate a computational advantage (approximately six-fold in inference FLOPs) over diffusion-transformer baselines such as MMDiT, with practical speed-ups over both diffusion and raster-scan AR models. NextFlow also natively supports interleaved modality generation (including video and caption streams), in-context editing, and compositional reasoning. These properties establish the model as a reference architecture for unified multimodal generative modeling, bridging the prior gap between unified AR models and diffusion-based visual SOTA (Zhang et al., 5 Jan 2026).

References (1)
