NSP: Hierarchical Generative Modeling
- Next-Scale Prediction (NSP) is a hierarchical generative modeling method that decomposes structured data into progressive scales, enabling efficient coarse-to-fine prediction.
- It uses multi-scale tokenization and autoregressive generation to model dependencies in images, audio, graphs, and 3D structures through parallel token prediction.
- NSP reduces computational complexity and boosts fidelity, often outperforming traditional next-token and diffusion approaches in efficiency and quality.
Next-Scale Prediction (NSP) is a hierarchical generative modeling paradigm that factorizes the synthesis of structured data—such as images, audio, graphs, 3D scenes, and point clouds—across a sequence of coarse-to-fine scales, rather than via the conventional next-token or next-pixel paradigm. NSP has emerged as a theoretically principled and practically efficient alternative for transforming autoregressive (AR) models into scalable, high-fidelity generators, frequently surpassing both diffusion models and traditional AR techniques across diverse domains (Tian et al., 3 Apr 2024, Qiu et al., 16 Aug 2024, Belkadi et al., 30 Mar 2025, Gailhard et al., 2 Jun 2025, Jin et al., 4 Sep 2025, Kong et al., 1 Oct 2025, Meng et al., 7 Oct 2025, Li et al., 24 Nov 2025, Zhang et al., 28 Nov 2025, Zhou et al., 6 Dec 2025, Li et al., 15 Dec 2025, Shan et al., 24 Dec 2025, Chen et al., 28 Feb 2025).
1. Mathematical Foundations and General Formulation
At the core of NSP is an autoregressive factorization over multiple scales of the data rather than over atomic elements. Consider a structured object (e.g., an image, 3D point cloud, or graph) encoded by a multi-scale decomposition into progressively finer representations $(r_1, r_2, \dots, r_K)$, where $r_k$ denotes the discrete tokens or features at scale $k$, with $r_k \in [V]^{h_k \times w_k}$ for images. NSP posits the following joint distribution factorization:

$$p(r_1, r_2, \dots, r_K) \;=\; \prod_{k=1}^{K} p\left(r_k \mid r_1, \dots, r_{k-1}\right),$$

or equivalently for images,

$$p(r_1, \dots, r_K) \;=\; \prod_{k=1}^{K} p\left(r_k \in [V]^{h_k \times w_k} \mid r_{<k}\right),$$

where each $r_{k-1}$ is a lower-resolution or coarser representation of the data than $r_k$ (Tian et al., 3 Apr 2024, Belkadi et al., 30 Mar 2025, Kong et al., 1 Oct 2025). Sampling and training occur in a scale-wise fashion: at each step, the model predicts all tokens of the next scale in parallel, conditioned on all previously predicted coarser scales (Tian et al., 3 Apr 2024).
This formulation generalizes to audio (via scale-level acoustic tokenization (Qiu et al., 16 Aug 2024)), graphs (via hierarchical latent maps (Belkadi et al., 30 Mar 2025, Gailhard et al., 2 Jun 2025)), 3D occupancy grids and point clouds (via level-of-detail coarse-to-fine hierarchies (Meng et al., 7 Oct 2025, Jin et al., 4 Sep 2025)), and more, with the only essential requirement being an invertible multi-scale encoding and decoding framework (typically based on VQ-VAE or its variants).
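To make the scale-wise factorization above concrete, the following minimal Python sketch shows the generic sampling loop: one model call per scale, with all tokens of that scale drawn in parallel. The `predict_logits` and `decode` callables are hypothetical placeholders standing in for any NSP model and multi-scale decoder, not a specific paper's API.

```python
import torch

def sample_scalewise(predict_logits, decode, scales, vocab_size=16):
    """Generic coarse-to-fine NSP sampling: one sequential pass per scale,
    with every token of the current scale sampled in parallel."""
    history = []                                   # token maps r_1, ..., r_{k-1}
    for h, w in scales:                            # e.g. [(1, 1), (2, 2), (4, 4)]
        logits = predict_logits(history, (h, w))   # shape (h * w, vocab_size)
        probs = torch.softmax(logits, dim=-1)
        r_k = torch.multinomial(probs, 1).view(h, w)
        history.append(r_k)                        # later scales condition on r_k
    return decode(history)                         # multi-scale decoder -> data

# Toy usage with stand-in components (uniform logits, identity "decoder"):
scales = [(1, 1), (2, 2), (4, 4), (8, 8)]
uniform = lambda hist, hw: torch.zeros(hw[0] * hw[1], 16)
sample = sample_scalewise(uniform, lambda token_maps: token_maps, scales)
```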
2. Algorithmic Structures and Model Architectures
The NSP paradigm is typically instantiated as a two-stage generative pipeline:
- Multi-scale Tokenization: Data are encoded by multi-resolution quantization, such as VQ-VAE (Tian et al., 3 Apr 2024), progressive residual quantization (Qiu et al., 16 Aug 2024), or graph- and point-cloud-specific equivariant encoders (Belkadi et al., 30 Mar 2025, Meng et al., 7 Oct 2025). For images, each scale's latent token map is obtained by downsampling and quantization; for graphs, hierarchical coarsening and quantization establish the scale pyramid (Gailhard et al., 2 Jun 2025). A minimal tokenization sketch follows this list.
- Autoregressive Generation: A GPT-style decoder-only Transformer or specialized AR architecture autoregressively predicts the token-map at each scale, conditioning on all previous coarser scales. Within each scale, tokens are predicted in parallel using dense intra-scale attention blocks; inter-scale dependencies are enforced through causal masking (Tian et al., 3 Apr 2024, Belkadi et al., 30 Mar 2025).
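As a rough illustration of the first stage, the sketch below builds a scale pyramid by repeatedly downsampling a latent, quantizing it against a codebook, and subtracting the upsampled quantized content before moving to the next, finer scale. The codebook, scale schedule, and interpolation modes are illustrative assumptions rather than any particular tokenizer's recipe.

```python
import torch
import torch.nn.functional as F

def multiscale_residual_quantize(latent, codebook, scales):
    """Schematic multi-scale tokenization: per scale, downsample the residual
    latent, snap it to the nearest codebook entries, then remove the upsampled
    quantized content so finer scales only encode the remaining detail."""
    B, C, H, W = latent.shape
    residual, token_maps = latent, []
    for h, w in scales:
        z = F.interpolate(residual, size=(h, w), mode="area")       # coarsen
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)                 # (B*h*w, C)
        idx = torch.cdist(flat, codebook).argmin(dim=-1)            # nearest code
        token_maps.append(idx.view(B, h, w))
        zq = codebook[idx].view(B, h, w, C).permute(0, 3, 1, 2)
        residual = residual - F.interpolate(
            zq, size=(H, W), mode="bilinear", align_corners=False)  # keep detail
    return token_maps

# Toy usage: random 16x16 latent, 64-entry codebook, four-level pyramid.
latent = torch.randn(1, 8, 16, 16)
codebook = torch.randn(64, 8)
tokens = multiscale_residual_quantize(latent, codebook,
                                      [(1, 1), (2, 2), (4, 4), (16, 16)])
```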
Architectural specifics may include:
- Scale embeddings and explicit positional encodings per scale
- Blockwise attention masks to allow dense intra-scale but strictly causal inter-scale communication (a mask sketch follows this list)
- Adaptive layer normalization or class-conditional adapters
- For Markovian or efficiency-focused variants (e.g., Markov-VAR), compressed history windows replace full-context dependencies, further reducing computational cost (Zhang et al., 28 Nov 2025)
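A minimal sketch of the blockwise mask mentioned above, assuming tokens are laid out scale by scale: attention is dense within a scale and strictly causal across scales.

```python
import torch

def blockwise_scale_mask(scale_sizes):
    """Boolean attention mask for NSP: position i may attend to position j
    iff j lies in the same scale as i or in any earlier (coarser) scale."""
    scale_id = torch.cat(
        [torch.full((n,), k) for k, n in enumerate(scale_sizes)])
    return scale_id.unsqueeze(1) >= scale_id.unsqueeze(0)   # True = attend

# Example: scales with 1, 4, and 16 tokens yield a 21x21 block-triangular mask.
mask = blockwise_scale_mask([1, 4, 16])
```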
For audio, the Scale-level Audio Tokenizer (SAT) with residual quantization enables the AR generator to operate on much shorter token sequences, realizing substantially faster inference and an improvement of 1.33 in Fréchet Audio Distance (FAD) over token-sequential AR baselines on AudioSet (Qiu et al., 16 Aug 2024).
3. Applications Across Data Modalities
The NSP paradigm underlies several state-of-the-art frameworks:
- Images: VAR (Visual AutoRegressive modeling) achieves FID = 1.80 and IS = 356.4 on ImageNet 256×256 with roughly 20× faster inference than diffusion models; NSP unlocks scalable AR for image in-/out-painting and class-conditional editing, and exhibits LLM-like power-law scaling and zero-shot generalization (Tian et al., 3 Apr 2024).
- Audio: AAR (Acoustic AutoRegressive) modeling with NSP achieves a substantial inference speedup and improved FAD on AudioSet, demonstrating scalability toward LLM-integrated AR audio systems (Qiu et al., 16 Aug 2024).
- Graphs and Hypergraphs: MAG employs NSP for permutation-invariant, diffusion-free generation whose complexity scales with the number of hierarchy levels rather than with the number of nodes as in node-sequential AR, yielding substantially faster inference and competitive metrics on generic and molecular graphs (Belkadi et al., 30 Mar 2025). FAHNES generalizes NSP to feature-attributed hypergraphs, leveraging coarsening and expansion with budget mechanisms for controlled graph expansion (Gailhard et al., 2 Jun 2025).
- Point Clouds: PointNSP demonstrates AR generation that preserves permutation invariance and global structure by coarse-to-fine scale modeling, beating diffusion-based baselines on ShapeNet for both fidelity and efficiency (e.g., sampling time 3.5 s vs. 23–31 s for diffusion) (Meng et al., 7 Oct 2025).
- 3D Occupancy Forecasting: OccTENS employs a temporal NSP (TENS) factorization to achieve accurate, controllable 4D occupancy grid and ego-motion prediction, outperforming prior AR and diffusion models in accuracy and inference efficiency (Jin et al., 4 Sep 2025).
- Image Super-Resolution: NSARM couples NSP with bitwise tokenization to surpass diffusion and direct AR in real-world image super-resolution benchmarks, achieving both high perceptual quality and robust generalization with 1.2 s inference for a 1024×1024 image (Kong et al., 1 Oct 2025).
- Self-Supervised Image Denoising: NSP decouples noise decorrelation from detail restoration, establishing new state-of-the-art on SIDD and DND, and enabling super-resolution for noisy images without retraining (Shan et al., 24 Dec 2025).
- Medical Segmentation: AR-Seg utilizes next-scale mask prediction to model explicit inter-scale dependencies, outperforming diffusion and pixel-sequential AR baselines in probabilistic and deterministic segmentation (Chen et al., 28 Feb 2025).
- Video, Unified Causal Prediction: Next Scene Prediction models combine flow-matched latent diffusion with AR, joint multimodal backbones, and reinforcement learning to achieve strong causal consistency and temporal reasoning in video synthesis (Li et al., 15 Dec 2025).
4. Efficiency Advantages and Empirical Performance
The hierarchical, scale-wise factorization leads to orders-of-magnitude improvements in modeling and sampling efficiency:
- Computational Cost: NSP replaces per-token (pixel-wise AR) or per-node (graph AR) sequential decoding with one parallel pass per scale, so the number of sequential generation steps drops from the number of atomic elements to the number of scales (Tian et al., 3 Apr 2024, Belkadi et al., 30 Mar 2025, Meng et al., 7 Oct 2025); a back-of-the-envelope step count follows this list.
- Empirical Results: On ImageNet 256×256, VAR achieves FID = 1.80 using 10 AR passes, whereas diffusion transformers require 250 denoising steps to reach FID = 2.27 (Tian et al., 3 Apr 2024). In audio, AAR with NSP is substantially faster than next-token AR (Qiu et al., 16 Aug 2024).
- Scalability: Power-law scaling and predictable cross-entropy/test-error curves have been demonstrated over six orders of magnitude in compute and data (Tian et al., 3 Apr 2024). Markov-VAR further reduces memory by 83.8% compared to full-context VAR at 1024×1024 (Zhang et al., 28 Nov 2025).
- Robustness and Generalization: NSP-conditioning on early, coarsened scales regularizes generation, curbing "hallucinations" in super-resolution (Kong et al., 1 Oct 2025) and enabling robust uncertainty quantification in medical segmentation (Chen et al., 28 Feb 2025).
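The step-count argument behind the cost reduction can be made concrete with a small back-of-the-envelope script; the 10-level scale schedule below is an illustrative, VAR-style assumption rather than a quoted configuration.

```python
# Sequential model calls needed to produce a 16x16 latent token map.
# Next-token AR emits one token per pass; NSP emits one full scale per pass.
scale_schedule = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]   # assumed side lengths h_k

final_tokens = scale_schedule[-1] ** 2               # 256 tokens at the top scale
nsp_passes = len(scale_schedule)                     # 10 sequential passes
token_ar_passes = final_tokens                       # 256 sequential passes

print(f"token-wise AR: {token_ar_passes} passes, scale-wise NSP: {nsp_passes}")
```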
5. Theoretical and Methodological Insights
NSP provides key benefits over next-token approaches:
- Hierarchical Coherence: By explicitly modeling dependencies from coarse to fine, NSP endows AR models with the inductive bias of incremental structure discovery, aligning with human perceptual organization (Tian et al., 3 Apr 2024).
- Permutation Invariance: For point clouds and graphs, NSP maintains permutation equivariance, eliminating biases imposed by arbitrary token orderings (Belkadi et al., 30 Mar 2025, Meng et al., 7 Oct 2025).
- Exposure Bias and Stability: Exposure bias (train–test mismatch due to teacher forcing) is mitigated using procedures such as Stagger-Scale Rollout and Contrastive Student-Forcing Loss, maintaining high-fidelity generation without loss of throughput (Zhou et al., 6 Dec 2025).
- Resolution of Aliasing: FVAR addresses the spectral aliasing intrinsic to uniform downsampling in VAR by adopting blur-to-clarity progressive focusing and teacher-student residual distillation (Li et al., 24 Nov 2025).
A plausible implication is that the coarse-to-fine decomposition underlying NSP offers an architecture-agnostic and domain-general pattern for scaling AR modeling to high dimensions, as supported by its successes across vision, audio, graphs, and medical domains.
6. Extensions, Variants, and Open Problems
- Markovian Scale Prediction: VAR can be reformulated as a Markov process over scales, conditioning on a sliding window of compressed scale histories; this achieves quadratic rather than quartic scaling in attention cost while empirically lowering FID by 10.5% (Zhang et al., 28 Nov 2025). A minimal conditioning sketch follows this list.
- Temporal Next-Scale Prediction: OccTENS demonstrates how NSP can be extended to time-sequences for 4D generative modeling, enabling efficient and controllable motion and occupancy prediction (Jin et al., 4 Sep 2025).
- Feature-Aware Hypergraphs and Heterogeneous Data: The FAHNES framework extends NSP with budget vector tracking and cluster splitting to control granularity and feature expansion, applicable for bipartite, pure, and heterogeneous graphs (Gailhard et al., 2 Jun 2025).
- Learning Theory and Complexity: While NSP provides richer supervision in sequence learning, as in the next-symbol prediction setting for regular languages, hardness results establish that, under cryptographic assumptions, learning remains computationally intractable for several concept classes even when NSP labels contain all next-symbol continuation information (Bhattamishra et al., 21 Oct 2025).
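As a sketch of the Markovian variant mentioned above, the snippet below conditions each new scale on a sliding window of compressed recent-scale features instead of the full pyramid; the window length and mean-pool compression are illustrative assumptions, not the published recipe.

```python
import torch
import torch.nn.functional as F

def markovian_context(scale_features, window=2, pooled_size=4):
    """Markovian scale conditioning (sketch): keep only the most recent
    `window` scale feature maps and compress each to a small fixed grid, so
    context size depends on the window rather than on the whole pyramid."""
    recent = scale_features[-window:]                        # drop older scales
    pooled = [F.adaptive_avg_pool2d(f, pooled_size) for f in recent]
    return torch.cat([p.flatten(start_dim=1) for p in pooled], dim=1)

# Toy usage: feature maps for three scales (1x1, 2x2, 4x4), 8 channels each.
feats = [torch.randn(1, 8, s, s) for s in (1, 2, 4)]
context = markovian_context(feats, window=2)                 # shape (1, 256)
```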
Open questions include principled ablations of dependency range (Markov vs. full context), optimality of scale schedules, extensions to streaming or on-the-fly scale conditioning, and further generalization to non-hierarchical or mixed-modality domains.
7. Summary Table: NSP in Different Domains
| Domain/Task | NSP Instantiation | Efficiency Speedup/Key Metrics |
|---|---|---|
| Image Synthesis (VAR) | Multi-scale AR over VQ latents (Tian et al., 3 Apr 2024) | ~20× faster than diffusion, FID 1.80 on ImageNet 256×256 |
| Audio Generation (AAR) | Scale-level AR over SAT (Qiu et al., 16 Aug 2024) | Faster inference, +1.33 FAD (AudioSet) |
| Graph/Hypergraph Generation | AR over latent hierarchies (Belkadi et al., 30 Mar 2025, Gailhard et al., 2 Jun 2025) | Substantial speedup, high validity |
| 3D Point Cloud Gen (PointNSP) | LoD AR over discrete tokens (Meng et al., 7 Oct 2025) | 3.5 s sampling vs. 23–31 s for diffusion, SOTA CD/EMD |
| Medical Segmentation (AR-Seg) | Next-scale mask AR (Chen et al., 28 Feb 2025) | Outperforms diffusion/AR baselines (Dice↑, GED↓) |
| Super-Resolution (NSARM) | Bitwise NSP + transform (Kong et al., 1 Oct 2025) | 1.2 s for 1024×1024, robust quality |
| Self-supervised Denoising | Cross-scale NSP (Shan et al., 24 Dec 2025) | State-of-the-art PSNR/SSIM and built-in SR |
| Occupancy Forecasting (TENS) | Spatiotemporal NSP (Jin et al., 4 Sep 2025) | Outperforms AR/diffusion in mIoU, speed |
| Video/Scene Prediction | NSP as causal flow-matching (Li et al., 15 Dec 2025) | Stronger causal consistency vs. baselines |
NSP thus constitutes a general theory and set of methodologies for scalable coarse-to-fine generative modeling, enabling high performance, efficiency, and generalization in domains previously limited by sequential bottlenecks or permutation constraints.