NSARM: Next-Scale Autoregressive Modeling
- NSARM is an autoregressive generative framework that shifts from next-token to hierarchical, multi-scale residual prediction, enhancing image super-resolution.
- It leverages bitwise quantization and hierarchical residual decomposition with pathway-aligned initialization to preserve fine image details and boost efficiency.
- Experiments show NSARM achieves state-of-the-art fidelity, faster inference, and robustness to degradations compared with prior diffusion-based methods.
Next-Scale Autoregressive Modeling (NSARM) is an autoregressive generative framework that shifts the fundamental prediction unit from conventional next-token prediction to hierarchical, multi-resolution next-scale prediction. By leveraging bitwise quantization and hierarchical residual decomposition, NSARM enables efficient, robust, and richly-structured image synthesis, with particular impact on real-world image super-resolution (Real-ISR). NSARM combines pathway-aligned initialization, scalable autoregressive modeling, and end-to-end fine-tuning to achieve state-of-the-art fidelity, generalization, and computational efficiency.
1. Principle of Next-Scale Prediction
NSARM is based on the next-scale prediction paradigm, in which an input (e.g., a low-resolution image) is progressively refined through hierarchical residual token maps at multiple spatial scales, rather than generating each pixel or token sequentially at full resolution. This approach is formulated as the autoregressive factorization

$$p(r_1, r_2, \ldots, r_K \mid c) = \prod_{k=1}^{K} p\left(r_k \mid r_1, \ldots, r_{k-1}, c\right),$$

where $r_k$ denotes the bitwise quantized residual token map at scale $k$, $c$ is an optional conditioning signal (such as a text prompt), and $K$ is the total number of scales.
Key technical features:
- Hierarchical refinement: Generation proceeds from coarse scales (low-frequency structure) to fine scales (high-frequency details).
- Bitwise quantization: Each residual map is converted to bitwise tokens of dimensionality $d$, replacing the traditional fixed-size codebook with an "infinite-vocabulary" classifier ($d$ parallel binary classifiers, giving an implicit vocabulary of $2^d$ codes).
- Cascaded modification: The early scales in the autoregressive sequence are initialized from a pathway-aligned transformation network conditioned on the low-resolution input.
This coarse-to-fine modeling preserves structure, enables scalable computation, and allows robust conditioning on degraded inputs.
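The coarse-to-fine generation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_residual` is a hypothetical stand-in for the transformer's per-scale prediction, and `upsample` uses simple nearest-neighbor interpolation in place of whatever operator the actual model employs.

```python
import numpy as np

def upsample(feat, size):
    """Nearest-neighbor resize of an (h, w, c) map to (size, size, c) (placeholder)."""
    h, w, _ = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[rows][:, cols]

def next_scale_generate(predict_residual, scales, dim):
    """Coarse-to-fine next-scale generation: accumulate upsampled residual
    token maps predicted at progressively finer scales."""
    full = scales[-1]
    f_hat = np.zeros((full, full, dim))  # running reconstruction
    tokens = []
    for k, _ in enumerate(scales):
        r_k = predict_residual(k, f_hat)          # residual tokens at scale k
        tokens.append(r_k)
        f_hat = f_hat + upsample(r_k.astype(float), full)  # refine reconstruction
    return f_hat, tokens
```

Each scale conditions on the reconstruction accumulated so far, which is what makes the process autoregressive over scales rather than over individual tokens.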
2. Two-Stage Training Strategy
NSARM employs a two-stage optimization protocol:
Stage 1: Transformation Network Training
A lightweight transformation network $T$ is trained with a mean-squared error loss in the feature domain:

$$\mathcal{L}_{\text{trans}} = \sum_{k=1}^{K_p} \left\| T(x_{\text{LR}})_k - r_k \right\|_2^2,$$

where $K_p$ is the number of preliminary scales, $r_k$ are the ground-truth residuals, and $x_{\text{LR}}$ is the low-resolution image.
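The Stage 1 objective reduces to a sum of per-scale MSE terms, which can be sketched directly. `pred_residuals` stands in for the transformation network's outputs over the preliminary scales; this is an illustrative sketch, not the paper's training code.

```python
import numpy as np

def stage1_loss(pred_residuals, gt_residuals):
    """Feature-domain MSE summed over the K_p preliminary scales."""
    return sum(np.mean((p - g) ** 2)
               for p, g in zip(pred_residuals, gt_residuals))
```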
Stage 2: End-to-End Fine-Tuning
With preliminary scales seeded by the transformation network, the full autoregressive pipeline is fine-tuned end-to-end. The loss for scale-wise bitwise cross-entropy prediction is:

$$\mathcal{L}_{\text{AR}} = -\sum_{k=K_p+1}^{K} \sum_{n=1}^{N_k} \sum_{j=1}^{d} \left[ \tilde{r}_{k,n,j} \log p_{k,n,j} + \left(1 - \tilde{r}_{k,n,j}\right) \log \left(1 - p_{k,n,j}\right) \right],$$

where $\tilde{r}_k$ are the modified ground-truth bitwise tokens, $N_k$ is the number of tokens per scale, and $p_{k,n,j}$ is the predicted probability for bit $j$ of token $n$ at scale $k$.
In this stage, the model autoregressively predicts the remaining residual scales, leveraging pathway-aligned initialization for robust adaptation.
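The bitwise cross-entropy of Stage 2 is just binary cross-entropy applied independently to each of the $d$ bit classifiers. A minimal sketch, assuming raw logits as the model output (the averaging convention and any per-scale weighting in the actual paper may differ):

```python
import numpy as np

def bitwise_ce(logits, targets, eps=1e-9):
    """Binary cross-entropy over d parallel bit classifiers.
    logits: raw scores of shape (tokens, d); targets: {0,1} bits, same shape."""
    p = 1.0 / (1.0 + np.exp(-logits))  # per-bit sigmoid probability
    return -np.mean(targets * np.log(p + eps)
                    + (1 - targets) * np.log(1 - p + eps))
```

Treating each bit as an independent classifier is what makes the "infinite vocabulary" tractable: the output head scales linearly in $d$ rather than in the $2^d$ size of the implied codebook.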
3. Hierarchical Residual Decomposition and Bitwise Quantization
Image decomposition is performed using upsampling and downsampling operators. Given a ground-truth high-resolution image $x_{\text{HR}}$, latent features $f = E(x_{\text{HR}})$ are computed via an encoder $E$, with residual construction:

$$r_k = Q\!\left(\operatorname{Down}_k\!\left(f - \hat{f}_{k-1}\right)\right), \qquad \hat{f}_k = \hat{f}_{k-1} + \operatorname{Up}\!\left(r_k\right), \qquad \hat{f}_0 = 0,$$

where $\operatorname{Down}_k$ resizes to scale $k$ and $\operatorname{Up}$ resizes back to full resolution. Bitwise quantization $Q(\cdot)$ transforms each residual into binary tokens, each token representing a $d$-dimensional binary code. The cross-entropy-based classifier operates as $d$ parallel binary classifiers, replacing the usual discrete codebook with an "infinite vocabulary" of $2^d$ possible codes.
This decomposition and quantization ensure that information is preserved across scales and output is robust to diverse input degradations.
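The decomposition can be illustrated with a toy greedy loop. Everything here is a simplifying assumption: `resize` uses nearest-neighbor indexing in place of the model's real operators, and the quantizer is a plain sign function mapping each channel to one bit (dequantized as $\{0,1\} \to \{-1,+1\}$), so exact reconstruction in the test only holds for $\pm 1$-valued features.

```python
import numpy as np

def resize(feat, size):
    """Nearest-neighbor resize of an (h, w, c) map to (size, size, c) (placeholder)."""
    h, w, _ = feat.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feat[rows][:, cols]

def binarize(x):
    """Bitwise quantization: sign of each channel gives a d-bit code per token."""
    return (x > 0).astype(np.float64)

def decompose(f, scales):
    """Greedy multi-scale residual decomposition with bitwise quantization:
    at each scale, quantize the remaining residual and add it back upsampled."""
    full = f.shape[0]
    f_hat = np.zeros_like(f)
    codes = []
    for s in scales:
        r = resize(f - f_hat, s)                  # residual at scale s
        b = binarize(r)                           # d-dimensional binary code
        codes.append(b)
        f_hat = f_hat + resize(2 * b - 1, full)   # dequantize and accumulate
    return codes, f_hat
```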
4. Comparison with Prior Real-ISR and Diffusion Frameworks
NSARM displays distinctive advantages over both diffusion-based and classical direct super-resolution models:
- Fidelity: Unlike T2I diffusion-based models (e.g., SUPIR, SeeSR) that may introduce hallucinated textures and over-sharpened features, NSARM reconstructs authentic fine details such as facial features, hair, and edges without spurious artifacts.
- Inference Speed: Next-scale prediction combined with bitwise quantization yields much faster inference, up to 10× faster than iterative diffusion-based methods.
- Robustness: By initializing the generative pathway via transformation from the LR input, NSARM is highly robust across varying degradations and avoids catastrophic failures or abrupt score drops observed in other methods.
- Generalization: NSARM’s performance remains stable over a broad range of input qualities, minimizing failure cases even on challenging real-world samples.
5. Quantitative and Qualitative Experiments
NSARM’s efficacy is validated through extensive quantitative and qualitative experiments, including:
- Full-reference metrics: PSNR, SSIM, LPIPS for fidelity.
- No-reference perceptual metrics: NIQE, CLIPIQA, MUSIQ, MANIQA, TOPIQ.
- Robustness analysis: Sorted score distributions show gradual performance decay rather than sudden drops.
- Human studies: User studies consistently favor NSARM outputs over diffusion-based and direct methods.
Representative findings include faithful reconstruction of fine anatomical details, preservation of textures, and avoidance of both over-smoothing and artifact introduction.
6. Applications and Broader Implications
The NSARM framework is foundational for several application domains:
- Real-world deployment: Efficient, robust super-resolution for mobile, surveillance, or restoration contexts where image degradations are unknown or variable.
- Content restoration: Progressive, perceptually consistent refinement of degraded or low-resolution images.
- Low-level vision: Image enhancement, inpainting, and other related synthesis tasks that benefit from generative priors and multi-scale structure.
NSARM’s demonstration of scalable, robust next-scale autoregressive modeling opens avenues for further research in generative modeling, bridging discrete token representations with perceptual quality, and more efficient adaptation protocols for scaling to higher resolutions or more complex modalities.
In summary, Next-Scale Autoregressive Modeling (NSARM) establishes a paradigm for robust, efficient image synthesis and super-resolution by combining hierarchical next-scale prediction, pathway-aligned initialization, bitwise quantization, and scalable autoregressive learning. Evaluations consistently demonstrate NSARM’s superiority in visual fidelity, inference speed, and robustness over prior state-of-the-art methods, indicating its promise for both research and real-world deployment in low-level vision and generative modeling (Kong et al., 1 Oct 2025).