Visual Autoregressive Models
- Visual Autoregressive Models (VARs) are generative models that structure image synthesis as a coarse-to-fine, next-scale prediction process to enhance inference efficiency and quality.
- They employ block-wise parallel token prediction and transformer architectures to capture both global structure and fine visual details across multiple spatial scales.
- VARs integrate techniques like beam search, verifier-guided scoring, and diversity regularization to improve controllability, mitigate exposure bias, and optimize synthesis fidelity.
Visual Autoregressive Models (VARs) are a class of generative models for images and other visual modalities that structure the generative process as a hierarchy of discrete or continuous token predictions across multiple spatial scales. By shifting from the classic next-token (per-pixel or per-patch) autoregression to a next-scale paradigm, VARs achieve significantly improved inference efficiency, scalable high-resolution synthesis, and strong synthesis quality—surpassing both traditional autoregressive and diffusion-based methods in numerous benchmark settings. These models have advanced architectural frameworks, incorporate search and controllability capabilities, and are subject to ongoing research in diversity, safety, and acceleration. Their underlying principles and practical implementations unify ideas from hierarchical discrete latent variable decomposition, encoder-decoder modeling, and sequence-to-sequence prediction.
1. Mathematical Foundations and Core Architecture
VARs define image (or feature-map) synthesis by predicting a sequence of residual token maps $r_1, r_2, \ldots, r_K$, each corresponding to a progressively finer spatial scale $h_k \times w_k$ (e.g., from $1 \times 1$ up to the full latent resolution $h_K \times w_K$). For images encoded into a continuous feature map $f$ via a visual tokenizer (e.g., VQ-VAE), these token maps decompose the reconstruction:

$$f \approx \sum_{k=1}^{K} \mathrm{up}\big(r_k, (h_K, w_K)\big),$$

where $\mathrm{up}(\cdot)$ denotes upsampling each residual map to the final scale.
The joint distribution is factorized as a sequence of next-scale predictions:

$$p(r_1, r_2, \ldots, r_K) = \prod_{k=1}^{K} p\big(r_k \mid r_1, r_2, \ldots, r_{k-1}\big).$$
At each scale $k$, a transformer block autoregressively emits all $h_k \times w_k$ tokens in parallel, conditioned on previous scales, with contextual features provided by upsampled and downsampled representations from preceding steps. Model backbone variants include decoder-only transformer architectures (e.g., Infinity), and tokenization is frequently achieved through VQ-VAE or related quantization schemes, with effective vocabularies that can reach very large numbers of unique symbols per codebook (Riise et al., 19 Oct 2025, Li et al., 18 Dec 2025).
Continuous VAR frameworks avoid quantization altogether, directly modeling each $r_k$ in continuous feature space $\mathbb{R}^{h_k \times w_k \times d}$, and train via strictly proper scoring rules (e.g., the energy score) for accurate continuous distribution matching (Shao et al., 12 May 2025).
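The coarse-to-fine residual loop described above can be sketched as follows. This is a minimal illustration only: the transformer is replaced by a random stand-in predictor, and the scale list, feature dimension, and helper names are assumptions rather than any specific model's API.

```python
# Minimal sketch of VAR-style next-scale generation (hypothetical stub model).
import numpy as np

rng = np.random.default_rng(0)
scales = [1, 2, 4, 8]          # spatial sizes h_k = w_k of each token map
d = 4                          # feature (codebook embedding) dimension

def upsample(x, size):
    """Nearest-neighbor upsample of an (h, h, d) map to (size, size, d)."""
    factor = size // x.shape[0]
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def predict_residual(context, size):
    """Stand-in for the transformer p(r_k | r_<k); returns a residual map."""
    return rng.standard_normal((size, size, d)) * 0.1

f = np.zeros((scales[-1], scales[-1], d))   # running feature-map estimate
for h in scales:
    r_k = predict_residual(f, h)            # all h*h tokens emitted in one pass
    f = f + upsample(r_k, scales[-1])       # coarse-to-fine residual refinement

print(f.shape)  # (8, 8, 4)
```

Note that each scale contributes one parallel prediction step, so the total number of sequential steps equals the number of scales rather than the number of tokens.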
2. Next-Scale Prediction Paradigm and Acceleration
Next-scale prediction is the defining innovation of VARs, offering:
- Block-wise parallelism: At each scale $k$, all $h_k \times w_k$ tokens are predicted in a single forward pass, reducing the number of generative steps from $O(h \times w)$ (per-token AR) or 100–1000 (diffusion) to roughly $10$–$13$.
- Explicit coarse-to-fine modeling: Early scales capture semantic and structural information (global and local semantics), while later scales focus on visual detail refinement (Li et al., 18 Dec 2025).
- Total runtime scaling: Standard VARs have a worst-case complexity of $O(n^4)$ (with $n$ the image width), due to attention costs at the high-resolution scales (Guo et al., 30 Mar 2025).
- Stage-aware acceleration: StageVAR exploits the semantic irrelevance and low-rank properties of late scales, applying low-rank approximations and prompt-conditioning offloading only after the semantic and structural stages are complete, achieving up to 3.4× speedup with negligible metric drop (Li et al., 18 Dec 2025). FastVAR identifies "pivotal" tokens via high-frequency residuals and prunes the remaining tokens at late scales, further reducing memory and computation requirements (Guo et al., 30 Mar 2025).
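A late-scale token-pruning step in this spirit can be sketched as follows; the residual-magnitude score, the keep ratio, and all names are illustrative assumptions, not the published algorithm's exact criterion:

```python
# Hedged sketch of late-scale token pruning: score tokens by residual
# magnitude, recompute only the top-scoring ("pivotal") subset with the
# expensive model, and reuse cached features for everything else.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 64, 16
cached = rng.standard_normal((n_tokens, d))      # features from a previous scale
residual = rng.standard_normal((n_tokens, d))    # proxy for high-freq residual

keep_ratio = 0.25
scores = np.linalg.norm(residual, axis=1)        # per-token "pivotalness" score
k = int(n_tokens * keep_ratio)
pivotal = np.argsort(scores)[-k:]                # indices of tokens to recompute

def expensive_model(x):                          # stand-in for the transformer
    return x * 2.0

out = cached.copy()                              # pruned tokens reuse the cache
out[pivotal] = expensive_model(cached[pivotal])  # only k tokens recomputed
print(len(pivotal), out.shape)                   # 16 (64, 16)
```

The compute saving comes from running the model on $k \ll n$ tokens; the design question is purely how reliably the cheap score identifies tokens that actually need refinement.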
3. Inference-Time Search and Diversity Techniques
The discrete, sequential structure of VARs enables algorithmic innovations that are not straightforward in continuous diffusion models:
- Beam Search: VARs support beam search for block-level token selection. This leverages prefix caching and early pruning of low-probability branches, leading to substantial reductions (up to 46%) in function evaluations for comparable output quality relative to random sampling in diffusion (Riise et al., 19 Oct 2025).
- Verifier-guided scoring: Integrating external verifiers (e.g., CLIPScore, LLaVA) into the beam objective supports optimization of compositional accuracy and complex spatial/numerical reasoning.
- DiverseVAR: VARs can suffer from diversity collapse, a phenomenon analogous to distilled diffusion models with too few steps. DiverseVAR applies SVD-based soft-suppression and soft-amplification at early and output scales, respectively, to artificially dampen over-dominant feature components, thus unlocking the generative support already present in the pretrained model. This yields recall and coverage gains in COCO sampling benchmarks with minor, if any, loss in alignment and fidelity (Wang et al., 21 Nov 2025).
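A block-level beam search with verifier-guided scoring can be sketched as below. Both the proposal model and the verifier are random or trivial stand-ins, and the scoring combination is an assumption; the point is only the expand-score-prune loop over next-scale blocks:

```python
# Illustrative beam search over next-scale token blocks: each scale offers a
# few candidate blocks with log-probabilities, a hypothetical external
# verifier adds an alignment score, and low-scoring beams are pruned early.
import numpy as np

rng = np.random.default_rng(2)
beam_width, n_candidates, n_scales = 2, 3, 4

def propose(prefix):
    """Stand-in for the VAR model: candidate blocks + log-probs given a prefix."""
    blocks = rng.integers(0, 100, size=n_candidates)
    logps = np.log(rng.dirichlet(np.ones(n_candidates)))
    return list(zip(blocks.tolist(), logps.tolist()))

def verifier(prefix):
    """Stand-in for an external scorer (e.g., a CLIP-style alignment score)."""
    return -0.01 * len(prefix)

beams = [([], 0.0)]
for _ in range(n_scales):
    expanded = []
    for prefix, score in beams:
        for block, logp in propose(prefix):
            new_prefix = prefix + [block]
            expanded.append((new_prefix, score + logp + verifier(new_prefix)))
    expanded.sort(key=lambda b: b[1], reverse=True)  # prune weak beams early
    beams = expanded[:beam_width]

best_prefix, best_score = beams[0]
print(len(best_prefix))  # 4
```

Because discarded branches are never decoded further, the number of model evaluations grows with `beam_width * n_candidates * n_scales` rather than with the full candidate tree.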
4. Controllability, Safety, and Alignment
- Controllable Generation (CAR): The CAR framework enables plug-and-play injection of arbitrary visual control signals (e.g., edges, depth, sketch) into pretrained VARs (Yao et al., 2024). Through progressive multi-scale fusion, lightweight transformers, and learned control extractors, CAR achieves higher fidelity and finer control than prior diffusion-based adapters, while retaining fast inference times and generalization to unseen visual categories.
- Surgical Concept Erasure (VARE and S-VARE): With safety being a significant concern, S-VARE introduces filtered cross-entropy and preservation losses to surgically erase unwanted visual concepts (e.g., NSFW, specific objects) with minimal effect on overall generative quality, leveraging bitwise filtering and auxiliary-token stabilization (Zhong et al., 26 Sep 2025).
- Reinforcement Learning (VAR-RL): The heterogeneous scale structure induces asynchronous policy conflicts under RL training. Solutions include staged reward decomposition (VMR), dynamic time-step weighting (PANW), and spatially-propagated mask isolation—each targeting variance reduction and stable credit assignment; these cumulatively yield improved objective alignment and human preference scores relative to diffusion or vanilla GRPO baselines (Sun et al., 5 Jan 2026).
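A filtered cross-entropy plus preservation loss, loosely in the spirit of the concept-erasure objective above, can be sketched as follows; the token flags, loss weights, and shapes are assumptions for illustration, not the published loss:

```python
# Hedged sketch: supervise only tokens NOT tied to the erased concept
# (filtered CE), while keeping kept-token distributions close to a frozen
# reference model (KL preservation term).
import numpy as np

rng = np.random.default_rng(3)
n_tokens, vocab = 8, 32
logits = rng.standard_normal((n_tokens, vocab))       # edited model
ref_logits = rng.standard_normal((n_tokens, vocab))   # frozen original model
targets = rng.integers(0, vocab, size=n_tokens)
concept_mask = np.array([1, 0, 0, 1, 0, 0, 0, 0])     # 1 = erased-concept token

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

logp, ref_logp = log_softmax(logits), log_softmax(ref_logits)

keep = concept_mask == 0
# Filtered CE: erased-concept positions contribute no supervision signal.
ce = -logp[np.arange(n_tokens), targets][keep].mean()
# Preservation: KL(reference || edited) on the kept positions.
p_ref = np.exp(ref_logp[keep])
kl = (p_ref * (ref_logp[keep] - logp[keep])).sum(axis=-1).mean()

loss = ce + 0.1 * kl
print(float(loss) > 0)  # True (CE is positive here and KL is nonnegative)
```

Filtering the supervision, rather than adding a penalty on concept tokens, is what localizes the edit: gradients simply never flow through the erased positions.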
5. Dense Prediction and Pixel-level Discriminative Tasks
- RemoteVAR: VARs can be adapted for dense, pixel-level discriminative tasks such as remote sensing change detection. RemoteVAR employs multi-resolution cross-attention fusion, autoregressive mask token prediction, and exposure bias mitigation via noisy teacher-forcing (“TokRand”). On benchmark datasets, this approach achieves F1/IoU/OA metrics equaling or surpassing state-of-the-art diffusion and transformer models, and is particularly strong in small-object and boundary detection (Korkmaz et al., 17 Jan 2026).
- Refinement Modules: To counter error accumulation and spatial inconsistency intrinsic to sequential AR generation, post-hoc visual self-refinement modules reprocess the full token sequence using multi-head self-attention. Training only the refinement module yields improvements in LPIPS, FID, SSIM, and human-perceived semantic consistency across colorization, inpainting, and edge tasks (Wang et al., 1 Oct 2025).
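A single-head version of such a post-hoc refinement pass can be sketched as follows; the real modules are trained and multi-headed, so the weights, dimensions, and residual form here are illustrative assumptions:

```python
# Minimal single-head self-attention "refinement" pass over a full decoded
# token-feature sequence, applied as a residual update.
import numpy as np

rng = np.random.default_rng(4)
seq_len, d = 16, 8
tokens = rng.standard_normal((seq_len, d))        # decoded token features
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
attn = softmax(q @ k.T / np.sqrt(d))              # every token attends globally
refined = tokens + attn @ v                       # residual refinement update

print(refined.shape)  # (16, 8)
```

Because attention here is bidirectional over the complete sequence, the module can repair spatial inconsistencies that the strictly causal generation pass could not see.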
6. Limitations, Scaling Laws, and Broader Implications
Despite their strengths, VARs exhibit several challenges:
- Exposure bias: The mismatch between training (teacher forcing) and inference (autoregressive sampling) can degrade performance, especially in discriminative or error-sensitive settings. Techniques such as randomized history replacement (TokRand) and explicit decoder refinement partially bridge this gap (Korkmaz et al., 17 Jan 2026).
- Information loss: Quantization unavoidably introduces perceptual and statistical information loss that fully continuous models avoid. Continuous VARs, trained by strictly proper scoring rules (e.g., the energy score), inherit the statistical guarantee of loss-minimizing convergence but require careful selection of score exponents and generator architectures (Shao et al., 12 May 2025).
- Diversity vs. fidelity: Few-step VAR sampling tends toward low-variance outputs; SVD-based regularization and inference-time trade-offs (via beam width or verifier strictness) allow this balance to be tuned explicitly.
- Efficiency bottlenecks: High-resolution steps can still be computationally expensive. Efficient, stage-aware acceleration and token pruning, together with potential low-rank approximation, are active research areas enabling practical application in real-time, high-resolution scenarios (Li et al., 18 Dec 2025, Guo et al., 30 Mar 2025).
- Broader impact: The general principles underlying VARs—coarse-to-fine sequence decomposition, prefix-aware search, and structured conditioning—have extensions to video, audio, 3D synthesis, and multimodal models, as well as to controlled or safe generative modeling in high-stakes applications (Wang et al., 21 Nov 2025, Zhong et al., 26 Sep 2025).
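The energy score referenced above has a simple sample-based estimator, sketched below; the shapes, sample counts, and the choice of exponent $\beta$ are assumptions for illustration:

```python
# Sketch of a sample-based energy-score estimate, a strictly proper scoring
# rule for continuous distribution matching; lower is better.
import numpy as np

rng = np.random.default_rng(5)

def energy_score(samples, y, beta=1.0):
    """ES(P, y) ~ mean ||X - y||^beta - 0.5 * mean ||X - X'||^beta,
    estimated from model samples X, X' ~ P."""
    term1 = np.mean(np.linalg.norm(samples - y, axis=-1) ** beta)
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = 0.5 * np.mean(np.linalg.norm(diffs, axis=-1) ** beta)
    return term1 - term2

y = np.zeros(4)                                   # ground-truth feature vector
good = rng.normal(0.0, 0.1, size=(64, 4))         # samples near the target
bad = rng.normal(3.0, 0.1, size=(64, 4))          # samples far from the target
print(energy_score(good, y) < energy_score(bad, y))  # True
```

The first term rewards samples close to the observation, while the second term rewards spread among samples, which is why minimizing it discourages the mode collapse that plain regression losses induce.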
Table: Core Properties and Innovations in VARs
| Aspect | VARs (State-of-the-art) | Impact/Benchmark |
|---|---|---|
| Scale prediction | Coarse-to-fine, blockwise (scales $k = 1, \ldots, K$) | ≈13 steps for high-resolution synthesis |
| Latent representation | Discrete VQ/VQ-VAE, Binary Sph. Quant. | Very large effective vocab sizes |
| Inference-time search | Beam, GTO, verifier-guided | GenEval, up to 46% FEs saved (Riise et al., 19 Oct 2025) |
| Diversity method | DiverseVAR (SSR+SAR) | Recall/Coverage 20%/5% ↑ |
| Control framework | CAR: multi-scale visual control | FID 2–5 lower vs. diffusion (Yao et al., 2024) |
| Acceleration | FastVAR, StageVAR (token pruning, RP+RTR) | 2.7×–3.4× speedup, ~1% metric drop |
| Safety/erasure | S-VARE (filtered CE+KL) | 97% concept removal, minimal FID/CLIP degradation (Zhong et al., 26 Sep 2025) |
| RL alignment | VMR, PANW, Mask Propagation | Improved word accuracy (Sun et al., 5 Jan 2026) |
7. Future Directions
Current developments focus on integrating continuous modeling without quantization, faster and more flexible conditional or multimodal generation, efficient adaptation to video and 3D modalities, and robust, fine-grained safety/controllability mechanisms. Empirical scaling laws reveal monotonic improvements in FID/IS as model depth increases, and plug-and-play techniques (DiverseVAR, CAR, S-VARE, StageVAR, self-refinement) point to modular, reusable strategies deployable in other contexts (Guo et al., 30 Mar 2025, Yao et al., 2024). The next frontier likely lies in joint advances in tokenization, sub-scale modeling, domain adaptation, and theoretically grounded training for high-dimensional, structured visual data.