Autoregressive Models and Scaling (GPT Series)

Updated 1 July 2025
  • Autoregressive models predict each output based on prior outputs, forming sequential dependencies, while scaling involves increasing model, data, and compute capacity to enhance performance, notably seen in the GPT series across language, vision, and other domains.
  • Scaling laws empirically show performance improvements as model size, dataset size, and compute increase, leading to emergent capabilities like few-shot learning in language models and state-of-the-art results in vision and video generation.
  • Architectural principles vary across domains, utilizing decoder-only or encoder-decoder transformers with causal or random attention and evolving tokenization strategies (discrete vs. continuous) to address domain-specific challenges and enable multi-modal applications.

Autoregressive models are a class of statistical and machine learning models in which each output is predicted conditionally on previous outputs, establishing a chain of dependencies across sequences. In the context of large-scale machine learning, "scaling" refers to the systematic expansion of model capacity, dataset size, and compute to improve performance. The GPT (Generative Pre-trained Transformer) series, and its extensions into language, vision, video, and multi-modal tasks, serves as a central reference for the study of autoregressive modeling and scaling laws. This article surveys key principles, architectural methodologies, scaling behaviors, and cross-domain applications of autoregressive models, synthesizing insights from a broad range of research across language, vision, time series, and scientific modeling.

1. Foundations of Autoregressive Models

Autoregressive (AR) models define each element of a sequence as a probabilistic function of its prior elements, typically factorizing the joint probability as $p(x^{1}, \ldots, x^{n}) = \prod_{i=1}^{n} p(x^{i} \mid x^{1}, \ldots, x^{i-1})$. This framework underlies classical statistical models (AR, ARCH, ARMA, ARFIMA; see (1305.3243, 1508.07715, 2309.02661)) and forms the core of modern neural sequence models such as the GPT series. In language, autoregressive modeling predicts the next token conditioned on previous tokens, facilitating open-ended generation and few-shot adaptation.
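A minimal sketch of this factorization in code, assuming a hypothetical `model` that maps a token tensor to next-token logits (any GPT-style causal decoder would do):

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(model, tokens):
    """Chain rule: log p(x) = sum_i log p(x_i | x_<i).

    `tokens` has shape (batch, seq_len); `model` is a hypothetical causal LM
    returning next-token logits of shape (batch, seq_len - 1, vocab) when fed
    the first seq_len - 1 tokens.
    """
    logits = model(tokens[:, :-1])                 # predict positions 1..n-1
    log_probs = F.log_softmax(logits, dim=-1)      # per-position conditionals
    targets = tokens[:, 1:]                        # ground-truth next tokens
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum(dim=-1)                    # one log-likelihood per sequence
```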

Variants exist across domains:

  • Language: Decoder-only transformers (GPT, GPT-NeoX (2204.06745)), where each token is produced conditionally and attention masks enforce causality (a minimal causal-mask sketch follows this list).
  • Vision: Pixels or latent patches are predicted conditionally, e.g., in iGPT or autoregressive text-to-image/video models (2206.10789, 2410.13863, 2501.05453).
  • Video: Visual tokens or frames are modeled in temporal order (1906.02634, 2406.10981, 2501.05453).
  • Finance/time series: Classical AR or ARCH-type models with memory kernels control dependencies (1305.3243, 1508.07715).
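As noted in the language item above, causal masking is what makes a transformer autoregressive: position i may attend only to positions at or before i. A minimal single-head sketch, not tied to any particular GPT implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean mask: True where attention is allowed.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def causal_attention(q, k, v):
    # Scaled dot-product attention with causality enforced by the mask.
    # q, k, v: (batch, seq_len, dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~causal_mask(q.size(-2)), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```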

2. Scaling Laws and Model Capacity

Scaling laws describe how model performance improves as a function of model size (number of parameters), dataset size, and compute.

  • For LLMs (2204.06745, 2410.17891), empirical investigations show that validation loss decreases as a power law in model size and dataset size (an illustrative fit is sketched after this list).
  • In vision and video (2206.10789, 2410.13863, 2501.05453), similar but sometimes slower scaling curves are observed, with lower exponents than for text (2501.05453) and validation loss (typically negative log-likelihood) decreasing steadily with increased capacity.
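The illustrative fit referenced above: the (parameter count, loss) pairs below are invented for the sake of the example, but fitting a saturating power law L(N) = a * N^(-alpha) + c is how scaling exponents are typically extracted from such sweeps:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, validation loss) pairs; real studies fit analogous sweeps.
params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
losses = np.array([3.10, 2.85, 2.62, 2.44, 2.30])

def power_law(n, a, alpha, c):
    # L(N) = a * N**(-alpha) + c: loss falls as a power law toward an irreducible floor c.
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 1.0], maxfev=10000)
print(f"fitted exponent alpha = {alpha:.3f}, irreducible loss c = {c:.2f}")
```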

Scaling also leads to the emergence of new capabilities, such as few-shot and in-context learning in language models (2204.06745).

However, for some modalities (notably images and videos), not all architectures yield continued improvements on qualitative metrics (e.g., FID, GenEval) unless design choices such as continuous tokens and random decoding order are employed (2410.13863).

3. Architectural Principles and Variants

Language (GPT Family)

  • Decoder-only Transformer backbones with self-attention and causal masking.
  • Rotary positional embeddings improve extrapolation and efficiency (2204.06745, 2501.05453); a minimal sketch follows this list.
  • Parallelization strategies (tensor, pipeline, data parallelism) enable training at the 20B+ parameter scale (2204.06745).
  • Improved tokenization (e.g., BPE variants, whitespace handling) reduces sequence length and improves performance (2204.06745).
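A minimal sketch of the rotary-embedding idea mentioned above, using the common "rotate-half" formulation; the base frequency and shapes are illustrative, and this is not a drop-in replacement for any library's implementation:

```python
import torch

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, dim) with dim even. Each channel pair is rotated by a
    # position-dependent angle, so relative offsets appear directly in q.k products.
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```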

Vision and Text-to-Image

  • Encoder-decoder (Parti (2206.10789)) or decoder-only (Fluid (2410.13863), CM3Leon (2309.02591)) transformers.
  • Tokenization: Early methods use discrete VQ tokens (2206.10789); more recent models, such as Fluid, employ continuous tokens for higher fidelity (2410.13863).
  • Generation Order: Raster-scan causal order (GPT-style) vs. random order (MaskGIT/BERT-style) with bidirectional attention (2410.13863).
  • Random order + continuous tokens yield sustained scaling improvements on perceptual metrics (2410.13863).
  • Guidance and reranking methods (classifier-free guidance, contrastive decoding) improve alignment with prompts and visual fidelity (2206.10789, 2309.02591).
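A common way to implement the classifier-free guidance mentioned in the last item, sketched for token-by-token decoding (the guidance scale and the two logit passes are assumptions, not the exact recipe of the cited systems):

```python
import torch

def guided_next_token(cond_logits: torch.Tensor,
                      uncond_logits: torch.Tensor,
                      guidance_scale: float = 3.0) -> torch.Tensor:
    # Push the conditional distribution away from the unconditional one,
    # then sample the next visual token. cond/uncond logits: (batch, vocab).
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (batch, 1) token ids
```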

Video

  • Block-local 3D self-attention enables scalable modeling of large spatiotemporal volumes (1906.02634); see the sketch after this list.
  • Causal temporal attention and kv-cache, adapted from LLM deployment, enable arbitrarily long context in video diffusion autoregressive models (2406.10981).
  • Tokenization: Discretized visual tokens via VQ methods (2501.05453) or continuous latent representations.
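A simplified sketch of the block-local 3D self-attention referenced above: tokens attend only within non-overlapping spatiotemporal blocks, so cost grows with the block size rather than with the full video volume (single head, illustrative block shape, dimensions assumed divisible):

```python
import torch

def block_local_attention(x: torch.Tensor, block=(4, 8, 8)) -> torch.Tensor:
    # x: (T, H, W, dim); T, H, W must be divisible by the block sizes.
    T, H, W, d = x.shape
    bt, bh, bw = block
    # Group tokens into (num_blocks, tokens_per_block, dim).
    blocks = (x.reshape(T // bt, bt, H // bh, bh, W // bw, bw, d)
               .permute(0, 2, 4, 1, 3, 5, 6)
               .reshape(-1, bt * bh * bw, d))
    attn = torch.softmax(blocks @ blocks.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = attn @ blocks
    # Restore the original (T, H, W, dim) layout.
    return (out.reshape(T // bt, H // bh, W // bw, bt, bh, bw, d)
               .permute(0, 3, 1, 4, 2, 5, 6)
               .reshape(T, H, W, d))
```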

Multimodality

  • Unified token-based modeling enables handling of interleaved text and image tokens (2309.02591).
  • Retrieval augmentation and instruction tuning facilitate controllable, generalizable multi-modal generation (2309.02591).

Scientific and Financial Time Series

  • Hybrid AR/ARCH models with scaling symmetry capture fat tails, multiscaling, and volatility bursts (1305.3243); a toy simulation follows this list.
  • Memory kernels—power law or geometric infinitely divisible innovations—yield long-term dependency, heavy-tailed behavior, and tractability for quantitative estimation (1508.07715, 2309.02661).
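The toy simulation referenced above is a generic AR(1)+ARCH(1) process, not the specific models of the cited papers; it shows how feeding past shocks into the conditional variance produces volatility clustering and heavy tails:

```python
import numpy as np

def simulate_ar_arch(n=2000, phi=0.2, omega=0.1, alpha=0.85, seed=0):
    # r_t = phi * r_{t-1} + sigma_t * eps_t,  sigma_t^2 = omega + alpha * r_{t-1}^2
    # Large shocks raise next-step variance, so big moves cluster in time.
    rng = np.random.default_rng(seed)
    r = np.zeros(n)
    for t in range(1, n):
        sigma2 = omega + alpha * r[t - 1] ** 2
        r[t] = phi * r[t - 1] + np.sqrt(sigma2) * rng.standard_normal()
    return r

returns = simulate_ar_arch()
excess_kurtosis = ((returns - returns.mean()) ** 4).mean() / returns.var() ** 2 - 3.0
print(f"excess kurtosis = {excess_kurtosis:.2f}")   # well above 0: heavy tails
```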

4. Performance Metrics and Empirical Evaluation

Autoregressive models are evaluated using domain-appropriate metrics:

  • Language: Per-token log-likelihood, perplexity, and zero/few-shot accuracy on LLM benchmarks (2204.06745, 2410.17891); a perplexity helper is sketched after this list.
  • Image/Text-to-Image: FID (Fréchet Inception Distance), GenEval (prompt-object alignment), PSNR for token reconstruction (2206.10789, 2410.13863).
  • Video: FVD, Step-FVD, and ΔEdgeFD for temporal and transition consistency (1906.02634, 2406.10981).
  • Driving/Robotics: minADE, minFDE, collision and offroad rates, mAP (2412.14415).
  • Scientific time series: Scaling exponents, autocorrelation decay rates, root-mean-square displacement (1305.3243, 1508.07715).
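For reference, perplexity is just the exponential of the mean per-token negative log-likelihood; a minimal helper (the numbers in the example are made up):

```python
import math

def perplexity(total_log_likelihood: float, num_tokens: int) -> float:
    # exp of the mean negative log-likelihood per token; lower is better.
    return math.exp(-total_log_likelihood / num_tokens)

# e.g., 1000 tokens with a summed log-likelihood of -2500 nats
print(perplexity(-2500.0, 1000))   # exp(2.5), roughly 12.18
```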

Scaling studies consistently report that increasing data and model size yields lower losses and often better downstream metrics, but with diminishing returns at very large scales if data does not keep pace (2412.14415, 2501.05453).

5. Domain-Specific Challenges and Solutions

Vision and Video

  • Redundancy: Adjacent video frames are highly correlated, slowing scaling gains (2501.05453). Mitigation strategies include improved tokenization or loss functions.
  • Token Representation: Discrete quantization can create information bottlenecks, limiting model scaling (2410.13863). Continuous tokens offer higher fidelity and scaling latitude.
  • Decoding Order: Causal (GPT-style) order is efficient for autoregression but can struggle with compositionality; random order with bidirectional attention handles globally conditioned generation (2410.13863).
  • Efficiency: kv-cache and context truncation are essential for tractable inference at long horizons (2406.10981).
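A minimal sketch of a kv-cache with optional context truncation as described in the last item; the shapes and windowing policy are illustrative, not the scheme of any particular system:

```python
from typing import Optional
import torch

class KVCache:
    """Append-only key/value cache for causal decoding."""

    def __init__(self, window: Optional[int] = None):
        self.k: Optional[torch.Tensor] = None
        self.v: Optional[torch.Tensor] = None
        self.window = window  # keep only the most recent `window` positions if set

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, 1, dim) for the current decoding step.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        if self.window is not None:  # simple context truncation
            self.k = self.k[:, -self.window:]
            self.v = self.v[:, -self.window:]
        return self.k, self.v  # attend over cached history instead of recomputing it
```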

Language and Multimodal

  • Few-shot and in-context learning emerge above critical scale (2204.06745).
  • Language diffusion models converted from AR models can achieve competitive reasoning and infilling capabilities at scale (2410.17891).

Scientific Applications

  • Modeling fat tails and volatility requires multi-kernel or infinitely divisible processes, which can be represented as mixtures or via memory kernels (1305.3243, 2309.02661).
  • Parameter estimation: Analytical tractability is improved in models with closed-form Laplace transforms and method-of-moments estimators (2309.02661).
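As a generic illustration of moment-based estimation (a zero-mean AR(1) fitted by Yule-Walker moment matching), not the closed-form Laplace-transform estimators of the cited work:

```python
import numpy as np

def ar1_method_of_moments(x: np.ndarray):
    # Match the lag-1 autocovariance and variance of x_t = phi * x_{t-1} + eps_t.
    x = x - x.mean()
    phi_hat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])   # lag-1 autocorrelation
    sigma2_hat = x.var() * (1.0 - phi_hat ** 2)                # innovation variance
    return phi_hat, sigma2_hat
```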

6. Practical Implications and Future Directions

  • Generalization across domains: Autoregressive modeling and scaling laws are robust across language, vision, video, and even behavior modeling for autonomous driving (2412.14415).
  • Minimal inductive bias: Large-scale AR transformers with little domain-specific architectural bias can learn rich representations transferable to diverse downstream tasks (2501.05453).
  • End-to-end tokenization: Tokenizers not learned end-to-end with the generative model (e.g., separately trained dVAEs) may limit representation quality, especially for generation. Future research will likely emphasize end-to-end learned, possibly continuous, tokenizers (2501.05453, 2410.13863).
  • Unified multimodal generative models, capable of both text and image (or video) generation and infilling, are enabled by autoregressive transformer architectures (2309.02591).
  • Hybrid paradigms: Diffusion and AR models are increasingly connected, with recent work translating AR LMs into diffusion LMs while preserving scaling and performance (2410.17891).
  • Data scaling as bottleneck: For very large models, data scale (and diversity) becomes the limiting factor for improvement (2412.14415).
  • Downstream application readiness: AR models scaled with sufficient data and compute reach or surpass the practical requirements for deployment in tasks requiring robustness and long-range temporal or contextual modeling in vision, video, or control (2412.14415, 2501.05453).

7. Summary Table: Scaling and Design Considerations in Autoregressive Models

| Domain/Modality | Tokenization | Order/Attention | Best Scaling Design | Key Metrics |
|---|---|---|---|---|
| Language | Discrete (BPE) | Causal (GPT) | Large models + long context | Perplexity, accuracy |
| Image | Discrete (VQ/VQGAN) | Causal or random | Continuous tokens, random order (2410.13863) | FID, GenEval |
| Video | Discrete (dVAE) | Block/causal (3D) | Causal attention, kv-cache (2406.10981) | FVD, Step-FVD |
| Multi-modal | Discrete (VQ/CLIP) | Causal + retrieval | Token-based, retrieval-aug., contrastive dec. (2309.02591) | FID, CIDEr, VQA acc. |
| Driving | Discrete | Causal (LLM style) | AR decoder with massive data (2412.14415) | minADE, minFDE, MR |
| Finance | Real-valued | AR/ARCH | Hybrid AR + regime, scaling kernels (1305.3243) | Volatility clustering |
| Diffusion LM | Discrete | Bidirectional | AR-to-diffusion adaptation (2410.17891) | Reasoning, infilling |

Autoregressive modeling, exemplified by the GPT series and its extensions, demonstrates robust scaling properties across a spectrum of domains when architectures, tokenizations, and generation orders are aligned with the domain structure. Scaling both model and data yields emergent capabilities and increasingly competitive or superior results on standard benchmarks. Continued research focuses on refining token representation (especially in vision), optimizing computational efficiency, and generalizing autoregressive modeling principles for cross-modal and scientific applications.