
Autoregressive Models and Scaling (GPT Series)

Updated 29 June 2025

Autoregressive models are a class of statistical and machine learning models in which each output is predicted conditionally on previous outputs, establishing a chain of dependencies across sequences. In the context of large-scale machine learning, "scaling" refers to the systematic expansion of model capacity, dataset size, and compute to improve performance. The GPT (Generative Pre-trained Transformer) series, and its extensions into language, vision, video, and multi-modal tasks, serves as a central reference point for the study of autoregressive modeling and scaling laws. This article surveys key principles, architectural methodologies, scaling behaviors, and cross-domain applications of autoregressive models, synthesizing insights from a broad range of research across language, vision, time series, and scientific modeling.

1. Foundations of Autoregressive Models

Autoregressive (AR) models define each element of a sequence as a probabilistic function of its prior elements, typically factorizing the joint probability as

$$p(x^{1}, \dots, x^{n}) = \prod_{i=1}^{n} p(x^{i} \mid x^{1}, \dots, x^{i-1}).$$

This framework underlies classical statistical models (AR, ARCH, ARMA, ARFIMA; see Zamparo et al., 2013, Sakaguchi et al., 2015, Dhull et al., 2023) and forms the core of modern neural sequence models such as the GPT series. In language, autoregressive modeling predicts the next token conditioned on previous tokens, facilitating open-ended generation and few-shot adaptation.
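
As a concrete illustration of this factorization, the sketch below implements a toy first-order autoregressive language model: a bigram table stands in for the neural network (a real GPT conditions on the full prefix with a transformer), and the vocabulary, probabilities, and function names are invented for illustration only.

```python
import numpy as np

# Toy vocabulary and bigram conditionals p(x_i | x_{i-1}); each row sums to 1.
vocab = ["<s>", "the", "cat", "sat", "</s>"]
P = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0],   # after <s>
    [0.0, 0.0, 0.7, 0.2, 0.1],   # after "the"
    [0.0, 0.1, 0.0, 0.8, 0.1],   # after "cat"
    [0.0, 0.2, 0.1, 0.0, 0.7],   # after "sat"
    [0.0, 0.0, 0.0, 0.0, 1.0],   # after </s> (absorbing)
])

def sequence_log_prob(token_ids):
    """Chain-rule joint log-probability: sum of log p(x_i | x_{i-1})."""
    return sum(np.log(P[a, b]) for a, b in zip(token_ids[:-1], token_ids[1:]))

def sample(max_len=10, seed=0):
    """Autoregressive generation: sample each token conditioned on the previous one."""
    rng = np.random.default_rng(seed)
    ids = [0]                                    # start at <s>
    while ids[-1] != 4 and len(ids) < max_len:   # stop at </s>
        ids.append(rng.choice(len(vocab), p=P[ids[-1]]))
    return [vocab[i] for i in ids]
```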

Variants of this factorization exist across domains: classical AR/ARCH-type processes for financial and scientific time series, token-level models for language, spatial and spatiotemporal token models for images and video, and unified token sequences for multimodal data, as detailed in the sections below.

2. Scaling Laws and Model Capacity

Scaling laws describe how model performance, typically measured as held-out loss, improves predictably as a function of model size (number of parameters), dataset size, and compute.
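
One commonly used additive parameterization (a general form of neural scaling laws, stated here for orientation rather than taken from any single paper cited in this article) writes the loss as

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is the number of parameters, $D$ the number of training tokens, $E$ an irreducible loss floor, and $A$, $B$, $\alpha$, $\beta$ constants fitted to empirical training runs.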

Scaling also leads to the emergence of new capabilities, such as few-shot and in-context learning, which appear only above a critical model scale (Black et al., 2022).

However, for some modalities (notably images and videos), not all architectures yield continued improvements on generation-quality metrics (e.g., FID, GenEval) unless design choices such as continuous tokens and random decoding order are employed (Fan et al., 17 Oct 2024).

3. Architectural Principles and Variants

Language (GPT Family)

  • Decoder-only transformers with causal (GPT-style) self-attention over discrete BPE tokens; scaling favors large models with long contexts, evaluated chiefly by perplexity and downstream accuracy (see Section 7).

Vision and Text-to-Image

  • Text-to-image generation uses discrete VQ/VQGAN tokens under causal or random decoding order; continuous tokens with random-order decoding scale most favorably on FID and GenEval (Fan et al., 17 Oct 2024).

Video

  • Block-local 3D self-attention enables scalable modeling of large spatiotemporal volumes (Weissenborn et al., 2019 ).
  • Causal temporal attention and a kv-cache, adapted from LLM deployment, enable arbitrarily long context in autoregressive video diffusion models (Gao et al., 16 Jun 2024); a minimal kv-cache sketch follows this list.
  • Tokenization: Discretized visual tokens via VQ methods (Rajasegaran et al., 9 Jan 2025 ) or continuous latent representations.
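
To make the kv-cache mechanism concrete, the toy NumPy sketch below caches keys and values for a single attention head so that each newly generated token attends only over previously cached positions. The class and method names are hypothetical, and the code is a minimal illustration of the general technique, not the implementation used in the cited work.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Single-head kv-cache: keys/values of past steps are stored so each new
    token attends only over the cached (i.e., causal) context."""
    def __init__(self, d_model):
        self.d = d_model
        self.keys = np.zeros((0, d_model))
        self.values = np.zeros((0, d_model))

    def step(self, x, Wq, Wk, Wv):
        # x: (d_model,) hidden state of the newest token
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        self.keys = np.vstack([self.keys, k])      # append, never recompute
        self.values = np.vstack([self.values, v])
        attn = softmax(self.keys @ q / np.sqrt(self.d))  # weights over cached steps
        return attn @ self.values                        # (d_model,) attention output

# Usage: one decode step per new token; past keys/values are reused at each step.
rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d ** -0.5 for _ in range(3))
cache = KVCache(d)
for t in range(5):
    out = cache.step(rng.standard_normal(d), Wq, Wk, Wv)
```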

Multimodality

  • Unified token-based modeling enables handling of interleaved text and image tokens (Yu et al., 2023); a toy interleaving sketch follows this list.
  • Retrieval augmentation and instruction tuning facilitate controllable, generalizable multi-modal generation (Yu et al., 2023 ).
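
As a deliberately simplified illustration of unified token-based modeling, the sketch below packs text tokens and image tokens into one autoregressive sequence by offsetting the image codebook indices so the two vocabularies do not collide. The vocabulary sizes, offsets, and boundary tokens are assumptions made for this example, not the scheme of the cited work.

```python
# Toy interleaving of text and image tokens into a single sequence.
# Assumed vocabulary layout: [0, TEXT_VOCAB) for text ids,
# [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB) for image codebook ids,
# followed by two special tokens marking an image span.
TEXT_VOCAB = 50_000        # e.g., BPE vocabulary size (illustrative)
IMAGE_VOCAB = 8_192        # e.g., VQ codebook size (illustrative)
BOI = TEXT_VOCAB + IMAGE_VOCAB   # begin-of-image token
EOI = BOI + 1                    # end-of-image token

def interleave(text_ids, image_codes):
    """Flatten text followed by an image span into one token sequence that a
    single autoregressive transformer over the joint vocabulary can model."""
    image_ids = [TEXT_VOCAB + c for c in image_codes]   # shift into the image range
    return text_ids + [BOI] + image_ids + [EOI]

sequence = interleave([17, 942, 5], [3, 4095, 12, 7])
```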

Scientific and Financial Time Series

  • Hybrid AR/ARCH models with scaling symmetry capture fat tails, multiscaling, and volatility bursts (Zamparo et al., 2013); a toy AR+ARCH simulation follows this list.
  • Memory kernels—power law or geometric infinitely divisible innovations—yield long-term dependency, heavy-tailed behavior, and tractability for quantitative estimation (Sakaguchi et al., 2015 , Dhull et al., 2023 ).
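
The sketch below simulates a generic AR(1) mean process with ARCH(1) conditional variance to show how such hybrids produce volatility clustering. The recursion and parameter values are illustrative defaults, not the specific models of the cited papers.

```python
import numpy as np

def simulate_ar1_arch1(n, phi=0.6, omega=0.05, alpha=0.3, seed=0):
    """Toy AR(1)+ARCH(1) process:
        x_t = phi * x_{t-1} + eps_t,  eps_t ~ N(0, sigma_t^2),
        sigma_t^2 = omega + alpha * eps_{t-1}^2
    Large shocks feed back into the variance, clustering volatility."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    eps_prev = 0.0
    for t in range(1, n):
        sigma2 = omega + alpha * eps_prev ** 2
        eps = rng.normal(0.0, np.sqrt(sigma2))
        x[t] = phi * x[t - 1] + eps
        eps_prev = eps
    return x

series = simulate_ar1_arch1(10_000)   # heavier-tailed than i.i.d. Gaussian noise
```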

4. Performance Metrics and Empirical Evaluation

Autoregressive models are evaluated using domain-appropriate metrics: perplexity and downstream accuracy for language, FID and GenEval for images, FVD for video, CIDEr and VQA accuracy for multimodal generation, and trajectory errors (mADE, mFDE) for driving (see the table in Section 7).

Scaling studies consistently report that increasing data and model size yields lower losses and often better downstream metrics, but with diminishing returns at very large scales if data does not keep pace (Huang et al., 19 Dec 2024, Rajasegaran et al., 9 Jan 2025).
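
A minimal way to examine such a trend, assuming a pure power law L(N) = a · N^(−b) in model size (a simplifying assumption; empirical scaling studies typically also fit data terms and an irreducible loss), is a linear fit in log-log space. The loss values below are hypothetical numbers used only to make the snippet runnable.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) pairs, for illustration only.
params = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([4.10, 3.45, 2.95, 2.57])

# Assume L(N) = a * N**(-b); then log L = log a - b * log N is a straight line.
slope, log_a = np.polyfit(np.log(params), np.log(losses), deg=1)
a, b = np.exp(log_a), -slope

loss_at_1e11 = a * 1e11 ** (-b)   # extrapolate the fitted power law one decade out
```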

5. Domain-Specific Challenges and Solutions

Vision and Video

  • Redundancy: Adjacent video frames are highly correlated, slowing scaling gains (Rajasegaran et al., 9 Jan 2025 ). Mitigation strategies include improved tokenization or loss functions.
  • Token Representation: Discrete quantization can create information bottlenecks, limiting model scaling (Fan et al., 17 Oct 2024 ). Continuous tokens offer higher fidelity and scaling latitude.
  • Decoding Order: Causal (GPT-style) order is efficient for autoregression but can struggle with compositionality; random order with bidirectional attention handles globally conditioned generation (Fan et al., 17 Oct 2024); a toy comparison of the two schedules follows this list.
  • Efficiency: kv-cache and context truncation are essential for tractable inference over long horizons (Gao et al., 16 Jun 2024).
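
The sketch below contrasts the two generation orders purely at the level of the position schedule (masking and attention are omitted); the helper name and step counts are invented for illustration and do not come from the cited paper.

```python
import numpy as np

def decoding_schedule(num_tokens, order="causal", num_steps=4, seed=0):
    """Return the groups of token positions revealed at each generation step.
    'causal' reveals one position at a time, left to right (GPT-style);
    'random' reveals a random permutation in chunks, which pairs naturally with
    bidirectional attention over the positions revealed so far."""
    if order == "causal":
        return [np.array([i]) for i in range(num_tokens)]
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_tokens), num_steps)

causal_steps = decoding_schedule(16, order="causal")               # 16 steps of 1 token
random_steps = decoding_schedule(16, order="random", num_steps=4)  # 4 steps of 4 tokens
```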

Language and Multimodal

  • Few-shot and in-context learning emerge above a critical scale (Black et al., 2022).
  • Language diffusion models converted from AR models can achieve competitive reasoning and infilling capabilities at scale (Gong et al., 23 Oct 2024 ).

Scientific Applications

  • Modeling fat tails and volatility requires multi-kernel or infinitely divisible processes, which can be represented as mixtures or via memory kernels (Zamparo et al., 2013 , Dhull et al., 2023 ).
  • Parameter estimation: Analytical tractability is improved in models with closed-form Laplace transforms and method-of-moments estimators (Dhull et al., 2023); a generic method-of-moments example follows this list.
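
As a simple, generic example of method-of-moments estimation (for a plain stationary AR(1) process, not the specific heavy-tailed models studied in the cited work), the AR coefficient can be recovered by matching the lag-1 autocorrelation:

```python
import numpy as np

def ar1_method_of_moments(x):
    """Estimate phi in x_t = phi * x_{t-1} + eps_t by moment matching:
    for a stationary AR(1), the lag-1 autocorrelation equals phi."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[1:], x[:-1]) / np.dot(x, x)

# Usage: recover phi ~ 0.7 from a simulated series.
rng = np.random.default_rng(1)
series = np.zeros(5_000)
for t in range(1, series.size):
    series[t] = 0.7 * series[t - 1] + rng.normal()
phi_hat = ar1_method_of_moments(series)
```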

6. Practical Implications and Future Directions

  • Generalization across domains: Autoregressive modeling and scaling laws are robust across language, vision, video, and even behavior modeling for autonomous driving (Huang et al., 19 Dec 2024 ).
  • Minimal inductive bias: Large-scale AR transformers with little domain-specific architectural bias can learn rich representations transferable to diverse downstream tasks (Rajasegaran et al., 9 Jan 2025 ).
  • End-to-end tokenization: Tokenizers trained separately from the backbone (e.g., a frozen dVAE) may limit representation quality, especially for generation. Future research will likely emphasize end-to-end learned, possibly continuous, tokenizers (Rajasegaran et al., 9 Jan 2025, Fan et al., 17 Oct 2024).
  • Unified multimodal generative models, capable of both text and image (or video) generation and infilling, are enabled by autoregressive transformer architectures (Yu et al., 2023 ).
  • Hybrid paradigms: Diffusion and AR models are increasingly connected, with recent work translating AR LMs into diffusion LMs while preserving scaling and performance (Gong et al., 23 Oct 2024 ).
  • Data scaling as bottleneck: For very large models, data scale (and diversity) becomes the limiting factor for improvement (Huang et al., 19 Dec 2024 ).
  • Downstream application readiness: AR models scaled with sufficient data and compute reach or surpass the practical requirements for deployment in tasks requiring robustness and long-range temporal or contextual modeling in vision, video, or control (Huang et al., 19 Dec 2024 , Rajasegaran et al., 9 Jan 2025 ).

7. Summary Table: Scaling and Design Considerations in Autoregressive Models

| Domain/Modality | Tokenization | Order/Attention | Best Scaling Design | Key Metrics |
|---|---|---|---|---|
| Language | Discrete (BPE) | Causal (GPT) | Large models + long context | Perplexity, accuracy |
| Image | Discrete (VQ/VQGAN) | Causal or random | Continuous tokens, random order (Fan et al., 17 Oct 2024) | FID, GenEval |
| Video | Discrete (dVAE) | Block/causal (3D) | Causal attention, kv-cache (Gao et al., 16 Jun 2024) | FVD, Step-FVD |
| Multi-modal | Discrete (VQ/CLIP) | Causal + retrieval | Token-based, retrieval-augmented, contrastive decoding (Yu et al., 2023) | FID, CIDEr, VQA accuracy |
| Driving | Discrete | Causal (LLM style) | AR decoder with massive data (Huang et al., 19 Dec 2024) | mADE, mFDE, MR |
| Finance | Real-valued | AR/ARCH | Hybrid AR + regime, scaling kernels (Zamparo et al., 2013) | Volatility clustering |
| Diffusion LM | Discrete | Bidirectional | AR-to-diffusion adaptation (Gong et al., 23 Oct 2024) | Reasoning, infilling |

Autoregressive modeling, exemplified by the GPT series and its extensions, demonstrates robust scaling properties across a spectrum of domains when architectures, tokenizations, and generation orders are aligned with the domain structure. Scaling both model and data yields emergent capabilities and increasingly competitive or superior results on standard benchmarks. Continued research focuses on refining token representation (especially in vision), optimizing computational efficiency, and generalizing autoregressive modeling principles for cross-modal and scientific applications.