
Autoregressive Models and Scaling (GPT Series)

Updated 29 June 2025

Autoregressive models are a class of statistical and machine learning models in which each output is predicted conditionally on previous outputs, establishing a chain of dependencies across sequences. In the context of large-scale machine learning, "scaling" refers to the systematic expansion of model capacity, dataset size, and compute to improve performance. The GPT (Generative Pre-trained Transformer) series, and its extensions into language, vision, video, and multi-modal tasks, serves as a central reference point for the study of autoregressive modeling and scaling laws. This article surveys key principles, architectural methodologies, scaling behaviors, and cross-domain applications of autoregressive models, synthesizing insights from a broad range of research across language, vision, time series, and scientific modeling.

1. Foundations of Autoregressive Models

Autoregressive (AR) models define each element of a sequence as a probabilistic function of its prior elements, typically factorizing the joint probability as

$$p(x^{1}, \dots, x^{n}) = \prod_{i=1}^{n} p(x^{i} \mid x^{1}, \dots, x^{i-1}).$$

This framework underlies classical statistical models (AR, ARCH, ARMA, ARFIMA; see Zamparo et al., 2013, Sakaguchi et al., 2015, Dhull et al., 2023) and forms the core of modern neural sequence models such as the GPT series. In language, autoregressive modeling predicts the next token conditioned on previous tokens, facilitating open-ended generation and few-shot adaptation.
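
As a concrete illustration of this factorization, the sketch below implements a toy first-order autoregressive language model: a bigram table stands in for the neural network (a real GPT conditions on the full prefix with a transformer), and the vocabulary, probabilities, and function names are invented for illustration only.

```python
import numpy as np

# Toy vocabulary and bigram conditionals p(x_i | x_{i-1}); each row sums to 1.
vocab = ["<s>", "the", "cat", "sat", "</s>"]
P = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0],   # after <s>
    [0.0, 0.0, 0.7, 0.2, 0.1],   # after "the"
    [0.0, 0.1, 0.0, 0.8, 0.1],   # after "cat"
    [0.0, 0.2, 0.1, 0.0, 0.7],   # after "sat"
    [0.0, 0.0, 0.0, 0.0, 1.0],   # after </s> (absorbing)
])

def sequence_log_prob(token_ids):
    """Chain-rule joint log-probability: sum of log p(x_i | x_{i-1})."""
    return sum(np.log(P[a, b]) for a, b in zip(token_ids[:-1], token_ids[1:]))

def sample(max_len=10, seed=0):
    """Autoregressive generation: sample each token conditioned on the previous one."""
    rng = np.random.default_rng(seed)
    ids = [0]                                    # start at <s>
    while ids[-1] != 4 and len(ids) < max_len:   # stop at </s>
        ids.append(rng.choice(len(vocab), p=P[ids[-1]]))
    return [vocab[i] for i in ids]
```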

Variants of this factorization exist across domains: classical AR/ARCH-type processes for financial and scientific time series, token-level models for language, spatial and spatiotemporal token models for images and video, and unified token sequences for multimodal data, as detailed in the sections below.

2. Scaling Laws and Model Capacity

Scaling laws describe how model performance, typically measured as held-out loss, improves predictably as a function of model size (number of parameters), dataset size, and compute.
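
One commonly used additive parameterization (a general form of neural scaling laws, stated here for orientation rather than taken from any single paper cited in this article) writes the loss as

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is the number of parameters, $D$ the number of training tokens, $E$ an irreducible loss floor, and $A$, $B$, $\alpha$, $\beta$ constants fitted to empirical training runs.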

Scaling also leads to the emergence of new capabilities, such as few-shot and in-context learning, which appear only above a critical model scale (Black et al., 2022).

However, for some modalities (notably images and videos), not all architectures yield continued improvements on generation-quality metrics (e.g., FID, GenEval) unless design choices such as continuous tokens and random decoding order are employed (Fan et al., 17 Oct 2024).

3. Architectural Principles and Variants

Language (GPT Family)

  • Decoder-only transformers with causal (GPT-style) self-attention over discrete BPE tokens; scaling favors large models with long contexts, evaluated chiefly by perplexity and downstream accuracy (see Section 7).

Vision and Text-to-Image

  • Text-to-image generation uses discrete VQ/VQGAN tokens under causal or random decoding order; continuous tokens with random-order decoding scale most favorably on FID and GenEval (Fan et al., 17 Oct 2024).

Video

  • Block-local 3D self-attention enables scalable modeling of large spatiotemporal volumes (Weissenborn et al., 2019 ).
  • Causal temporal attention and a kv-cache, adapted from LLM deployment, enable arbitrarily long context in autoregressive video diffusion models (Gao et al., 16 Jun 2024); a minimal kv-cache sketch follows this list.
  • Tokenization: Discretized visual tokens via VQ methods (Rajasegaran et al., 9 Jan 2025 ) or continuous latent representations.
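
To make the kv-cache mechanism concrete, the toy NumPy sketch below caches keys and values for a single attention head so that each newly generated token attends only over previously cached positions. The class and method names are hypothetical, and the code is a minimal illustration of the general technique, not the implementation used in the cited work.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Single-head kv-cache: keys/values of past steps are stored so each new
    token attends only over the cached (i.e., causal) context."""
    def __init__(self, d_model):
        self.d = d_model
        self.keys = np.zeros((0, d_model))
        self.values = np.zeros((0, d_model))

    def step(self, x, Wq, Wk, Wv):
        # x: (d_model,) hidden state of the newest token
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        self.keys = np.vstack([self.keys, k])      # append, never recompute
        self.values = np.vstack([self.values, v])
        attn = softmax(self.keys @ q / np.sqrt(self.d))  # weights over cached steps
        return attn @ self.values                        # (d_model,) attention output

# Usage: one decode step per new token; past keys/values are reused at each step.
rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * d ** -0.5 for _ in range(3))
cache = KVCache(d)
for t in range(5):
    out = cache.step(rng.standard_normal(d), Wq, Wk, Wv)
```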

Multimodality

  • Unified token-based modeling enables handling of interleaved text and image tokens (Yu et al., 2023); a toy interleaving sketch follows this list.
  • Retrieval augmentation and instruction tuning facilitate controllable, generalizable multi-modal generation (Yu et al., 2023 ).
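
As a deliberately simplified illustration of unified token-based modeling, the sketch below packs text tokens and image tokens into one autoregressive sequence by offsetting the image codebook indices so the two vocabularies do not collide. The vocabulary sizes, offsets, and boundary tokens are assumptions made for this example, not the scheme of the cited work.

```python
# Toy interleaving of text and image tokens into a single sequence.
# Assumed vocabulary layout: [0, TEXT_VOCAB) for text ids,
# [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB) for image codebook ids,
# followed by two special tokens marking an image span.
TEXT_VOCAB = 50_000        # e.g., BPE vocabulary size (illustrative)
IMAGE_VOCAB = 8_192        # e.g., VQ codebook size (illustrative)
BOI = TEXT_VOCAB + IMAGE_VOCAB   # begin-of-image token
EOI = BOI + 1                    # end-of-image token

def interleave(text_ids, image_codes):
    """Flatten text followed by an image span into one token sequence that a
    single autoregressive transformer over the joint vocabulary can model."""
    image_ids = [TEXT_VOCAB + c for c in image_codes]   # shift into the image range
    return text_ids + [BOI] + image_ids + [EOI]

sequence = interleave([17, 942, 5], [3, 4095, 12, 7])
```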

Scientific and Financial Time Series

  • Hybrid AR/ARCH models with scaling symmetry capture fat tails, multiscaling, and volatility bursts (Zamparo et al., 2013); a toy AR+ARCH simulation follows this list.
  • Memory kernels—power law or geometric infinitely divisible innovations—yield long-term dependency, heavy-tailed behavior, and tractability for quantitative estimation (Sakaguchi et al., 2015 , Dhull et al., 2023 ).
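
The sketch below simulates a generic AR(1) mean process with ARCH(1) conditional variance to show how such hybrids produce volatility clustering. The recursion and parameter values are illustrative defaults, not the specific models of the cited papers.

```python
import numpy as np

def simulate_ar1_arch1(n, phi=0.6, omega=0.05, alpha=0.3, seed=0):
    """Toy AR(1)+ARCH(1) process:
        x_t = phi * x_{t-1} + eps_t,  eps_t ~ N(0, sigma_t^2),
        sigma_t^2 = omega + alpha * eps_{t-1}^2
    Large shocks feed back into the variance, clustering volatility."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    eps_prev = 0.0
    for t in range(1, n):
        sigma2 = omega + alpha * eps_prev ** 2
        eps = rng.normal(0.0, np.sqrt(sigma2))
        x[t] = phi * x[t - 1] + eps
        eps_prev = eps
    return x

series = simulate_ar1_arch1(10_000)   # heavier-tailed than i.i.d. Gaussian noise
```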

4. Performance Metrics and Empirical Evaluation

Autoregressive models are evaluated using domain-appropriate metrics: perplexity and downstream accuracy for language, FID and GenEval for images, FVD for video, CIDEr and VQA accuracy for multimodal generation, and trajectory errors (mADE, mFDE) for driving (see the table in Section 7).

Scaling studies consistently report that increasing data and model size yields lower losses and often better downstream metrics, but with diminishing returns at very large scales if data does not keep pace (Huang et al., 19 Dec 2024, Rajasegaran et al., 9 Jan 2025).
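
A minimal way to examine such a trend, assuming a pure power law L(N) = a · N^(−b) in model size (a simplifying assumption; empirical scaling studies typically also fit data terms and an irreducible loss), is a linear fit in log-log space. The loss values below are hypothetical numbers used only to make the snippet runnable.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) pairs, for illustration only.
params = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([4.10, 3.45, 2.95, 2.57])

# Assume L(N) = a * N**(-b); then log L = log a - b * log N is a straight line.
slope, log_a = np.polyfit(np.log(params), np.log(losses), deg=1)
a, b = np.exp(log_a), -slope

loss_at_1e11 = a * 1e11 ** (-b)   # extrapolate the fitted power law one decade out
```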

5. Domain-Specific Challenges and Solutions

Vision and Video

  • Redundancy: Adjacent video frames are highly correlated, slowing scaling gains (Rajasegaran et al., 9 Jan 2025 ). Mitigation strategies include improved tokenization or loss functions.
  • Token Representation: Discrete quantization can create information bottlenecks, limiting model scaling (Fan et al., 17 Oct 2024 ). Continuous tokens offer higher fidelity and scaling latitude.
  • Decoding Order: Causal (GPT-style) order is efficient for autoregression but can struggle with compositionality; random order with bidirectional attention handles globally conditioned generation (Fan et al., 17 Oct 2024); a toy comparison of the two schedules follows this list.
  • Efficiency: kv-cache and context truncation are essential for tractable inference over long horizons (Gao et al., 16 Jun 2024).
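
The sketch below contrasts the two generation orders purely at the level of the position schedule (masking and attention are omitted); the helper name and step counts are invented for illustration and do not come from the cited paper.

```python
import numpy as np

def decoding_schedule(num_tokens, order="causal", num_steps=4, seed=0):
    """Return the groups of token positions revealed at each generation step.
    'causal' reveals one position at a time, left to right (GPT-style);
    'random' reveals a random permutation in chunks, which pairs naturally with
    bidirectional attention over the positions revealed so far."""
    if order == "causal":
        return [np.array([i]) for i in range(num_tokens)]
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_tokens), num_steps)

causal_steps = decoding_schedule(16, order="causal")               # 16 steps of 1 token
random_steps = decoding_schedule(16, order="random", num_steps=4)  # 4 steps of 4 tokens
```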

Language and Multimodal

  • Few-shot and in-context learning emerge above a critical scale (Black et al., 2022).
  • Language diffusion models converted from AR models can achieve competitive reasoning and infilling capabilities at scale (Gong et al., 23 Oct 2024 ).

Scientific Applications

  • Modeling fat tails and volatility requires multi-kernel or infinitely divisible processes, which can be represented as mixtures or via memory kernels (Zamparo et al., 2013 , Dhull et al., 2023 ).
  • Parameter estimation: Analytical tractability is improved in models with closed-form Laplace transforms and method-of-moments estimators (Dhull et al., 2023); a generic method-of-moments example follows this list.
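
As a simple, generic example of method-of-moments estimation (for a plain stationary AR(1) process, not the specific heavy-tailed models studied in the cited work), the AR coefficient can be recovered by matching the lag-1 autocorrelation:

```python
import numpy as np

def ar1_method_of_moments(x):
    """Estimate phi in x_t = phi * x_{t-1} + eps_t by moment matching:
    for a stationary AR(1), the lag-1 autocorrelation equals phi."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[1:], x[:-1]) / np.dot(x, x)

# Usage: recover phi ~ 0.7 from a simulated series.
rng = np.random.default_rng(1)
series = np.zeros(5_000)
for t in range(1, series.size):
    series[t] = 0.7 * series[t - 1] + rng.normal()
phi_hat = ar1_method_of_moments(series)
```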

6. Practical Implications and Future Directions

  • Generalization across domains: Autoregressive modeling and scaling laws are robust across language, vision, video, and even behavior modeling for autonomous driving (Huang et al., 19 Dec 2024 ).
  • Minimal inductive bias: Large-scale AR transformers with little domain-specific architectural bias can learn rich representations transferable to diverse downstream tasks (Rajasegaran et al., 9 Jan 2025 ).
  • End-to-end tokenization: Tokenizers trained separately from the backbone (e.g., a frozen dVAE) may limit representation quality, especially for generation. Future research will likely emphasize end-to-end learned, possibly continuous, tokenizers (Rajasegaran et al., 9 Jan 2025, Fan et al., 17 Oct 2024).
  • Unified multimodal generative models, capable of both text and image (or video) generation and infilling, are enabled by autoregressive transformer architectures (Yu et al., 2023 ).
  • Hybrid paradigms: Diffusion and AR models are increasingly connected, with recent work translating AR LMs into diffusion LMs while preserving scaling and performance (Gong et al., 23 Oct 2024 ).
  • Data scaling as bottleneck: For very large models, data scale (and diversity) becomes the limiting factor for improvement (Huang et al., 19 Dec 2024 ).
  • Downstream application readiness: AR models scaled with sufficient data and compute reach or surpass the practical requirements for deployment in tasks requiring robustness and long-range temporal or contextual modeling in vision, video, or control (Huang et al., 19 Dec 2024 , Rajasegaran et al., 9 Jan 2025 ).

7. Summary Table: Scaling and Design Considerations in Autoregressive Models

| Domain/Modality | Tokenization | Order/Attention | Best Scaling Design | Key Metrics |
|---|---|---|---|---|
| Language | Discrete (BPE) | Causal (GPT) | Large models + long context | Perplexity, accuracy |
| Image | Discrete (VQ/VQGAN) | Causal or random | Continuous tokens, random order (Fan et al., 17 Oct 2024) | FID, GenEval |
| Video | Discrete (dVAE) | Block/causal (3D) | Causal attention, kv-cache (Gao et al., 16 Jun 2024) | FVD, Step-FVD |
| Multi-modal | Discrete (VQ/CLIP) | Causal + retrieval | Token-based, retrieval-augmented, contrastive decoding (Yu et al., 2023) | FID, CIDEr, VQA accuracy |
| Driving | Discrete | Causal (LLM style) | AR decoder with massive data (Huang et al., 19 Dec 2024) | mADE, mFDE, MR |
| Finance | Real-valued | AR/ARCH | Hybrid AR + regime, scaling kernels (Zamparo et al., 2013) | Volatility clustering |
| Diffusion LM | Discrete | Bidirectional | AR-to-diffusion adaptation (Gong et al., 23 Oct 2024) | Reasoning, infilling |

Autoregressive modeling, exemplified by the GPT series and its extensions, demonstrates robust scaling properties across a spectrum of domains when architectures, tokenizations, and generation orders are aligned with the domain structure. Scaling both model and data yields emergent capabilities and increasingly competitive or superior results on standard benchmarks. Continued research focuses on refining token representation (especially in vision), optimizing computational efficiency, and generalizing autoregressive modeling principles for cross-modal and scientific applications.