Autoregressive Models and Scaling (GPT Series)

Updated 1 July 2025
  • Autoregressive models predict each output based on prior outputs, forming sequential dependencies, while scaling involves increasing model, data, and compute capacity to enhance performance, notably seen in the GPT series across language, vision, and other domains.
  • Scaling laws empirically show performance improvements as model size, dataset size, and compute increase, leading to emergent capabilities like few-shot learning in language models and state-of-the-art results in vision and video generation.
  • Architectural principles vary across domains, utilizing decoder-only or encoder-decoder transformers with causal or random attention and evolving tokenization strategies (discrete vs. continuous) to address domain-specific challenges and enable multi-modal applications.

Autoregressive models are a class of statistical and machine learning models in which each output is predicted conditionally on previous outputs, establishing a chain of dependencies across sequences. In the context of large-scale machine learning, "scaling" refers to the systematic expansion of model capacity, dataset size, and compute to improve performance. The GPT (Generative Pre-trained Transformer) series, and its extensions into language, vision, video, and multi-modal tasks, serves as a central reference for the study of autoregressive modeling and scaling laws. This article surveys key principles, architectural methodologies, scaling behaviors, and cross-domain applications of autoregressive models, synthesizing insights from a broad range of research across language, vision, time series, and scientific modeling.

1. Foundations of Autoregressive Models

Autoregressive (AR) models define each element of a sequence as a probabilistic function of its prior elements, typically factorizing the joint probability as $p(x^{1}, \ldots, x^{n}) = \prod_{i=1}^{n} p(x^{i} \mid x^{1}, \ldots, x^{i-1})$. This framework underlies classical statistical models (AR, ARCH, ARMA, ARFIMA; see (1305.3243, 1508.07715, 2309.02661)) and forms the core of modern neural sequence models such as the GPT series. In language, autoregressive modeling predicts the next token conditioned on previous tokens, facilitating open-ended generation and few-shot adaptation.
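A minimal sketch of this factorization in code, assuming a hypothetical `model` that maps a token tensor to next-token logits (any GPT-style causal decoder would do):

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(model, tokens):
    """Chain rule: log p(x) = sum_i log p(x_i | x_<i).

    `tokens` has shape (batch, seq_len); `model` is a hypothetical causal LM
    returning next-token logits of shape (batch, seq_len - 1, vocab) when fed
    the first seq_len - 1 tokens.
    """
    logits = model(tokens[:, :-1])                 # predict positions 1..n-1
    log_probs = F.log_softmax(logits, dim=-1)      # per-position conditionals
    targets = tokens[:, 1:]                        # ground-truth next tokens
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum(dim=-1)                    # one log-likelihood per sequence
```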

Variants exist across domains:

  • Language: Decoder-only transformers (GPT, GPT-NeoX (2204.06745)), where each token is produced conditionally and attention masks enforce causality (a minimal causal-mask sketch follows this list).
  • Vision: Pixels or latent patches are predicted conditionally, e.g., in iGPT or autoregressive text-to-image/video models (2206.10789, 2410.13863, 2501.05453).
  • Video: Visual tokens or frames are modeled in temporal order (1906.02634, 2406.10981, 2501.05453).
  • Finance/time series: Classical AR or ARCH-type models with memory kernels control dependencies (1305.3243, 1508.07715).
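As noted in the language item above, causal masking is what makes a transformer autoregressive: position i may attend only to positions at or before i. A minimal single-head sketch, not tied to any particular GPT implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Lower-triangular boolean mask: True where attention is allowed.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def causal_attention(q, k, v):
    # Scaled dot-product attention with causality enforced by the mask.
    # q, k, v: (batch, seq_len, dim)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(~causal_mask(q.size(-2)), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```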

2. Scaling Laws and Model Capacity

Scaling laws describe how model performance improves as a function of model size (number of parameters), dataset size, and compute.

  • For LLMs (2204.06745, 2410.17891), empirical investigations show that validation loss decreases as a power law in model size and dataset size (an illustrative fit is sketched after this list).
  • In vision and video (2206.10789, 2410.13863, 2501.05453), similar but sometimes slower scaling curves are observed, with lower exponents than for text (2501.05453) and validation loss (typically negative log-likelihood) decreasing steadily with increased capacity.
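The illustrative fit referenced above: the (parameter count, loss) pairs below are invented for the sake of the example, but fitting a saturating power law L(N) = a * N^(-alpha) + c is how scaling exponents are typically extracted from such sweeps:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (model size, validation loss) pairs; real studies fit analogous sweeps.
params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
losses = np.array([3.10, 2.85, 2.62, 2.44, 2.30])

def power_law(n, a, alpha, c):
    # L(N) = a * N**(-alpha) + c: loss falls as a power law toward an irreducible floor c.
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 1.0], maxfev=10000)
print(f"fitted exponent alpha = {alpha:.3f}, irreducible loss c = {c:.2f}")
```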

Scaling also leads to the emergence of new capabilities, such as few-shot and in-context learning in language models (2204.06745).

However, for some modalities (notably images and videos), not all architectures yield continued improvements on qualitative metrics (e.g., FID, GenEval) unless design choices such as continuous tokens and random decoding order are employed (2410.13863).

3. Architectural Principles and Variants

Language (GPT Family)

  • Decoder-only Transformer backbones with self-attention and causal masking.
  • Rotary positional embeddings improve extrapolation and efficiency (2204.06745, 2501.05453); a minimal sketch follows this list.
  • Parallelization strategies (tensor, pipeline, data parallelism) enable training at the 20B+ parameter scale (2204.06745).
  • Improved tokenization (e.g., BPE variants, whitespace handling) reduces sequence length and improves performance (2204.06745).
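A minimal sketch of the rotary-embedding idea mentioned above, using the common "rotate-half" formulation; the base frequency and shapes are illustrative, and this is not a drop-in replacement for any library's implementation:

```python
import torch

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, seq_len, dim) with dim even. Each channel pair is rotated by a
    # position-dependent angle, so relative offsets appear directly in q.k products.
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```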

Vision and Text-to-Image

  • Encoder-decoder (Parti (2206.10789)) or decoder-only (Fluid (2410.13863), CM3Leon (2309.02591)) transformers.
  • Tokenization: Early methods use discrete VQ tokens (2206.10789); more recent models, such as Fluid, employ continuous tokens for higher fidelity (2410.13863).
  • Generation Order: Raster-scan causal order (GPT-style) vs. random order (MaskGIT/BERT-style) with bidirectional attention (2410.13863).
  • Random order + continuous tokens yield sustained scaling improvements on perceptual metrics (2410.13863).
  • Guidance and reranking methods (classifier-free guidance, contrastive decoding) improve alignment with prompts and visual fidelity (2206.10789, 2309.02591).
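A common way to implement the classifier-free guidance mentioned in the last item, sketched for token-by-token decoding (the guidance scale and the two logit passes are assumptions, not the exact recipe of the cited systems):

```python
import torch

def guided_next_token(cond_logits: torch.Tensor,
                      uncond_logits: torch.Tensor,
                      guidance_scale: float = 3.0) -> torch.Tensor:
    # Push the conditional distribution away from the unconditional one,
    # then sample the next visual token. cond/uncond logits: (batch, vocab).
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # (batch, 1) token ids
```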

Video

  • Block-local 3D self-attention enables scalable modeling of large spatiotemporal volumes (1906.02634); see the sketch after this list.
  • Causal temporal attention and kv-cache, adapted from LLM deployment, enable arbitrarily long context in video diffusion autoregressive models (2406.10981).
  • Tokenization: Discretized visual tokens via VQ methods (2501.05453) or continuous latent representations.
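A simplified sketch of the block-local 3D self-attention referenced above: tokens attend only within non-overlapping spatiotemporal blocks, so cost grows with the block size rather than with the full video volume (single head, illustrative block shape, dimensions assumed divisible):

```python
import torch

def block_local_attention(x: torch.Tensor, block=(4, 8, 8)) -> torch.Tensor:
    # x: (T, H, W, dim); T, H, W must be divisible by the block sizes.
    T, H, W, d = x.shape
    bt, bh, bw = block
    # Group tokens into (num_blocks, tokens_per_block, dim).
    blocks = (x.reshape(T // bt, bt, H // bh, bh, W // bw, bw, d)
               .permute(0, 2, 4, 1, 3, 5, 6)
               .reshape(-1, bt * bh * bw, d))
    attn = torch.softmax(blocks @ blocks.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = attn @ blocks
    # Restore the original (T, H, W, dim) layout.
    return (out.reshape(T // bt, H // bh, W // bw, bt, bh, bw, d)
               .permute(0, 3, 1, 4, 2, 5, 6)
               .reshape(T, H, W, d))
```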

Multimodality

  • Unified token-based modeling enables handling of interleaved text and image tokens (2309.02591).
  • Retrieval augmentation and instruction tuning facilitate controllable, generalizable multi-modal generation (2309.02591).

Scientific and Financial Time Series

  • Hybrid AR/ARCH models with scaling symmetry capture fat tails, multiscaling, and volatility bursts (1305.3243); a toy simulation follows this list.
  • Memory kernels—power law or geometric infinitely divisible innovations—yield long-term dependency, heavy-tailed behavior, and tractability for quantitative estimation (1508.07715, 2309.02661).
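The toy simulation referenced above is a generic AR(1)+ARCH(1) process, not the specific models of the cited papers; it shows how feeding past shocks into the conditional variance produces volatility clustering and heavy tails:

```python
import numpy as np

def simulate_ar_arch(n=2000, phi=0.2, omega=0.1, alpha=0.85, seed=0):
    # r_t = phi * r_{t-1} + sigma_t * eps_t,  sigma_t^2 = omega + alpha * r_{t-1}^2
    # Large shocks raise next-step variance, so big moves cluster in time.
    rng = np.random.default_rng(seed)
    r = np.zeros(n)
    for t in range(1, n):
        sigma2 = omega + alpha * r[t - 1] ** 2
        r[t] = phi * r[t - 1] + np.sqrt(sigma2) * rng.standard_normal()
    return r

returns = simulate_ar_arch()
excess_kurtosis = ((returns - returns.mean()) ** 4).mean() / returns.var() ** 2 - 3.0
print(f"excess kurtosis = {excess_kurtosis:.2f}")   # well above 0: heavy tails
```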

4. Performance Metrics and Empirical Evaluation

Autoregressive models are evaluated using domain-appropriate metrics:

  • Language: Per-token log-likelihood, perplexity, and zero/few-shot accuracy on LLM benchmarks (2204.06745, 2410.17891); a perplexity helper is sketched after this list.
  • Image/Text-to-Image: FID (Fréchet Inception Distance), GenEval (prompt-object alignment), PSNR for token reconstruction (2206.10789, 2410.13863).
  • Video: FVD, Step-FVD, and ΔEdgeFD for temporal and transition consistency (1906.02634, 2406.10981).
  • Driving/Robotics: minADE, minFDE, collision and offroad rates, mAP (2412.14415).
  • Scientific time series: Scaling exponents, autocorrelation decay rates, root-mean-square displacement (1305.3243, 1508.07715).
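For reference, perplexity is just the exponential of the mean per-token negative log-likelihood; a minimal helper (the numbers in the example are made up):

```python
import math

def perplexity(total_log_likelihood: float, num_tokens: int) -> float:
    # exp of the mean negative log-likelihood per token; lower is better.
    return math.exp(-total_log_likelihood / num_tokens)

# e.g., 1000 tokens with a summed log-likelihood of -2500 nats
print(perplexity(-2500.0, 1000))   # exp(2.5), roughly 12.18
```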

Scaling studies consistently report that increasing data and model size yields lower losses and often better downstream metrics, but with diminishing returns at very large scales if data does not keep pace (2412.14415, 2501.05453).

5. Domain-Specific Challenges and Solutions

Vision and Video

  • Redundancy: Adjacent video frames are highly correlated, slowing scaling gains (2501.05453). Mitigation strategies include improved tokenization or loss functions.
  • Token Representation: Discrete quantization can create information bottlenecks, limiting model scaling (2410.13863). Continuous tokens offer higher fidelity and scaling latitude.
  • Decoding Order: Causal (GPT-style) order is efficient for autoregression but can struggle with compositionality; random order with bidirectional attention handles globally conditioned generation (2410.13863).
  • Efficiency: kv-cache and context truncation are essential for tractable inference at long horizons (2406.10981).
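A minimal sketch of a kv-cache with optional context truncation as described in the last item; the shapes and windowing policy are illustrative, not the scheme of any particular system:

```python
from typing import Optional
import torch

class KVCache:
    """Append-only key/value cache for causal decoding."""

    def __init__(self, window: Optional[int] = None):
        self.k: Optional[torch.Tensor] = None
        self.v: Optional[torch.Tensor] = None
        self.window = window  # keep only the most recent `window` positions if set

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, 1, dim) for the current decoding step.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        if self.window is not None:  # simple context truncation
            self.k = self.k[:, -self.window:]
            self.v = self.v[:, -self.window:]
        return self.k, self.v  # attend over cached history instead of recomputing it
```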

Language and Multimodal

  • Few-shot and in-context learning emerge above critical scale (2204.06745).
  • Language diffusion models converted from AR models can achieve competitive reasoning and infilling capabilities at scale (2410.17891).

Scientific Applications

  • Modeling fat tails and volatility requires multi-kernel or infinitely divisible processes, which can be represented as mixtures or via memory kernels (1305.3243, 2309.02661).
  • Parameter estimation: Analytical tractability is improved in models with closed-form Laplace transforms and method-of-moments estimators (2309.02661).
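As a generic illustration of moment-based estimation (a zero-mean AR(1) fitted by Yule-Walker moment matching), not the closed-form Laplace-transform estimators of the cited work:

```python
import numpy as np

def ar1_method_of_moments(x: np.ndarray):
    # Match the lag-1 autocovariance and variance of x_t = phi * x_{t-1} + eps_t.
    x = x - x.mean()
    phi_hat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])   # lag-1 autocorrelation
    sigma2_hat = x.var() * (1.0 - phi_hat ** 2)                # innovation variance
    return phi_hat, sigma2_hat
```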

6. Practical Implications and Future Directions

  • Generalization across domains: Autoregressive modeling and scaling laws are robust across language, vision, video, and even behavior modeling for autonomous driving (2412.14415).
  • Minimal inductive bias: Large-scale AR transformers with little domain-specific architectural bias can learn rich representations transferable to diverse downstream tasks (2501.05453).
  • End-to-end tokenization: Tokenizers not learned end-to-end with the generative model (e.g., separately trained dVAEs) may limit representation quality, especially for generation. Future research will likely emphasize end-to-end learned, possibly continuous, tokenizers (2501.05453, 2410.13863).
  • Unified multimodal generative models, capable of both text and image (or video) generation and infilling, are enabled by autoregressive transformer architectures (2309.02591).
  • Hybrid paradigms: Diffusion and AR models are increasingly connected, with recent work translating AR LMs into diffusion LMs while preserving scaling and performance (2410.17891).
  • Data scaling as bottleneck: For very large models, data scale (and diversity) becomes the limiting factor for improvement (2412.14415).
  • Downstream application readiness: AR models scaled with sufficient data and compute reach or surpass the practical requirements for deployment in tasks requiring robustness and long-range temporal or contextual modeling in vision, video, or control (2412.14415, 2501.05453).

7. Summary Table: Scaling and Design Considerations in Autoregressive Models

| Domain/Modality | Tokenization | Order/Attention | Best Scaling Design | Key Metrics |
|---|---|---|---|---|
| Language | Discrete (BPE) | Causal (GPT) | Large models + long context | Perplexity, accuracy |
| Image | Discrete (VQ/VQGAN) | Causal or random | Continuous tokens, random order (2410.13863) | FID, GenEval |
| Video | Discrete (dVAE) | Block/causal (3D) | Causal attention, kv-cache (2406.10981) | FVD, Step-FVD |
| Multi-modal | Discrete (VQ/CLIP) | Causal + retrieval | Token-based, retrieval-aug., contrastive dec. (2309.02591) | FID, CIDEr, VQA acc. |
| Driving | Discrete | Causal (LLM style) | AR decoder with massive data (2412.14415) | minADE, minFDE, MR |
| Finance | Real-valued | AR/ARCH | Hybrid AR + regime, scaling kernels (1305.3243) | Volatility clustering |
| Diffusion LM | Discrete | Bidirectional | AR-to-diffusion adaptation (2410.17891) | Reasoning, infilling |

Autoregressive modeling, exemplified by the GPT series and its extensions, demonstrates robust scaling properties across a spectrum of domains when architectures, tokenizations, and generation orders are aligned with the domain structure. Scaling both model and data yields emergent capabilities and increasingly competitive or superior results on standard benchmarks. Continued research focuses on refining token representation (especially in vision), optimizing computational efficiency, and generalizing autoregressive modeling principles for cross-modal and scientific applications.