Autoregressive Language Models (ARMs)
- Autoregressive Language Models (ARMs) are probabilistic neural sequence models that decompose the probability of a sequence into a product of per-token conditional distributions, enabling applications across text, vision, and audio.
- They leverage techniques such as teacher-forcing, causal masking, and KV caching to optimize training and enable efficient, likelihood-based inference.
- Recent innovations focus on overcoming limitations like slow sampling and limited expressivity through predictive sampling, any-order decoding, and hybrid model architectures.
Autoregressive language models (ARMs) are a class of probabilistic neural sequence models that factorize the probability of a sequence into a product of conditional distributions, each predicting a token conditioned on all preceding tokens in a fixed order. This chain-rule factorization underlies a wide range of recent advances in generative modeling for language, vision, audio, and multimodal data, making ARMs a foundational paradigm for likelihood-based modeling, generation, and inference.
1. Fundamental Principles of Autoregressive Language Models
ARMs model a discrete sequence $x_{1:T} = (x_1, \dots, x_T)$ by decomposing its joint probability as

$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),$$

where $x_{<t}$ denotes the prefix $(x_1, \dots, x_{t-1})$. Each conditional $p_\theta(x_t \mid x_{<t})$ is parameterized by a neural network (e.g., Transformer, LSTM, or masked convolution). During training, ARMs maximize the log-likelihood of the observed data via teacher forcing, while during generation, sequences are sampled sequentially, one token at a time.
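To make the factorization and teacher-forced objective concrete, the sketch below computes the per-sequence negative log-likelihood in PyTorch for a hypothetical causal model that maps token ids to next-token logits; the `model(inputs)` interface is an illustrative assumption, not the API of any specific system cited here.

```python
import torch
import torch.nn.functional as F

def teacher_forced_nll(model, tokens):
    """Negative log-likelihood of token sequences under the chain-rule
    factorization p(x_{1:T}) = prod_t p(x_t | x_{<t}).

    `model` is any causal LM mapping token ids (B, T-1) to next-token
    logits (B, T-1, V); `tokens` has shape (B, T).
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # teacher forcing: shift by one
    logits = model(inputs)                            # position t only sees x_{<=t}
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(x_t | x_{<t})
    return -token_ll.sum(dim=-1)                      # per-sequence NLL, shape (B,)
```

Minimizing the mean of this quantity over a corpus is exactly maximum-likelihood training; at generation time the same conditionals are instead sampled one token at a time.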
Popular architectures include transformer-based models such as GPT, LSTM LMs, and autoregressive CNNs (e.g., PixelCNN for images, WaveNet for audio). ARMs exhibit three key properties:
- Local normalization: Each conditional $p_\theta(x_t \mid x_{<t})$ is normalized over the vocabulary, permitting efficient, exact calculation of sequence likelihoods.
- Causal masking: The model's prediction for $x_t$ does not depend on $x_t, \dots, x_T$, only on the prefix $x_{<t}$, ensuring proper autoregression (a minimal mask sketch follows this list).
- Flexible parameterization: They are agnostic with respect to vocabulary, tokenization, or data modality.
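As a minimal sketch of the causal-masking property (assuming standard Transformer-style attention scores; the function names are illustrative), a lower-triangular boolean mask blocks attention to future positions before the softmax:

```python
import torch

def causal_attention_mask(T: int) -> torch.Tensor:
    """Boolean (T, T) mask where position t may attend only to positions <= t,
    so the prediction of the next token never depends on future tokens."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def apply_causal_mask(scores: torch.Tensor) -> torch.Tensor:
    """Set disallowed attention scores to -inf so they get zero weight after softmax."""
    T = scores.size(-1)
    mask = causal_attention_mask(T).to(scores.device)
    return scores.masked_fill(~mask, float("-inf"))
```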
The sequential structure allows for highly effective likelihood-based training and supports diverse downstream applications spanning open-ended text generation, language modeling benchmarks, code synthesis, and autoregressive inference in other modalities.
2. Training Objectives, Orderings, and Generalizations
While left-to-right language modeling is canonical, ARMs can be generalized to support other factorization orders or even arbitrary-conditional inference. Notable generalizations include:
- Permuted ARMs / Arbitrary Orderings: Permuting the factorization order enables the model to infer any token $x_i$ conditioned on an arbitrary observed subset $x_S$ of the remaining positions. This is the foundation of permutation-based models such as XLNet and Any-Order ARMs (Shih et al., 2022); a minimal training-step sketch follows this list.
- Probabilistic Masking: Probabilistically masked LMs (PMLM) train over a distribution of masking ratios and can, with the right prior, implement an objective equivalent to that of permuted autoregressive models (Liao et al., 2020). This enables arbitrary-position generation and controllable infilling tasks.
- Insertion models and infilling: Recent models relax the strict order requirement, enabling arbitrary-position token insertions; this results in greater flexibility for planning and sequence completion, especially in tasks where left-to-right generation is suboptimal (Patel et al., 9 May 2025).
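The sketch below illustrates one stochastic step of such an any-order / arbitrary-conditional objective, assuming a PMLM-style interface in which the network predicts tokens at masked positions; the `model` signature and `mask_id` are illustrative assumptions, not the exact training code of the cited works.

```python
import torch
import torch.nn.functional as F

def any_order_loss(model, tokens, mask_id):
    """Sample a random permutation and a cut point, condition on the tokens
    before the cut (in permutation order), and predict all remaining tokens.

    `model` maps a masked (B, T) sequence to per-position logits (B, T, V).
    """
    B, T = tokens.shape
    perm = torch.rand(B, T).argsort(dim=-1)      # one random order per sequence
    rank = perm.argsort(dim=-1)                  # rank of each position in its order
    k = torch.randint(0, T, (B, 1))              # number of observed tokens
    observed = rank < k                          # positions conditioned on
    inputs = torch.where(observed, tokens, torch.full_like(tokens, mask_id))
    logits = model(inputs)                       # (B, T, V)
    loss = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return (loss * (~observed)).sum() / (~observed).sum().clamp(min=1)
```

Averaged over random orders and cut points, this trains the model to answer arbitrary conditional queries of the form "predict these positions given those."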
Modern ARMs also integrate additional context—such as prefix embeddings from masked LMs—to produce richer representations and improved next-token prediction, lowering perplexity and improving generalization across domains (Zouhar et al., 2022).
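One hypothetical way to realize such fusion is to concatenate a fixed external prefix embedding with the ARM's hidden states before the output head; the linear-fusion design and its dimensions below are assumptions for illustration, not the exact architecture of the cited work.

```python
import torch
import torch.nn as nn

class FusedLMHead(nn.Module):
    """Illustrative fusion of an external prefix embedding (e.g., from a
    masked LM) with ARM hidden states before next-token prediction."""

    def __init__(self, d_model: int, d_ext: int, vocab_size: int):
        super().__init__()
        self.fuse = nn.Linear(d_model + d_ext, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor, ext: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) ARM states; ext: (B, d_ext) prefix embedding
        ext = ext.unsqueeze(1).expand(-1, hidden.size(1), -1)
        fused = torch.tanh(self.fuse(torch.cat([hidden, ext], dim=-1)))
        return self.out(fused)  # next-token logits (B, T, vocab_size)
```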
3. Inference Efficiency and Sampling Algorithms
ARMs are naturally efficient for teacher-forced likelihood computation but are traditionally limited by slow sampling during autoregressive generation. Sampling efficiency and algorithmic innovations have been the subject of intense research:
- Ancestral (sequential) sampling: Generates one token at a time, incurring $T$ network evaluations for a sequence of length $T$. This can be prohibitively slow for long outputs or high-throughput scenarios (Wiggers et al., 2020).
- Predictive Sampling: Forecasts future outputs in bulk using either fixed-point iteration or auxiliary forecasting modules, enabling multiple tokens to be predicted and accepted per forward pass (Wiggers et al., 2020). This reduces ARM calls by up to 27.6× on simple datasets and 3.6–6.7× on complex data.
- Any-Subset Speculative Decoding (ASSD): Allows AS-ARMs to sample tokens in parallel, correcting speculative drafts so that outputs are generated from the correct joint distribution. Algorithmic guarantees ensure quality is preserved without increasing the number of neural network calls beyond the number of generated tokens (Guo et al., 29 Apr 2025). A generic acceptance-rule sketch appears after the table below.
- KV Caching and Batched Inference: Standard ARMs utilize key-value caching for efficient decoding; batch-level parallelism further improves throughput. Despite low arithmetic intensity due to sequential dependence, ARMs scale well with batch size in service settings (Kim et al., 5 Oct 2025).
- Infilling and Arbitrary Conditional Generation: MARIA hybridizes AR and MLM architectures, enabling masked span infilling with the speed and scalability of ARMs via a minimal linear fusion module (Israel et al., 9 Feb 2025). This approach outperforms discrete diffusion models on masked infilling benchmarks.
A summary of inference strategies:
| Inference Method | Parallelism | Speedup over baseline | Samples the true joint? |
|---|---|---|---|
| Ancestral (sequential) | None | 1× | Yes |
| Predictive Sampling | Partial | 3–28× | Approximate, model-dependent |
| ASSD (AS-ARM) | Full | 2–3× | Yes |
| Blockwise/Batched Decoding | Batch-level | High (scales with batch size) | Yes |
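For concreteness, the sketch below implements the standard speculative-sampling acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the residual distribution), which preserves the target model's distribution; it is a generic formulation, not the exact ASSD algorithm.

```python
import torch

def speculative_accept(p_target, q_draft, drafted):
    """Accept/reject drafted tokens so the accepted prefix is distributed as if
    sampled from the target model.

    p_target, q_draft: (K, V) next-token probabilities from the target and draft
    models at each of K drafted positions; drafted: (K,) drafted token ids.
    Returns (number of accepted tokens, replacement token id or None).
    """
    K = drafted.size(0)
    for i in range(K):
        p_i, q_i = p_target[i, drafted[i]], q_draft[i, drafted[i]]
        if torch.rand(()) < torch.clamp(p_i / q_i, max=1.0):
            continue                                          # accept token i
        residual = torch.clamp(p_target[i] - q_draft[i], min=0.0)
        residual = residual / residual.sum()                  # renormalize max(p - q, 0)
        return i, torch.multinomial(residual, 1).item()       # resample at first rejection
    return K, None
```

With a cheap draft model, several tokens are often accepted per target-model forward pass, which is the source of the speedups reported above.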
4. Model Expressivity, Limitations, and Alternatives
Despite their empirical success and computational tractability, ARMs are theoretically limited:
- Expressivity Limitations: The requirement that every conditional be efficiently computable (polynomial time, compact parameters) implies that ARMs cannot model sequence distributions whose local conditionals are NP-hard to compute or that are subject to hard global constraints (Lin et al., 2020). This restriction means ARMs cannot represent certain globally coherent languages or enforce hard logical constraints.
- Energy-Based Models (EBMs): Relaxing local normalization, EBMs define unnormalized joint distributions, enabling modeling of intractable local dependencies and global sequence properties at the cost of efficient sampling and exact normalization (Lin et al., 2020, Wang et al., 2022). Residual EBMs combine an ARM base distribution with an energy-based correction (a schematic form follows this list).
- Latent-Variable ARMs: Augmenting ARMs with latent variables increases expressivity, supporting distributions with global dependencies, but marginalizing over the latents typically renders exact likelihood scoring intractable (Lin et al., 2020).
- Diffusion LLMs (DLMs): These models eschew sequential dependence in favor of iterative, globally parallel denoising steps. DLMs can match or outperform ARMs in terms of in-context learning and instruction following but historically suffer from higher computation costs and scalability issues as sequence length grows (Nie et al., 14 Feb 2025, Kim et al., 5 Oct 2025).
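Schematically, a residual EBM of the kind referenced above combines an ARM base distribution with an unnormalized energy correction (the notation is illustrative):

$$
p_{\theta,\phi}(x_{1:T}) \;=\; \frac{p_\theta^{\mathrm{ARM}}(x_{1:T})\,\exp\!\big(-E_\phi(x_{1:T})\big)}{Z_{\theta,\phi}},
\qquad
Z_{\theta,\phi} \;=\; \sum_{x_{1:T}} p_\theta^{\mathrm{ARM}}(x_{1:T})\,\exp\!\big(-E_\phi(x_{1:T})\big),
$$

where the energy $E_\phi$ can encode global sequence properties but the normalizer $Z_{\theta,\phi}$ is generally intractable, which is precisely the sampling and normalization cost noted above.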
These trade-offs underline the ongoing relevance of ARMs for likelihood-based modeling, as well as the need for principled alternatives in domains where strict sequential dependencies are inappropriate or global coherence is paramount.
5. Applications Across Domains
ARMs are foundational for a broad spectrum of generative and probabilistic modeling tasks:
- Text Generation and Language Modeling: Left-to-right ARMs underpin large language models (GPT-series, Llama, Phi, etc.) and achieve high performance in natural language generation, open-ended dialog, and code synthesis.
- Explicit Likelihood Modeling in Vision/Audio: PixelCNN and WaveNet are ARMs specialized for images and speech.
- Infilling and Editing: Extensions such as MARIA support masked infilling, dynamic text editing, and code completion by allowing ARMs to exploit bidirectional context efficiently (Israel et al., 9 Feb 2025).
- Conditional Modeling and Inpainting: AO-ARMs support arbitrary conditional inference, permitting masked prediction and image inpainting across text, images, and tabular domains (Shih et al., 2022).
- Test-time Alignment and Reward-Guided Generation: Autoregressive reward models provide token-level reward guidance, enabling multi-objective alignment of LLMs with user preferences, even for frozen base models and across diverse objectives (Xu et al., 10 Oct 2024, Lin et al., 6 May 2025). A generic logit-shaping sketch closes this section.
ARMs also support interactive AI systems via fast batched inference, scalable deployment with KV-caching, and fine-grained explainability through token-level attribution and counterfactual editing (Kamahi et al., 21 Aug 2024).
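A generic logit-shaping sketch of such token-level reward guidance is given below; the reward-model interface and the single trade-off weight `beta` are assumptions for illustration, not the specific method of the cited works.

```python
import torch
import torch.nn.functional as F

def reward_guided_step(lm_logits, reward_scores, beta=1.0):
    """One decoding step that combines a frozen ARM's next-token distribution
    with per-token reward estimates.

    lm_logits: (V,) next-token logits from the base ARM.
    reward_scores: (V,) estimated rewards for appending each candidate token
    (assumed to come from an autoregressive reward model).
    beta: trades off base-model fluency against reward.
    """
    guided = F.log_softmax(lm_logits, dim=-1) + beta * reward_scores
    probs = F.softmax(guided, dim=-1)
    return torch.multinomial(probs, 1).item()   # sampled next-token id
```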
6. Architectural and Algorithmic Innovations
Ongoing research aims to overcome ARM limitations and broaden their deployability:
- Insertion-Based and Any-Order Models: ILMs and AS-ARMs introduce paradigms for flexible, out-of-order generation, supporting planning tasks and arbitrary-length masked infilling with stronger dependency modeling than left-to-right ARMs (Patel et al., 9 May 2025, Guo et al., 29 Apr 2025).
- Redundancy Reduction and Loss Reweighting: Recent AO-ARM work reduces factorization redundancy and aligns training loss weighting with inference frequency, yielding improved joint and marginal likelihoods (Shih et al., 2022).
- Hybrid ARM-Diffusion Architectures: ACDC integrates global sequence planning from ARMs with local error correction via diffusion models and memory-conditioned context, achieving superior multimodal output quality and robustness for story or video generation (Chung et al., 7 Oct 2024).
- Efficient Sampling and Distillation: Predictive sampling (Wiggers et al., 2020) and self-distillation through time enable faster ARM sampling, and approaches such as blockwise decoding in DLMs aim for ARM-like scalability (Deschenaux et al., 28 Oct 2024, Kim et al., 5 Oct 2025).
- Stack-Based Parse State Probing: Probing studies reveal that standard ARMs maintain implicit, incrementally updated representations akin to parse stacks, facilitating real-time syntactic disambiguation and offering new interpretable control points (Eisape et al., 2022).
7. Performance Scaling, Computational Considerations, and Future Outlook
ARMs exhibit low arithmetic intensity during sequential decoding: each new token depends on all previous tokens, and decoding relies heavily on the KV cache, so inference becomes memory-bound at long contexts (Kim et al., 5 Oct 2025). Despite these constraints, ARMs deliver superior throughput in batched inference and long-context scenarios compared to contemporary diffusion models, owing to effective cache utilization and sequence-parallel scaling.
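A back-of-the-envelope sketch of why long-context decoding is memory-bound: the KV cache grows linearly in context length and batch size. The layer and head counts below are illustrative placeholders, not the configuration of any particular published model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Approximate KV-cache footprint: keys and values (factor 2) are stored
    for every layer, KV head, and past position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative numbers: 32 layers, 8 KV heads of dimension 128,
# 32k-token context, batch size 8, fp16 values -> 32 GiB of cache.
print(kv_cache_bytes(32, 8, 128, 32_768, 8) / 2**30, "GiB")
```

At these illustrative settings the cache alone occupies 32 GiB, dwarfing the arithmetic performed per generated token and explaining why throughput is governed by memory bandwidth rather than compute.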
Emergent directions include:
- Reducing ARM sampling latency via parallel speculative decoding, predictive forecasting, and hybridization with blockwise or diffusion procedures.
- Unified frameworks for arbitrary-conditional and parallel sampling leveraging AS-ARM and AO-ARM innovations.
- Integration of external embeddings (e.g., sentence, multimodal, or knowledge-based) and efficient fusion techniques for cross-domain robustness.
- Unified reward-driven alignment and dynamic preference-aware generation, enabled by single-ARM multi-objective architectures with parameter-efficient conditioning.
As diffusion LLMs mature and further bridge the efficiency-quality gap with ARMs, hybrid paradigms and principled performance trade-offs are likely to reshape sequence modeling. Nevertheless, the foundational role of autoregressive modeling in likelihood-based sequence generation, conditional inference, controllability, and performance-driven scaling remains central to both theoretical and practical advances in language technologies.