Autoregressive Policies

Updated 21 April 2026

Autoregressive policies are sequential decision-making models that factorize action distributions using past actions and contextual inputs, ensuring temporal coherence.
They encompass architectures from next-token predictors to hierarchical, coarse-to-fine models like CARP, Dense Policy, and HiFlow, widely applied in imitation learning and control.
Training strategies such as supervised pretraining, online fine-tuning, and iterative retraining improve sample efficiency and inference speed while managing policy incoherence.

Autoregressive policies are a class of sequential decision-making models in which the policy selects each action by conditioning on the complete history of previous states, actions, and possibly auxiliary information such as goals or context. Formally, an autoregressive policy for horizon $T$ specifies the joint distribution over action sequences as a product of conditional distributions, each providing the probability (or density) of action $a_t$ given all past actions $a_{1:t-1}$ and relevant observations or goals. Autoregressive policies have become foundational in imitation learning, reinforcement learning (RL), and control, spanning discrete and continuous action spaces, and powering state-of-the-art models for robotic manipulation, vision-language-action integration, and foundation model-based decision making.

1. Mathematical Structure of Autoregressive Policies

The core principle underlying autoregressive policies is the chain rule factorization of action distributions:

$p(a_{1:T} \mid \mathrm{context}) = \prod_{t=1}^{T} p(a_t \mid a_{1:t-1}, \mathrm{context})$

where the context may include states, observations, goal variables, or other conditional information. In supervised imitation learning (behavior cloning), models are typically trained to minimize the negative log-likelihood

$\mathcal{L}(\theta) = -\mathbb{E}_{\mathrm{data}} \left[ \sum_{t=1}^T \log\, p_\theta(a_t \mid a_{1:t-1}, \mathrm{context}) \right]$

as in Decision Transformers (Chen et al., 2024), AR-VLA (Hu et al., 10 Mar 2026), and standard autoregressive actor-critic pipelines (Korenkevych et al., 2019).
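As a minimal sketch of the chain rule factorization and the behavior-cloning loss, the Python snippet below sums per-step conditional log-probabilities for binary action sequences. The conditional model `toy_cond_prob` and the two demonstration sequences are hypothetical stand-ins for a learned $p_\theta$ and a real dataset, and the context is omitted for brevity.

```python
import math

def toy_cond_prob(history):
    """Hypothetical p(a_t | a_{1:t-1}): repeat the previous action w.p. 0.8."""
    if not history:
        return {0: 0.5, 1: 0.5}
    prev = history[-1]
    return {prev: 0.8, 1 - prev: 0.2}

def sequence_log_prob(actions, cond_prob):
    """Chain rule: sum of log p(a_t | a_{1:t-1}) over the sequence."""
    total = 0.0
    for t, action in enumerate(actions):
        history = tuple(actions[:t])
        total += math.log(cond_prob(history)[action])
    return total

def bc_loss(dataset, cond_prob):
    """Negative mean sequence log-likelihood over the dataset."""
    log_probs = [sequence_log_prob(seq, cond_prob) for seq in dataset]
    return -sum(log_probs) / len(dataset)

demos = [[0, 0, 1], [1, 1, 1]]  # hypothetical demonstrations
joint = math.exp(sequence_log_prob(demos[0], toy_cond_prob))  # 0.5 * 0.8 * 0.2
loss = bc_loss(demos, toy_cond_prob)
```

Because every conditional is normalized, the joint probabilities of all length-3 sequences sum to one, which is what makes the product a valid distribution over action sequences.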

In continuous control, autoregressive stochastic processes such as AR–p Gaussian processes are leveraged for action noise: $x_t = \sum_{i=1}^p \phi_i x_{t-i} + \sigma \varepsilon_t,\quad \varepsilon_t \sim \mathcal{N}(0,1)$ and the underlying policy outputs

$a_t = \mu_\theta(s_t) + \sigma_\theta(s_t) x_t$

with carefully designed coefficients $\{\phi_i\}$ to guarantee stationarity and standard-normal marginals (Korenkevych et al., 2019).
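The simplest instance is the order-1 case, sketched below under the assumption $\phi_1 = \phi$ with $|\phi| < 1$ and $\sigma = \sqrt{1 - \phi^2}$, which yields a stationary process with unit variance. This is illustrative only: the general AR–p coefficient construction of Korenkevych et al. is not reproduced here, and all function names are hypothetical.

```python
import math
import random

def ar1_noise(num_steps, phi, rng):
    """AR(1) noise x_t = phi * x_{t-1} + sigma * eps_t with unit variance."""
    sigma = math.sqrt(1.0 - phi * phi)  # keeps Var[x_t] = 1 when |phi| < 1
    x = rng.gauss(0.0, 1.0)             # start in the stationary distribution
    xs = [x]
    for _ in range(num_steps - 1):
        x = phi * x + sigma * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

def action(mu, std, x_t):
    """Scalar version of a_t = mu_theta(s_t) + sigma_theta(s_t) * x_t."""
    return mu + std * x_t

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def mean_abs_step(xs):
    """Average magnitude of the change between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(xs, xs[1:])) / (len(xs) - 1)

rng = random.Random(0)
white = ar1_noise(20000, phi=0.0, rng=rng)   # uncorrelated Gaussian noise
smooth = ar1_noise(20000, phi=0.9, rng=rng)  # strongly autocorrelated noise
```

Both sequences have approximately unit variance, but consecutive samples of `smooth` differ far less than those of `white`, which is the property that makes the resulting exploration smoother.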

Modern policies further build coarse-to-fine latent hierarchies, mixing autoregressive prediction over scales (CARP (Gong et al., 2024), HiFlow (Yashima et al., 28 Mar 2026), Dense Policy (Su et al., 17 Mar 2025)) or augmenting context with language and visual embeddings, as in AR-VLA (Hu et al., 10 Mar 2026).

2. Classes and Instantiations of Autoregressive Policies

Next-Token and Next-Chunk Policies: Early autoregressive RL methods model the next action, or a fixed-length action chunk, given the history. This enforces strict causality but can limit the temporal receptive field. These forms are simple and integrate naturally with transformer decoders (e.g., Decision Transformers (Chen et al., 2024), ARP (Korenkevych et al., 2019)).

Coarse-to-Fine Latent Autoregressive Policies: Recent work such as CARP (Gong et al., 2024), HiFlow (Yashima et al., 28 Mar 2026), and Dense Policy (Su et al., 17 Mar 2025) adopts hierarchical sequence factorization. CARP leverages a VQ-VAE to encode multi-scale latent representations of action sequences and autoregresses over $K$ scales:

$p(\mathbf{R} \mid \mathbf{s}) = \prod_{k=1}^K p(r_k \mid r_{1:k-1}, \mathbf{s})$

Dense Policy introduces bidirectional refinement via encoder-only architectures, recursively upsampling and refining action sequences in a logarithmic number of steps, achieving global temporal coherence with sublinear inference (Su et al., 17 Mar 2025). HiFlow eliminates tokenization entirely, directly modeling continuous latent targets at all scales and leveraging flow matching for efficient coarse-to-fine sampling (Yashima et al., 28 Mar 2026).
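The recursive upsample-and-refine idea can be sketched in a few lines of Python. This is a generic illustration rather than the architecture of any of the cited methods: `toy_refine` is a hypothetical stand-in for a learned refinement network, upsampling is plain element repetition, and the target length is assumed to be a power of two.

```python
def upsample(seq):
    """Double the sequence length by repeating each element."""
    return [value for value in seq for _ in range(2)]

def toy_refine(seq, level):
    """Hypothetical stand-in for a learned refinement network.

    Adds an alternating correction whose size shrinks at finer levels.
    """
    scale = 0.5 ** level
    return [v + (scale if i % 2 else -scale) for i, v in enumerate(seq)]

def coarse_to_fine(coarse_action, target_len, refine=toy_refine):
    """Expand one coarse action into target_len actions.

    target_len is assumed to be a power of two. Returns the sequence
    and the number of sequential refinement passes that were needed.
    """
    seq, passes = [coarse_action], 0
    while len(seq) < target_len:
        passes += 1
        seq = refine(upsample(seq), passes)
    return seq, passes

actions, passes = coarse_to_fine(0.0, target_len=16)
```

For `target_len=16` the loop runs 4 passes, one per doubling, whereas a next-token decoder would need 16 sequential steps. Each pass refines the whole sequence at once, which is where the global temporal coherence comes from.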

Autoregressive Action Experts for Vision-Language-Action: AR-VLA (Hu et al., 10 Mar 2026) maintains a hybrid key-value memory of proprioceptive and vision-language features, enabling fully context-aware autoregressive generation in asynchronous robotics settings. Dynamic temporal re-anchoring keeps fast control and slow perceptual modalities synchronized.

Goal-Conditioned Autoregressive Policies: Policies condition not just on task context and past actions but also on explicit goal variables, supporting flexible multi-task and zero-shot generalization. The factorization is

$\pi(a_{1:T} \mid s, g) = \prod_{t=1}^T \pi(a_t \mid s, g, a_{1:t-1})$

but it can induce incoherence, i.e., a mismatch between the continuation the model predicts and the rollouts the policy actually produces (Karwowski et al., 8 Oct 2025).

Stationary Autoregressive Stochastic Policies: In continuous RL, AR–p processes as noise models produce smooth explorations critical for high-frequency control and hardware safety, while maintaining compatibility with standard policy gradient and off-policy schemes (Korenkevych et al., 2019).

3. Algorithms and Training Paradigms

Supervised Autoregressive Pretraining: Decision Transformer and SAD (Chen et al., 2024) employ a causal transformer trained to maximize the log-likelihood of next actions conditioned on state-action-return context windows. Empirically, this supports in-context reinforcement learning and zero-shot adaptation, provided the pretraining distribution or trust horizon is chosen carefully.

State-Action Distillation from Random Policies: SAD (Chen et al., 2024) demonstrates that a dataset constructed solely from random policy rollouts, augmented by selecting outstanding state-action pairs within a trust horizon, supports effective pretraining. The trust horizon is a window of steps within which the best action under random rollouts matches the truly optimal initial action. Under suitable assumptions, models pretrained with SAD achieve regret and generalization bounds matching algorithms that rely on well-trained or optimal behavior policies.
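To make the idea of labeling states from random rollouts concrete, here is a toy Python sketch. It is not the SAD algorithm as published: the chain world, the reward, the horizon of 3 steps, and all function names are assumptions chosen purely for illustration.

```python
import random

GOAL = 5  # hypothetical chain world with states 0..GOAL

def step(state, action):
    """Move left (-1) or right (+1); reward 1.0 when the next state is GOAL."""
    nxt = max(0, min(GOAL, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def rollout_return(state, first_action, horizon, rng):
    """Return of taking first_action, then acting uniformly at random."""
    total, action = 0.0, first_action
    for _ in range(horizon):
        state, reward = step(state, action)
        total += reward
        action = rng.choice((-1, 1))
    return total

def distill_action(state, horizon, num_rollouts, rng):
    """Pick the first action with the best mean random-rollout return."""
    def mean_return(action):
        returns = [rollout_return(state, action, horizon, rng)
                   for _ in range(num_rollouts)]
        return sum(returns) / num_rollouts
    return max((-1, 1), key=mean_return)

rng = random.Random(0)
# One (state, distilled action) pair per non-goal state.
dataset = [(s, distill_action(s, horizon=3, num_rollouts=200, rng=rng))
           for s in range(GOAL)]
```

In this toy setting, states within three steps of the goal are labeled with the optimal action (+1). States 0 and 1 cannot reach the goal within the horizon, so both actions score zero and the label is uninformative, which illustrates why the choice of horizon matters.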

Coarse-to-Fine Autoregressive Training: CARP, Dense Policy, and HiFlow each follow a different training route:

VQ-VAE encoding (CARP), followed by cross-entropy supervised prediction at each scale (Gong et al., 2024).
Direct L2 losses over continuous latent scales, avoiding tokenization (HiFlow) (Yashima et al., 28 Mar 2026).
Recursive encoder refinement (Dense Policy), leveraging cross-attention with the observation encoder, and minimizing layer-wise squared error (Su et al., 17 Mar 2025).

Online Fine-tuning and Self-Coherence: Incoherence, arising when the next-token predictor is not trained on its own rollout distribution, is mitigated by iterated retraining (self-play), control-as-inference, or inference-time temperature annealing (Karwowski et al., 8 Oct 2025). Explicit monitoring and adaptation via reward folding or temperature scheduling correct this bias.
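Inference-time temperature annealing can be sketched as a softmax whose temperature changes over the rollout. The linear schedule, its endpoints, and the direction of annealing (from 1.0 down to 0.1) are assumptions for illustration and are not taken from Karwowski et al.; the logits are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature must be positive."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def linear_anneal(step, num_steps, t_start=1.0, t_end=0.1):
    """Hypothetical schedule: interpolate linearly from t_start to t_end."""
    frac = step / max(1, num_steps - 1)
    return t_start + frac * (t_end - t_start)

def sample_actions(logits_per_step, rng):
    """Sample one action index per step with an annealed temperature."""
    num_steps = len(logits_per_step)
    actions = []
    for t, logits in enumerate(logits_per_step):
        probs = softmax_with_temperature(logits, linear_anneal(t, num_steps))
        actions.append(rng.choices(range(len(probs)), weights=probs)[0])
    return actions

hot = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
cold = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.1)
picked = sample_actions([[2.0, 1.0, 0.0]] * 5, random.Random(0))
```

Lowering the temperature concentrates probability mass on the highest-logit action, so `cold` is much closer to greedy selection than `hot`.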

4. Theoretical Guarantees and Analysis

Sample Efficiency and Generalization: SAD (Chen et al., 2024) proves that, with adequate trust horizons and exact-fit assumptions, in-context RL policies achieve an online regret bound matching prior posterior sampling bounds.

Temporal Coherence and Exploration Quality: ARP (Korenkevych et al., 2019) analytically guarantees that AR–p policies produce stationary, standard-normal marginals with tunable autocorrelation. Experiments, e.g., on UR5 Reacher 2D, show that increasing the AR order and the autocorrelation parameter reduces mean jerk, peak torques, and collision count per episode, facilitating smoother, safer exploration.

Inference Complexity: CARP and Dense Policy demonstrate that hierarchical, scale-wise autoregressive inference requires only $O(\log T)$ sequential refinement steps for a horizon of $T$ actions, compared to $O(T)$ for classical next-token autoregressive prediction or one step per denoising iteration for iterative diffusion (Su et al., 17 Mar 2025, Gong et al., 2024).

Incoherence and Effective Horizon: Karwowski et al. (8 Oct 2025) formalize incoherence as the KL divergence between a naively goal-conditioned policy and its own soft-Q-optimal equivalent. Iterative retraining or reward folding provably reduces incoherence and increases expected return, with the annealing temperature controlling the effective planning horizon.
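As a small numerical illustration of measuring incoherence with a KL divergence, the snippet below compares two discrete action distributions at a single state. Both distributions are hypothetical, and the direction of the divergence (naive policy first) is an assumption, since the text above does not fix it.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same support.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

naive_policy = [0.5, 0.3, 0.2]      # hypothetical goal-conditioned policy
reference_policy = [0.7, 0.2, 0.1]  # hypothetical soft-Q-optimal policy
incoherence = kl_divergence(naive_policy, reference_policy)
```

The divergence is zero exactly when the two distributions agree, so driving it toward zero through retraining corresponds to making the policy consistent with its own soft-optimal counterpart.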

5. Empirical Performance and Comparative Evaluation

Robotic Manipulation and Control Benchmarks: Across simulated and real-world robotic manipulation tasks, coarse-to-fine autoregressive policies (CARP, Dense Policy, HiFlow) consistently match or surpass diffusion-based and non-AR baselines, achieving higher or comparable success rates while reducing inference latency and compute by an order of magnitude (Gong et al., 2024, Yashima et al., 28 Mar 2026, Su et al., 17 Mar 2025).

Method	Success Rate (benchmark avg.)	Inference Speed	Key Features	Source
CARP	0.85 (MimicGen)	6.9 s	VQ-VAE + AR Transformer, multi-scale tokens	(Gong et al., 2024)
HiFlow	0.88 (MimicGen)	5.5 s	Flow-matching, continuous coarse-to-fine AR	(Yashima et al., 28 Mar 2026)
Dense Policy	72% (MetaWorld)	~1 ms	Bidirectional, log-time, encoder-only	(Su et al., 17 Mar 2025)
DecisionT	—	—	Next-token AR, GPT2-style transformer	(Chen et al., 2024)

Vision-Language-Action Robotics: AR-VLA establishes that standalone autoregressive action experts integrated with heavy vision-language perception outperform chunk-based approaches, yielding smoother trajectories, superior history-awareness, and higher task success with lower latency, especially in asynchronous, real-world settings (Hu et al., 10 Mar 2026).

Continuous Control RL: ARP demonstrates increased sample efficiency, faster learning in sparse settings, and reduced unsafe behaviors compared to diagonal Gaussian exploration, especially at high control rates (Korenkevych et al., 2019).

6. Extensions, Limitations, and Practical Considerations

Scalability: Both CARP and HiFlow architectures enable multi-task and multi-modal conditioning via embedding concatenation, allowing flexible scaling to large datasets and complex action spaces. However, for very long-horizon or combinatorial tasks, strict next-token AR may become myopic without latent or bidirectional enhancements (Gong et al., 2024, Su et al., 17 Mar 2025).

Tokenization vs. Token-Free Approaches: While CARP requires learned discrete tokenizers (VQ-VAEs), HiFlow demonstrates that low-dimensional action spaces permit direct, continuous, tokenization-free autoregressive refinement, eliminating quantization error and pipeline complexity (Yashima et al., 28 Mar 2026).

Mitigating Incoherence: Goal-conditioning and next-step factorization can induce policy incoherence. This must be addressed by retraining or regularization; naive deployment can otherwise degrade return, especially in long-horizon or real-world domains (Karwowski et al., 8 Oct 2025).

Real-World Policy Deployment: Empirical results indicate that autoregressive policies—when properly equipped with context memory, hierarchical structures, or bidirectional refinements—enable safe, efficient deployment in both high-frequency and asynchronous environments (Korenkevych et al., 2019, Hu et al., 10 Mar 2026).

7. Broader Implications and Future Directions

The autoregressive policy paradigm underpins many recent advances in scalable, compositional sequence decision making for robotics, RL, and vision-language-action agents. Ongoing frontiers include:

End-to-end unification of tokenization and prediction (as suggested by HiFlow (Yashima et al., 28 Mar 2026)).
Tightening regret and generalization bounds for in-context autoregressive RL with weak or random data (Chen et al., 2024).
Bidirectional, coarse-to-fine, or keyframe-based action refinement for further improvements in coherence and speed (Su et al., 17 Mar 2025).
Integration with foundation vision-language models, leveraging context-aware AR experts for closed-loop, multimodal control (Hu et al., 10 Mar 2026).
Unified treatment of policy incoherence, effective horizon, and scaling regimes incorporating reward folding or self-aligned retraining (Karwowski et al., 8 Oct 2025).

Autoregressive policies thus represent a universal, extensible formalism bridging classic RL exploration, sequential generative modeling, hierarchical control, and embodied intelligence across both simulated and real-world domains.