
Order-Agnostic ARMs

Updated 7 November 2025
  • OA-ARMs are probabilistic models that enable tractable inference on arbitrary variable subsets by training over all possible orderings.
  • The MAC protocol reduces redundancy from O(N!) to O(2^N) by deterministically selecting a canonical decomposition for each variable subset.
  • Empirical results demonstrate MAC's superiority across domains such as text (Text8), images (CIFAR10/ImageNet32), tabular data, and robotics.

Order-Agnostic Autoregressive Models (OA-ARMs) are a class of probabilistic models designed for tasks where conditional inference or generative modeling must be tractable and accurate for arbitrary subsets of variables, as opposed to requiring a fixed ordering such as left-to-right generation in language modeling. OA-ARMs generalize traditional autoregressive models by supporting efficient inference on all possible marginal or conditional distributions, and have gained prominence due to their applicability across domains such as masked language modeling, image inpainting, tabular data modeling, and complex structured generative tasks.

1. Fundamental Principles and Historical Context

Classical autoregressive models (ARMs) decompose the joint distribution over $N$ variables, $p(x)$, as a product of one-dimensional conditional distributions dictated by some variable order: $p(x) = \prod_{t=1}^{N} p(x_{\sigma(t)} \mid x_{\sigma(<t)})$, where $\sigma$ is a permutation of $\{1, \dots, N\}$. While fixed-ordering models (e.g., left-to-right for text) offer tractable generation and maximum likelihood estimation, they cannot efficiently support conditional inference over arbitrary observed/missing patterns.

To address this, OA-ARMs (also termed Any-Order Autoregressive Models, AO-ARMs) were developed, supporting probabilistic inference for any subset of variables by training over all possible orderings (i.e., over all $N!$ permutations). Notable early models include NADE and DeepNADE, later followed by ARDMs, as well as permutation-based approaches such as XLNet for language modeling.
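
To make the training scheme concrete, the following is a minimal sketch of one order-agnostic training step: a uniformly random observed subset is drawn, and the model is asked to predict every remaining variable given that subset. The `model(x_masked, observed)` interface and the use of 0 as the mask token are assumptions for illustration, not any particular codebase's API.

```python
import torch

def oa_arm_training_step(model, x, optimizer):
    """One order-agnostic training step on a batch of discrete variables.

    A minimal sketch: `model(x_masked, observed)` is assumed to return logits
    of shape (batch, N, vocab_size); x is a LongTensor of shape (batch, N).
    """
    batch, N = x.shape

    # Sampling a uniform ordering and a uniform prefix length is equivalent to
    # sampling |e| uniformly and then a uniform observed subset of that size.
    num_obs = torch.randint(0, N, (batch, 1))                  # |e| in {0, ..., N-1}
    ranks = torch.rand(batch, N).argsort(dim=1).argsort(dim=1)
    observed = ranks < num_obs                                 # True = conditioned on

    x_masked = torch.where(observed, x, torch.zeros_like(x))   # 0 acts as [MASK] here
    logits = model(x_masked, observed)

    # Average log-likelihood of the unobserved variables, scaled by 1/(N - |e|):
    # the standard order-agnostic maximum-likelihood objective.
    log_probs = torch.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)
    unobserved = (~observed).float()
    loss = -((token_ll * unobserved).sum(1) / unobserved.sum(1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```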

Applications of OA-ARMs include but are not limited to masked language modeling (BERT), image inpainting, tabular missing data imputation, and conditional generative modeling in multimodal contexts.

2. Redundancy, Scalability, and the Need for Efficient OA-ARM Training

OA-ARM formulations such as NADE, ARDM, and ACE cover all marginal and conditional distributions by factorizing over all orderings. This introduces substantial redundancy:

  • The same conditional probability is modeled multiple times through different orderings (e.g., $p(x_j \mid x_{e \setminus j})$ appears in many different permutations for subset $e$),
  • The number of modeled conditionals grows super-exponentially with data dimensionality,
  • Limited model capacity is diffused over redundant conditionals, impairing fit and generalization.

The paper "Training and Inference on Any-Order Autoregressive Models the Right Way" (Shih et al., 2022) rigorously characterizes this redundancy by representing the space of conditionals on a binary lattice of all mask subsets, showing that efficient marginal inference requires learning only a non-redundant spanning set of univariate conditionals ("one incoming edge per mask node").

3. The MAC Protocol: Redundancy Elimination and Training Alignment

The Mask-tuned Arbitrary Conditional Model (MAC) framework introduces principled improvements to OA-ARM training and inference:

  • Deterministic Recursive Decomposition: For each variable subset (mask) $e$, MAC deterministically selects a canonical variable to remove for decomposition (e.g., the largest by lexicographic or learned order). This ensures every marginal is decomposed uniquely, removing redundancy and reducing the set of learned conditionals from order $O(N!)$ to $O(2^N)$, a reduction crucial for tractability.
  • Edge Upweighting by Expected Usage: MAC computes the expected frequency (induced edge distribution) with which each univariate conditional is traversed during marginal inference under the actual mask distribution $M$ (reflecting downstream query patterns). The loss for each conditional is upweighted according to this frequency, correcting the training/inference mismatch and improving sample efficiency on heavily used conditionals.
  • Training Objective:

$$\mathcal{L}_M(\theta) = -\,\mathbb{E}_{e \sim wM}\left[\frac{1}{N-|e|}\sum_{i \in X \setminus e}\log p_\theta(x_i \mid x_e)\right]$$

The expectation is taken over mask patterns and edges as induced by $M$ and the decomposition protocol $w$; the $1/(N-|e|)$ factor is a heuristic cardinality-based reweighting that aids generalization.
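
As an illustration of the decomposition protocol, the sketch below implements the "remove the largest index" rule mentioned above and evaluates a marginal along the unique chain it induces. The `log_cond` callable stands in for a learned conditional $p_\theta(x_i \mid x_e)$ and is hypothetical.

```python
def canonical_decomposition(e):
    """Deterministic protocol w: for an observed subset e (variable indices),
    repeatedly peel off the largest index, yielding a unique chain of
    (target, context) pairs for the marginal p(x_e)."""
    e = sorted(e)
    chain = []
    while e:
        target, e = e[-1], e[:-1]         # canonical choice: largest index
        chain.append((target, tuple(e)))  # remaining variables form the context
    return chain

def log_marginal(x, e, log_cond):
    """log p(x_e) along the unique chain selected by w; `log_cond(i, ctx, x)`
    is a hypothetical callable returning log p_theta(x_i | x_ctx)."""
    return sum(log_cond(i, ctx, x) for i, ctx in canonical_decomposition(e))

# Example: p(x_1, x_4, x_6) = p(x_6 | x_1, x_4) * p(x_4 | x_1) * p(x_1)
print(canonical_decomposition([1, 4, 6]))   # [(6, (1, 4)), (4, (1,)), (1, ())]
```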

Algorithmically, training alternates between sampling mask subsets according to $M$, recursively decomposing them according to $w$, and updating the relevant univariate conditional heads. At test time, the same recursive decomposition is run for efficient conditional likelihood evaluation or sampling.
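
One simple way to realize the usage-aligned weighting is to estimate edge frequencies by Monte Carlo and then sample training targets in proportion to them, as sketched below. The `sample_mask` and `w_decompose` callables are placeholders (e.g., the decomposition sketch above); this is an assumed simplification, not the paper's implementation.

```python
import random
from collections import Counter

def estimate_edge_usage(sample_mask, w_decompose, num_samples=10_000):
    """Monte-Carlo estimate of how often each univariate conditional (edge)
    is traversed when answering queries e ~ M.

    `sample_mask()` draws an observed subset from the mask distribution M;
    `w_decompose(e)` is the deterministic decomposition protocol."""
    usage = Counter()
    for _ in range(num_samples):
        for edge in w_decompose(sample_mask()):   # edge = (target, context)
            usage[edge] += 1
    total = sum(usage.values())
    return {edge: count / total for edge, count in usage.items()}

def sample_training_edge(usage):
    """Draw an edge in proportion to its estimated usage, so gradient updates
    (and hence model capacity) concentrate on frequently queried conditionals."""
    edges, weights = zip(*usage.items())
    return random.choices(edges, weights=weights, k=1)[0]
```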

4. Empirical Results: Text, Image, Tabular, and Robotic Domains

State-of-the-art empirical performance is demonstrated with MAC across diverse data modalities:

  • Text (Text8): MAC achieves the best marginal and joint log-likelihoods among arbitrary conditional models, approaching the performance of optimized fixed-order Transformers.
  • Images (CIFAR10, ImageNet32): MAC outperforms ARDM both in joint and marginal log-likelihood, even exceeding some "joint" (fixed-order) models on ImageNet32.
  • Tabular Data: Across five continuous tabular benchmarks, MAC consistently matches or surpasses the performance of strong baselines such as ACE, ACFlow, and SPFlow in both marginal and conditional likelihoods.
  • Robotics (FrankaKitchen shared autonomy): For conditional inference in filling missing action dimensions, MAC yields higher reward than independent behavior cloning (BC) policies and approaches the full-autonomy policy.

Domain                   | MAC Improvement (over prior OA-ARMs)
Text8 (char-level)       | Best marginal/joint log-likelihood among arbitrary conditional models
CIFAR10/ImageNet32       | State-of-the-art joint/marginal likelihood; competitive with the best fixed-order AR models
Tabular (ACE benchmarks) | Improved marginal/conditional likelihoods on most splits
Robotic control          | Better shared autonomy (action inference)

5. Advances in OA-ARM Theory, Optimization, and Extensions

OA-ARMs are closely related to masked models (BERT, MaskGIT) and recent diffusion-based generative models. Key recent findings and extensions include:

  • Recursive Decomposition as a Binary Lattice: Marginal likelihoods can be recursively factorized along a binary lattice of subsets, requiring only one spanning (non-redundant) set of conditionals.
  • Order-Policy Learning: Extensions such as Learning-Order ARMs (Wang et al., 7 Mar 2025) parameterize a state-dependent, trainable probability distribution ("order-policy") over variable orderings. This enables models to prefer, and adaptively infer, generation orders advantageous for a given data type (such as edge-first policies for molecular graphs); a toy sketch appears after this list.
  • Efficiency and Scalability: An orders-of-magnitude reduction in the conditional/factor space enables scaling to high-dimensional data (images, graphs, multi-modal, tabular) that was previously intractable for classical OA-ARMs.
  • Tractable Marginal and Conditional Inference: The reduction in redundancy, combined with computation-aligned training, ensures that OA-ARMs retain their capacity for efficient arbitrary conditional inference without compromising tractability.
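
As referenced in the order-policy item above, the following toy module illustrates the basic idea of a state-dependent distribution over which variable to generate next; the actual parameterization and training procedure of Learning-Order ARMs may differ, so this is only an illustrative sketch.

```python
import torch
import torch.nn as nn

class OrderPolicy(nn.Module):
    """Toy state-dependent order-policy: score every not-yet-generated variable
    and sample which one to generate next. Purely illustrative."""

    def __init__(self, hidden_dim: int, num_vars: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, num_vars)

    def forward(self, state, generated_mask):
        # state: (batch, hidden_dim) summary of what has been generated so far;
        # generated_mask: (batch, num_vars) bool, True where already generated.
        logits = self.score(state).masked_fill(generated_mask, float("-inf"))
        return torch.distributions.Categorical(logits=logits)

# At each generation step: dist = policy(state, mask); next_var = dist.sample();
# dist.log_prob(next_var) can be trained jointly with the generative likelihood.
```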

6. Practical Implications, Deployment, and Current Limitations

Modern OA-ARMs with redundancy-removal and usage-aligned training (e.g., MAC) resolve key bottlenecks: reduced memory and computational cost, improved generalization, and theoretical guarantees of tractable arbitrary conditional inference. These advances make OA-ARMs prime candidates for:

  • Masked modeling and data completion in vision, language, tabular, or structured data domains,
  • Robotic policy completion under variable action constraints,
  • Flexible generative modeling where conditional queries are diverse and high-dimensional.

Deployment considerations include the selection/tuning of the decomposition protocol $w$ and the mask distribution $M$ to match actual inference scenarios. OA-ARM variants can further refine efficiency by learning the decomposition protocol, using data-driven or structure-aware ordering (e.g., learned entropy-minimizing policies for graphs).
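
For instance, if the expected workload is image inpainting with rectangular occlusions, $M$ can be chosen to sample exactly those observation patterns rather than uniformly random subsets. A minimal sketch, with the occlusion model and all names assumed for illustration:

```python
import numpy as np

def sample_inpainting_mask(height, width, rng=None):
    """Draw an observed-pixel mask matching a rectangular-occlusion inpainting
    workload (True = observed). An illustrative, assumed choice of M; any
    distribution matching the expected queries can be substituted."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.ones((height, width), dtype=bool)
    h = rng.integers(height // 4, height // 2 + 1)
    w = rng.integers(width // 4, width // 2 + 1)
    top = rng.integers(0, height - h + 1)
    left = rng.integers(0, width - w + 1)
    mask[top:top + h, left:left + w] = False      # occluded block is unobserved
    return mask
```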

However, in domains with strong intrinsic ordering (e.g., natural language), fixed-order ARMs or models that leverage explicit sequential dependencies can still outperform OA-ARMs, especially for long-range sequential modeling. The residual computational cost of training and storing all necessary conditional heads, while dramatically improved, may still be significant in extreme dimensions.

7. Outlook and Ongoing Research Directions

Order-agnostic autoregressive modeling is converging with related paradigms in masked modeling and discrete diffusion (e.g., ARDM and Masked Diffusion Models, as shown in (Zheng et al., 4 Sep 2024) and (Hoogeboom et al., 2021)), a convergence that frames joint training and efficient, parallel sampling over arbitrary maskings as maximum likelihood estimation.

Further research directions involve:

  • Learning adaptive or instance-wise order-policies for structured data domains,
  • Integration with architectural innovations (e.g., Transformers with feature identity encoding, as in DEformer (Alcorn et al., 2021)),
  • Exploiting prior knowledge about inference/query distributions for optimal resource allocation (see OA++ (Voisin et al., 2017)),
  • Extending OA-ARMs to continuous or hybrid domains and compositional multi-modal inference.

OA-ARMs have shifted from theoretical tools for universal density estimation to scalable, efficient models that underpin leading approaches in conditional generative modeling, masked representation learning, and flexible Bayesian inference across disciplines. The development of protocol-driven, usage-weighted training strategies marks a robust foundation for their practical, high-dimensional deployment.
