Any-Order Any-Subset AR Modeling (A³)
- Any-Order Any-Subset Autoregressive Modeling (A³) is a flexible probabilistic framework that factorizes joint distributions over arbitrary orders and subsets, surpassing fixed sequential models.
- It supports versatile applications including masked language modeling, image infilling, bidirectional reasoning, and parallel decoding, effectively bridging AR and diffusion methodologies.
- The MAC protocol minimizes redundancy by training a minimal set of conditionals, enhancing model efficiency and scalability for complex generative tasks.
Any-Order Any-Subset Autoregressive Modeling (A³) is a generalized probabilistic modeling framework that extends classical autoregression to support flexible generation, inference, and conditional density estimation over arbitrary variable groups and orderings. By moving beyond the fixed sequential decomposition of standard autoregressive models, A³ enables powerful applications such as masked language modeling, image infilling, bidirectional reasoning, and parallel decoding, while preserving the rigorous dependency modeling and tractability of traditional AR models. This approach subsumes left-to-right autoregression, permutation-invariant masked models, and two-group diffusion-style factorization as special cases, and connects to a broad spectrum of recent advances in language and vision generative modeling.
1. Generalized Autoregressive Factorizations: Model Definition
A³ models the joint distribution by factorizing it over arbitrary orderings and partitions of variables. The core idea is to allow both:
- Any order: The ordering of variables (tokens, pixels, time steps) to be arbitrary rather than fixed.
- Any subset (group): At each factorization step, any subset of the remaining variables may be handled jointly.
Formally, given a sequence $x = (x_1, \dots, x_n)$ and a partition into $K$ arbitrarily ordered groups $g_1, \dots, g_K$,
$$p(x) = \prod_{k=1}^{K} p\big(x_{g_k} \mid x_{g_{<k}}\big),$$
where $x_{g_k}$ denotes the variables in group $g_k$, and $g_{<k} = g_1 \cup \cdots \cup g_{k-1}$ is the union of all preceding groups (Du et al., 19 Jan 2026).
Special cases include:
- Classical AR: $|g_k| = 1$ for all $k$ (token-by-token, left-to-right or other fixed order).
- Two-group (diffusion/masked) models: $K = 2$, e.g., predict all masked variables at once from their complement.
- Arbitrary subset AR: Any group sizes and orderings; includes blockwise, setwise, and permutation-based variants (Liu et al., 2024).
This factorization unifies diverse modeling paradigms while permitting arbitrary conditional inference and efficient likelihood computation for any mask pattern or prompt.
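As a concrete sketch of this factorization (with a hypothetical `cond_log_prob` oracle standing in for a trained model — all names below are illustrative, not from the cited papers), the same scoring routine handles any partition and any group order:

```python
import math

def grouped_log_prob(cond_log_prob, x, groups):
    """Evaluate log p(x) under the A3 factorization
    log p(x) = sum_k log p(x_{g_k} | x_{g_{<k}}),
    where `groups` is an ordered partition of the index set."""
    total = 0.0
    context = {}                                   # indices observed so far -> values
    for g in groups:
        target = {i: x[i] for i in g}
        total += cond_log_prob(target, dict(context))  # log p(x_g | x_{g_{<k}})
        context.update(target)
    return total

# Toy model: independent fair coins, so each conditional contributes
# log(0.5) per predicted variable regardless of context (hypothetical).
coin = lambda target, ctx: len(target) * math.log(0.5)
x = {0: 1, 1: 0}
print(grouped_log_prob(coin, x, [[0], [1]]))   # classical token-by-token AR
print(grouped_log_prob(coin, x, [[0, 1]]))     # single group, K = 1
```

Both calls recover the same joint log-probability, illustrating that the choice of partition and order only changes the decomposition, not the modeled joint.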
2. Training Objectives, Redundancy, and the MAC Protocol
Full Any-Order Training
Prior A³ formulations average the autoregressive loss over all permutations of variables,
$$\mathcal{L}_{\text{any-order}} = \mathbb{E}_{\sigma \sim \mathrm{Unif}(S_n)}\Big[-\sum_{t=1}^{n} \log p_\theta\big(x_{\sigma(t)} \mid x_{\sigma(<t)}\big)\Big],$$
with $\sigma$ a permutation and statistics accumulated over all possible orderings (Xue et al., 24 Jun 2025). From the discrete-diffusion/ELBO perspective, this becomes equivalent to training on all possible "unmasking" schedules (Xue et al., 24 Jun 2025).
Redundancy Issue and Minimal Conditionals
Modeling all orderings is highly redundant: the same joint is factorized in multiple, over-complete ways, leading to inefficiency and wasted capacity (Shih et al., 2022). For $n$ variables, the $n!$ distinct factorizations correspond to all maximal paths from the empty set to the full set in the Boolean lattice, yet every path yields the identical joint probability.
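This over-completeness can be checked directly on a toy joint: every ordering of univariate conditionals multiplies back to the same joint probability. The sketch below (all names hypothetical) builds an explicit three-variable distribution and verifies that all $3! = 6$ orderings agree:

```python
from itertools import product, permutations

# Toy joint over three binary variables (unnormalized weights, then normalized).
weights = {xs: 1 + xs[0] + 2 * xs[1] * xs[2] for xs in product([0, 1], repeat=3)}
Z = sum(weights.values())
joint = {xs: w / Z for xs, w in weights.items()}

def conditional(i, xi, fixed):
    """p(x_i = xi | x_j = v for (j, v) in fixed), computed by summation."""
    match = lambda xs: all(xs[j] == v for j, v in fixed.items())
    num = sum(p for xs, p in joint.items() if match(xs) and xs[i] == xi)
    den = sum(p for xs, p in joint.items() if match(xs))
    return num / den

x = (1, 0, 1)
for order in permutations(range(3)):      # all 3! = 6 factorization orders
    p, fixed = 1.0, {}
    for i in order:
        p *= conditional(i, x[i], fixed)
        fixed[i] = x[i]
    assert abs(p - joint[x]) < 1e-12       # every path recovers the same joint
```

Training a separate conditional for each edge of each path therefore spends capacity re-learning the same joint many times over, which is exactly the redundancy MAC removes.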
Mask-tuned Arbitrary Conditional (MAC)
The MAC protocol resolves this by training only a minimal, non-redundant subset of univariate conditionals:
- Represent the masks (subsets) as nodes in a Boolean lattice.
- Select for each nonempty node a single "decomposition" edge, typically removing the maximum element per a canonical ordering.
- This ensures unique conditionals suffice for full support of arbitrary conditional inference, dramatically reducing parameter redundancy (Shih et al., 2022).
The MAC objective re-weights the training loss for conditionals in proportion to their frequency during inference, aligning training with test-time usage patterns. The loss is
$$\mathcal{L}_{\mathrm{MAC}} = \mathbb{E}_{(i,\, S) \sim q}\big[-\log p_\theta(x_i \mid x_S)\big],$$
where $q$ is the frequency-weighted mask distribution (Shih et al., 2022).
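A minimal sketch of the lattice bookkeeping (bitmask-based; function names are illustrative, not the paper's code): each nonempty subset receives one canonical decomposition edge that removes its maximum element, so $2^n - 1$ univariate conditionals suffice instead of the $n \cdot 2^{n-1}$ possible ones:

```python
def canonical_edges(n):
    """For each nonempty subset S of {0..n-1} (encoded as a bitmask),
    choose one decomposition edge by removing the maximum element:
    S -> S \ {max S}. The associated univariate conditional is
    p(x_{max S} | x_{S \ {max S}})."""
    edges = {}
    for mask in range(1, 1 << n):
        i = mask.bit_length() - 1          # index of the maximum element of S
        edges[mask] = (i, mask & ~(1 << i))
    return edges

def log_prob_via_edges(full_mask, log_cond, edges):
    """Chain canonical edges down the Boolean lattice to score x_S."""
    total, mask = 0.0, full_mask
    while mask:
        i, parent = edges[mask]
        total += log_cond(i, parent)       # log p(x_i | x_parent)
        mask = parent
    return total

edges = canonical_edges(3)
# 2^3 - 1 = 7 unique conditionals suffice, versus 3 * 2^2 = 12 possible
# univariate conditionals (one per variable per context excluding it).
print(len(edges))
```

The canonical choice "remove the maximum element" is one fixed ordering; any consistent tie-breaking rule yields the same non-redundancy guarantee.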
3. Architectural Realizations and Parallel Decoding
Two-Stream and Fully Masked Transformers
A³ factorization is realized in practice via specialized attention architectures:
- Two-stream attention: Used for structured multi-group prediction. Each layer maintains content and query streams, with content attending up to and within the current group, and query streams restricted to prior groups only. This mechanism enforces the correct probabilistic dependencies and supports groupwise prediction (Du et al., 19 Jan 2026).
- Fully Masked Transformer (FMT): For setwise AR (SAR), the encoder and decoder employ blockwise causal masks, enabling variable ordering and group sizes (blocks) (Liu et al., 2024).
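The two mask patterns just described can be sketched directly as boolean matrices (a simplified illustration with assumed names; real implementations typically use additive float masks on attention logits):

```python
def setwise_masks(group_ids):
    """Boolean attention masks for setwise AR / two-stream attention.
    group_ids[i] is the generation-order group index of position i.
    content[q][k]: the content stream attends up to and within its own group.
    query[q][k]:   the query stream attends to strictly prior groups only."""
    n = len(group_ids)
    content = [[group_ids[k] <= group_ids[q] for k in range(n)] for q in range(n)]
    query = [[group_ids[k] < group_ids[q] for k in range(n)] for q in range(n)]
    return content, query

# Six tokens in three ordered groups: positions 0-1, 2-4, then 5.
content, query = setwise_masks([0, 0, 1, 1, 1, 2])
```

With singleton groups (`group_ids = [0, 1, 2, ...]`), `content` reduces to the standard lower-triangular causal mask, recovering classical AR as a special case.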
Parallel and Dynamic Decoding
A³ enables:
- Parallel groupwise sampling: Predict all tokens of a group jointly in each decode step, as in SAR blockwise or any-subset ARMs.
- Dynamic/Adaptive decoding: At each step, unfinished positions can be adaptively grouped by highest confidence or lowest entropy for parallel update ("dynamic resampling") (Du et al., 19 Jan 2026).
- ASSD (Any-Subset Speculative Decoding): A provably correct parallel decoding algorithm that drafts tokens in parallel, evaluates their joint probability, then accepts or resamples, guaranteeing generation from the correct target distribution, with the number of model calls bounded by the number of tokens generated (Guo et al., 29 Apr 2025).
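The confidence-based selection step of dynamic/adaptive decoding can be sketched as follows (interface and names are assumptions for illustration, not the papers' code):

```python
import numpy as np

def pick_parallel_positions(logits, unfinished, k):
    """Dynamic/adaptive grouping sketch: choose the k unfinished positions
    with the lowest predictive entropy (highest confidence) to decode in
    parallel at this step. logits: (seq_len, vocab_size) array;
    unfinished: boolean array marking positions not yet generated."""
    z = logits - logits.max(axis=-1, keepdims=True)    # numerically stable softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    entropy = np.where(unfinished, entropy, np.inf)    # never reselect finished slots
    return np.argsort(entropy)[:k]
```

Swapping the entropy criterion for top-probability confidence, or making `k` adaptive per step, gives the other scheduling variants mentioned above.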
The architectural realization (encoder-only, decoder-only, hybrid) controls computational efficiency and the size of the conditional-distribution space to be modeled. Decoder-only A³ with KV-cache and efficient sampling achieves throughput advantages upwards of 20× over encoder-only masked diffusion LMs (Xue et al., 24 Jun 2025).
4. Practical Algorithms and Training Strategies
Curriculum and Progressive Adaptation
Efficient A³ training typically involves a progressive schedule:
- Singleton groups/AR: Begin with left-to-right (L2R) AR factorization (groups of size one), enforcing standard AR masks.
- Block/group expansion: Gradually increase group sizes, partitioning the sequence either contiguously or randomly.
- Order permutation: Finally, randomize group order and intra-group membership to induce the full any-order, any-subset capacity (Du et al., 19 Jan 2026).
At all stages, the loss remains cross-entropy on true tokens, avoiding mask-based inefficiency.
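The three-stage schedule above might be sketched as follows (a simplified illustration; actual stage boundaries and group-size distributions are design choices not specified in the source):

```python
import random

def curriculum_partition(n, stage, max_group=8):
    """Progressive A3 curriculum (sketch): stage 0 reproduces left-to-right
    AR (singleton groups, identity order); stage 1 grows contiguous blocks
    of random size; stage 2 additionally permutes group membership and
    group order, yielding the full any-order, any-subset regime."""
    if stage == 0:
        return [[i] for i in range(n)]
    idx = list(range(n))
    if stage >= 2:
        random.shuffle(idx)                # randomize intra-group membership
    groups, i = [], 0
    while i < n:
        size = random.randint(1, max_group)
        groups.append(idx[i:i + size])
        i += size
    if stage >= 2:
        random.shuffle(groups)             # randomize group order
    return groups
```

Each returned partition plugs directly into the grouped cross-entropy loss: predict the tokens of each group from the union of all preceding groups.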
Set Autoregressive Modeling (SAR) and Limit Cases
SAR extends A³ by allowing partition into blocks/sets of arbitrary size in any order:
- Setwise AR loss: $\mathcal{L}_{\mathrm{SAR}} = -\sum_{k=1}^{K} \log p_\theta\big(Y_k \mid Y_{<k}\big)$, where $Y_k$ is the $k$-th set in generation order.
- Special cases: $K = n$ (classical AR), $K = 1$ (BERT-like or MAR), and general $K$ with partition randomness for intermediate regimes (Liu et al., 2024).
Choosing $K$ and the degree of partition randomness trades off sample quality (FID), speed, flexibility, and generalization to unseen orderings or mask patterns.
Pseudocode and Inference (Sketch)
Training and inference routines for A³:
```python
# Training: one A3 step over a random grouped factorization
partition = random_partition(sequence_length, group_size)
order = random_permutation(num_groups)
for k in range(num_groups):
    context = tokens_in_groups(order[:k])
    targets = tokens_in_group(order[k])
    # Predict targets from context
    loss += cross_entropy(model(context), targets)

# Inference: groupwise decoding until all tokens are generated
while unfinished_tokens:
    context = observed_tokens
    candidates = select_next_group_or_subset(unfinished_tokens)
    logits = model(context)
    sampled = sample_from_logits(logits, candidates)
    add_to_observed(sampled)
```
5. Empirical Performance and Benchmarks
A³ consistently matches or outperforms discrete and continuous diffusion-based models for arbitrary conditional generation, infilling, and reasoning, while maintaining AR-like stability and scalability. Representative results:
Language and Reasoning Tasks
On QA, commonsense reasoning, and story infilling (LLaMA-8B, 2B tokens):

| Model | TriviaQA | HellaSwag | Winogrande | PIQA | ROCStories ROUGE-1/2/L |
|----------------|----------|-----------|------------|------|------------------------|
| LLaMA-AR | 52.1 | 76.0 | 63.9 | 80.3 | 11.7 / 2.3 / 10.5 |
| Dream-7B (DDM) | 18.3 | 26.9 | 51.8 | 55.8 | 11.7 / 2.3 / 10.5 |
| DiffuLlama-7B | 18.5 | 58.7 | 56.4 | 63.3 | 23.3 / 5.5 / 21.2 |
| A³-8B | 19.4 | 58.4 | 60.2 | 78.1 | 19.2 / 4.6 / 18.6 |
Performance scales with model size per typical AR scaling laws; A³ closes the gap to AR on reasoning, while outperforming diffusion models on infilling.
Image Generation
On ImageNet-256:
- FMT-XL SAR (random-16-random): FID ≈ 4.01, IS ≈ 250.3 in 64 steps, generalizes across orders and step budgets (Liu et al., 2024).
- Classical AR: Best FID ≈ 2.76, but requires 4,096 sequential steps.
- MAR: Lower FID (approx. 1.55) but cannot use KV-cache.
Conditional Likelihoods
On Text8, CIFAR-10, and ImageNet32 (bits per dimension, lower is better):

| Model | Text8 (joint / marginal bpd) | CIFAR-10 (joint / marginal bpd) | ImageNet32 (joint / marginal bpd) |
|--------------|------------------------------|---------------------------------|-----------------------------------|
| ARDM | 1.48 / 1.12 | 2.86 / 1.84 | 3.60 / 2.10 |
| MAC (A³) | 1.40 / 1.09 | 2.81 / 1.81 | 3.58 / 2.08 |
Parallel Sampling Speed
ASSD parallel decoding reduces average forward calls and wall time by ≈49% for infilling tasks without loss of sample quality (Guo et al., 29 Apr 2025). Decoder-only A³ with KV-cache achieves a ∼25× speedup over encoder-only masked diffusion, with a modest perplexity gap erased by ensemble over context orders or temperature tuning (Xue et al., 24 Jun 2025).
6. Theoretical Properties, Limitations, and Connections
A³ provides:
- Complete probabilistic rigor: All conditional and marginal probabilities are jointly consistent, supporting arbitrary pattern inference and generation.
- Flexible conditioning: Bidirectional and nonsequential reasoning are tractable via mask-based decomposition and dynamic inference strategies.
- Scalable architectures: With minimal redundancy (cf. MAC), efficient implementation is possible for high dimensions.
- Provable correctness: Speculative decoding (ASSD) for any-subset AR is theoretically justified: the algorithm outputs joint samples from the correct distribution with no more network calls than tokens predicted (Guo et al., 29 Apr 2025).
- Unified paradigm: A³ is mathematically equivalent to the discrete-time ELBO of masked diffusion models under arbitrary factor orderings and conditioning (Xue et al., 24 Jun 2025).
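A sketch of why the equivalence holds (notation follows the any-order loss above): at step $t$ of a uniformly random ordering, the conditioning set is a uniform subset of size $t-1$ and the target is a uniform element of its complement, so the any-order loss regroups into a mask-based objective,

```latex
\mathbb{E}_{\sigma \sim \mathrm{Unif}(S_n)}\Big[-\sum_{t=1}^{n}\log p_\theta\big(x_{\sigma(t)}\mid x_{\sigma(<t)}\big)\Big]
= \sum_{t=1}^{n}\; \mathbb{E}_{S \sim \mathrm{Unif}\binom{[n]}{t-1}}\Big[\frac{1}{n-t+1}\sum_{i \notin S} -\log p_\theta\big(x_i \mid x_S\big)\Big],
```

which matches the discrete-time ELBO of an absorbing (masked) diffusion model up to its per-step weighting over mask sizes.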
A key limitation is the combinatorial growth of distinct conditional distributions to be modeled in decoder-only architectures; encoder-only and MAC protocols address this via parameter sharing and careful objective design. The informativeness and difficulty of various permutations (orderings) present additional optimization considerations—weighting by usage frequency (as in MAC) and curriculum adaptation improve effective capacity alignment during training (Shih et al., 2022, Xue et al., 24 Jun 2025).
7. Extensions, Historical Roots, and Related Concepts
A³ subsumes classical subset/partial autocorrelation AR models in time series, as in the partial autocorrelation parameterization of subset autoregression (McLeod et al., 2016), generalizing to arbitrary lag sets and providing a statistically efficient likelihood maximization strategy for high-order series.
Recent developments demonstrate that structured multi-group AR (A³) unifies the scaling efficiency, flexibility, and sample quality of advanced diffusion and AR architectures, while supporting novel use cases—such as infilling, bidirectional conditioning, and semi-parallel decoding—across text, images, and continuous domains (Du et al., 19 Jan 2026, Liu et al., 2024, Guo et al., 29 Apr 2025).
The practical impact of A³ includes advancements in modeling arbitrary subsets of variables, principled speculative decoding for parallelism, and the development of hybrid AR-diffusion foundation models for vision and language, offering a rigorous alternative to pure diffusion-based approaches and opening new directions for efficient, flexible generative modeling.