Expanding Autoregressive Representation (EAR)

Updated 26 November 2025

Expanding Autoregressive Representation (EAR) is a deep learning framework that systematically enhances autoregressive models by expanding the generative, predictive, or receptive field in structured ways.
It employs center-outward spiral expansion and length-adaptive decoding to prioritize central details and improve efficiency in tasks like image synthesis.
EAR also extends to heterogeneous inference and convolutional architectures through ARMA layers and custom masking, yielding robust performance in diverse applications.

Expanding Autoregressive Representation (EAR) encompasses several methodologies within deep learning that systematically enhance autoregressive modeling by expanding the generative, predictive, or receptive field in a structured or adaptive manner. Notably, EAR design patterns have been instantiated in visual generation for scalable image modeling as well as in dense prediction networks and in general heterogeneous inference. Core variants include spiral expansion for visual transformers (Yang et al., 19 Nov 2025), ARMA-based receptive field expansion in convolutional networks (Su et al., 2020), and the extended autoregressive approach for arbitrary-variable inference (Zhou et al., 2018).

1. Center-Outward Spiral Expansion in Visual Autoregressive Modeling

EAR for visual generation establishes a center-outward, spiral factorization of an image’s token grid for autoregressive models. For an $n \times n$ tokenized image (e.g., via VQ-VAE), tokens are mapped using a bijection

$\mathrm{spiral}\colon\{1,\dots,n\}^2\to\{1,\dots,T\},\ T=n^2$

where the spiral proceeds in alternating directions (right, down, left, up), with increasing step lengths, starting from the image center. The joint probability of the quantized tokens in spiral order,

$p(c_{1:T}) = \prod_{t=1}^T p(c_t | c_{<t}),$

benefits from locality and spatial continuity, as tokens generated in each step are spatially adjacent.

This topology emulates human foveal perception, prioritizing spatially central, perceptually salient regions first. As the generation expands outward, the model can leverage well-formed contextual embeddings for rendering peripheral content. EAR contrasts with raster-scan or fixed block-wise orderings, which tend to induce artifact-prone boundaries and less perceptually aligned progressive synthesis (Yang et al., 19 Nov 2025).
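The spiral bijection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `spiral_order`, the choice of starting cell `((n - 1) // 2, (n - 1) // 2)` for even `n`, and the rule of skipping out-of-grid cells are all assumptions made for the example.

```python
def spiral_order(n):
    """Return all (row, col) cells of an n x n grid in center-outward spiral order.

    Assumption: the walk starts at ((n - 1) // 2, (n - 1) // 2) and cells that
    fall outside the grid are skipped, so even n is handled too.
    """
    r = c = (n - 1) // 2
    order = [(r, c)]
    moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # right, down, left, up
    step, d = 1, 0
    while len(order) < n * n:
        for _ in range(2):  # each run length is used for two consecutive directions
            dr, dc = moves[d % 4]
            for _ in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < n and 0 <= c < n:
                    order.append((r, c))
            d += 1
        step += 1  # run lengths grow as 1, 1, 2, 2, 3, 3, ...
    return order


# Position t (1-based) in the sequence maps to grid cell order[t - 1].
order = spiral_order(16)
```

For a 16 × 16 token grid this yields 256 distinct cells, and each cell is a direct neighbor of the one before it, which is the spatial-continuity property the factorization relies on.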

2. Length-Adaptive Decoding for Efficient Generation

Rather than a fixed, single-token-per-step approach, EAR introduces a length-adaptive decoding schedule. For $K$ steps, the number of tokens emitted at step $k$ is $l_k$, satisfying $\sum_{k=1}^K l_k = T$. Typical settings include:

16-step: $l_k = 2k-1$ for $k=1,...,16$
31-step: $l_k = \lceil k/2 \rceil$ for $k=1,...,31$

Early steps produce few tokens (preserving quality for central regions), while later steps accelerate by predicting more tokens, thereby improving throughput without sacrificing core perceptual fidelity. Generation proceeds by appending masked tokens for the next $l_k$ positions, running a single forward pass of the transformer, and decoding all at once. This scheduling naturally aligns computational effort to perceptual importance (Yang et al., 19 Nov 2025).
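As a quick check of the two schedules listed above, the sketch below computes each $l_k$ and splits a sequence of token positions into per-step chunks. The helper names are hypothetical, and $T = 256$ (a 16 × 16 token grid) is an assumption chosen because both schedules sum to it.

```python
import math


def schedule_16():
    """l_k = 2k - 1 for k = 1..16."""
    return [2 * k - 1 for k in range(1, 17)]


def schedule_31():
    """l_k = ceil(k / 2) for k = 1..31."""
    return [math.ceil(k / 2) for k in range(1, 32)]


def split_into_steps(positions, lengths):
    """Split an ordered list of token positions into one chunk per decoding step."""
    if sum(lengths) != len(positions):
        raise ValueError("schedule must cover every token exactly once")
    chunks, start = [], 0
    for length in lengths:
        chunks.append(positions[start:start + length])
        start += length
    return chunks


T = 256  # assumed token count (16 x 16 grid)
chunks_16 = split_into_steps(list(range(T)), schedule_16())
chunks_31 = split_into_steps(list(range(T)), schedule_31())
```

Both schedules start with a single token and grow from there, so the first chunks cover only the innermost positions of the spiral while the last chunks cover the longest outer runs.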

3. Model Architecture, Causal Masking, and Inference

EAR models employ a decoder-only transformer backbone (LlamaGen-derived), incorporating rotary positional encodings (RoPE) and, optionally, adaptive layer normalization (AdaLN). The input sequence at each iteration concatenates a class token, the sequence of previously generated (ground-truth during training) tokens in spiral order, and repeated learnable mask tokens.

Each decoding step uses a custom causal mask:

Mask tokens can attend to all previously generated (ground-truth) tokens and to each other.
Each token’s context is precisely the tokens preceding it in spiral order.
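One possible reading of these two rules is sketched below as a boolean attention mask for a single decoding step. It is an illustrative assumption, not the published masking code: the function name, the sequence layout (class token, then known tokens, then mask tokens), and the choice to let known tokens attend to themselves are all hypothetical.

```python
import numpy as np


def step_attention_mask(num_known, num_mask):
    """Build allowed[i, j] == True when position i may attend to position j.

    Assumed layout: [class token] + num_known known tokens + num_mask mask tokens.
    """
    prefix = 1 + num_known
    total = prefix + num_mask
    allowed = np.zeros((total, total), dtype=bool)
    # Class and known tokens: causal attention over the spiral-ordered prefix.
    allowed[:prefix, :prefix] = np.tril(np.ones((prefix, prefix), dtype=bool))
    # Mask tokens: attend to the whole prefix and to every mask token in the block.
    allowed[prefix:, :] = True
    return allowed


mask = step_attention_mask(num_known=4, num_mask=3)
```

Under this layout the known tokens never attend to mask tokens, so their keys and values do not depend on the current block and can be reused from one step to the next.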

A key-value cache shares computation across decoding steps, which reduces redundant work and keeps the cost of each step nearly constant.

During training, the cross-entropy loss is computed over masked positions after each expansion step. Optimization uses AdamW, standard VQ-VAE and transformer augmentations, and no explicit curriculum beyond the fixed spiral order (Yang et al., 19 Nov 2025).

4. Quantitative Performance and Comparative Evaluation

EAR achieves state-of-the-art fidelity-efficiency trade-offs among single-scale autoregressive models. Representative results on ImageNet 256×256 with VQ-VAE tokenization, using 31 decoding steps:

Model	FID ↓	IS ↑	Params	Steps	Time (s)	GFLOPs
EAR-B (98M)	4.64	218.7	98M	31	0.36	26.0
EAR-L (326M)	3.06	261.4	326M	31	0.69	93.5
EAR-XL (754M)	2.75	275.1	754M	31	1.03	220.6
EAR-XL (AdaLN)	2.54	262.7	1.1B	31	1.37	243.1

Comparisons: LlamaGen-XL (775M, 256 steps) achieves FID 3.39 at 8.08 s; VAR-d20 (600M, 10 steps) achieves FID 2.57 at 0.50 s, but at higher compute; DiT-XL/2 (675M, 250 steps) obtains FID 2.27 at 55.7 s. EAR-XL attains FID ≈ 2.75 in ≈1 s, significantly improving the time–quality trade-off for single-scale models (Yang et al., 19 Nov 2025).

Further, ablation studies demonstrate that more decoding steps (finer-grained expansion) marginally improve FID, while the unified mask token outperforms class-derived variants. Qualitatively, EAR yields central details earlier, maintains perceptual alignment during extension tasks, and minimizes edge artifacts (Yang et al., 19 Nov 2025).

5. EAR in Receptive Field Expansion: The ARMA Layer

In dense prediction and convolutional architectures, Expanding Autoregressive Representation is instantiated by the ARMA layer. At each layer, outputs depend both on a moving-average (MA) convolution of the inputs and an autoregressive (AR) feedback over output locations. Formally, in 2D:

$A * Y = W * X,$

where $W$ is the MA convolution kernel, $A$ is the AR feedback kernel, $X$ is the input map, and $Y$ is the output. The effective receptive field (ERF) is provably expanded via nonzero AR coefficients:

$r_\text{ARMA}^2 = \sum_{\ell=1}^L \left( d_\ell^2 \frac{K_\ell^2 - 1}{12} + \frac{a_\ell}{(1 - a_\ell)^2} \right).$

As $a_\ell \to 1$, $r_\text{ARMA} \to \infty$, enabling the network to learn global context dynamically, in contrast with the sublinear ERF scaling in plain CNNs. Numerical stability is achieved through a constrained reparameterization of the AR kernel (e.g., $f_{\pm1}=(\tanh \beta)/\sqrt{2} \pm (\tanh \gamma)/\sqrt{2}$), which guarantees bounded-input bounded-output properties for any learned parameter values.
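The effect of the AR term can be illustrated numerically. The sketch below evaluates the ERF expression above and runs a one-dimensional, first-order, causal simplification of $A * Y = W * X$ with $A = [1, -a]$. Both function names are hypothetical, and the 1D causal form is an assumption made for brevity; the layer described above operates in 2D.

```python
import numpy as np


def arma_erf_radius_sq(dilations, kernel_sizes, ar_coeffs):
    """Evaluate r_ARMA^2 = sum_l (d_l^2 (K_l^2 - 1) / 12 + a_l / (1 - a_l)^2)."""
    return sum(
        d ** 2 * (k ** 2 - 1) / 12 + a / (1 - a) ** 2
        for d, k, a in zip(dilations, kernel_sizes, ar_coeffs)
    )


def arma_1d(x, w, a):
    """Solve y[t] - a * y[t-1] = (w * x)[t] with zero initial state (1D sketch)."""
    ma = np.convolve(np.asarray(x, dtype=float), np.asarray(w, dtype=float), mode="same")
    y = np.zeros_like(ma)
    prev = 0.0
    for t, m in enumerate(ma):
        prev = m + a * prev  # autoregressive feedback on the previous output
        y[t] = prev
    return y


# A unit impulse through an identity MA kernel decays geometrically as a**t.
impulse = np.zeros(8)
impulse[0] = 1.0
response = arma_1d(impulse, [1.0], a=0.5)

r2_plain = arma_erf_radius_sq([1], [3], [0.0])  # single 3-tap layer, no AR term
r2_arma = arma_erf_radius_sq([1], [3], [0.5])   # same layer with a = 0.5
```

With a single undilated 3-tap layer, setting $a = 0.5$ adds $0.5 / 0.25 = 2$ to the squared radius, and the impulse response reaches well beyond the support of the MA kernel.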

Empirically, ARMA layers in ConvLSTM and U-Net backbones outperform dilated and non-local alternatives for video prediction (Moving-MNIST) and semantic segmentation (ISIC 2018), with AR coefficients learned adaptively by task. For instance, the ARMA-LSTM (0.893M params) achieves PSNR 19.72 and SSIM 0.904 on Moving-MNIST-2, surpassing non-ARMA models with fewer parameters (Su et al., 2020).

6. Extended Autoregressive Models for Heterogeneous Inference

The EAR paradigm is generalized for heterogeneous inference with the “extended autoregressive model,” in which no fixed order is imposed and inference is required over arbitrary subsets of latent and observed variables. Instead of factorizing $p(x)=\prod_i p(x_i|x_{<i})$, the EAR architecture encodes which variables are observed (masking unobserved variables to zero) and predicts the posterior marginal of each latent variable in parallel:

$\hat{X} = \text{EAR}(o; \theta),\quad o = \text{Mask} \odot X.$

The model comprises shared hidden layers with per-variable output “heads,” trained via a composite $\ell_2$ reconstruction and stability loss with $\ell_1$ regularization:

$L(\theta) = \alpha \| \text{Mask} \odot (X - \hat{X}) \|_2^2 + \beta \| (1 - \text{Mask}) \odot (X - \hat{X}) \|_2^2 + \gamma \|\theta\|_1.$
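To make the terms of this objective concrete, the following sketch evaluates the masked input and the loss with NumPy on a toy example. The function names, the default weights, and the flat parameter vector `theta` are assumptions for illustration only; the network that produces $\hat{X}$ is omitted.

```python
import numpy as np


def masked_input(x, mask):
    """o = Mask ⊙ X: unobserved variables are zeroed out."""
    return mask * x


def ear_loss(x, x_hat, mask, theta, alpha=1.0, beta=1.0, gamma=0.0):
    """Composite loss: observed-part L2 + unobserved-part L2 + L1 on parameters.

    The default weights are placeholders, not values from the source.
    """
    diff = x - x_hat
    observed_term = np.sum((mask * diff) ** 2)
    latent_term = np.sum(((1 - mask) * diff) ** 2)
    l1_term = np.sum(np.abs(theta))
    return alpha * observed_term + beta * latent_term + gamma * l1_term


x = np.array([1.0, 2.0, 3.0])
mask = np.array([1.0, 0.0, 1.0])   # variables 1 and 3 observed, variable 2 latent
x_hat = np.array([1.0, 1.0, 1.0])  # stand-in for a model prediction
theta = np.array([0.5, -0.5])      # stand-in for the model parameters

o = masked_input(x, mask)
loss = ear_loss(x, x_hat, mask, theta, alpha=1.0, beta=2.0, gamma=0.1)
```

In this toy case the observed term is 4, the latent term is 1, and the $\ell_1$ norm is 1, so the loss is $1 \cdot 4 + 2 \cdot 1 + 0.1 \cdot 1 = 6.1$.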

Theoretical results establish that this ensemble of discriminative predictors recovers the classical joint factorization as a special case. Empirical benchmarks on real-world Bayesian network datasets (e.g., Alarm, Asia, Child, Insurance, Survey, Win95pts) show that EAR attains the lowest absolute deviation and KL divergence and highest state-classification accuracy compared to RBM, WGAN, CGAN, VAE, and CVAE baselines. The adversarial extension (EARA) introduces a min–max game with a discriminator over variable vectors, slightly decreasing stability due to gradient conflicts compared to the nonadversarial EAR (Zhou et al., 2018).

7. Synthesis and Impact

Across instantiations, Expanding Autoregressive Representation encompasses:

Center-outward spiral autoregressive decoding for spatially coherent, efficient image synthesis, yielding superior time–fidelity trade-offs by systematically matching perceptual structure (Yang et al., 19 Nov 2025).
ARMA-based expansion of receptive fields for adaptive, task-optimal context aggregation in convolutional networks, with provable stability and efficient learning (Su et al., 2020).
Flexible frameworks for heterogeneous inference that eliminate a fixed variable ordering, scale to large graphs, and empirically outperform established generative models on structured inference tasks (Zhou et al., 2018).

EAR implementations share the capacity to adapt locality and global structure dynamically and to balance efficiency, stability, and expressivity. They provide both theoretical justification (via factorization or receptive field analysis) and strong empirical validation, demonstrating robustness across visual generation, dense prediction, and probabilistic inference domains.