Neighboring Autoregressive Modeling (NAR)

Updated 17 June 2026

Neighboring Autoregressive Modeling (NAR) is a sequence generation paradigm that leverages spatial, temporal, and network neighbor relations to improve efficiency and quality.
It employs a novel factorization strategy that predicts groups of tokens in parallel based on proximity, reducing inference complexity and boosting throughput.
NAR is applied in diverse domains such as visual outpainting, network time series, and language tasks, combining local dependency modeling with hybrid AR elements for robust performance.

Neighboring Autoregressive Modeling (NAR) encompasses a set of paradigms for sequence modeling and generation in diverse modalities—text, vision, speech, and networked systems—where the autoregressive dependencies are restructured to exploit locality, parallelism, and explicit neighbor relations. In contrast to standard left-to-right (chain rule) autoregressive models and fully independent (non-autoregressive) approaches, NAR leverages the structural or spatial-temporal neighborhood of tokens or nodes to improve efficiency, robustness, and sometimes output quality.

1. Conceptual Foundations and Probabilistic Factorization

Neighboring Autoregressive Modeling (NAR) operates at the interface between two extremes: fully autoregressive (AR) models, which factorize the joint distribution as $p_{AR}(x_1,\ldots,x_n) = \prod_{i=1}^n p(x_i|x_{<i})$, and non-autoregressive models (which the literature also abbreviates as NAR), which factorize as $p_{NAR}(x_1,\ldots,x_n)=\prod_{i=1}^n p(x_i)$ or, more generally, treat outputs as conditionally independent given the input (e.g., $p(y_t|x)$ in sequence-to-sequence tasks) (Ai et al., 25 Sep 2025, Ren et al., 2020, Xiao et al., 2022). Neighboring autoregressive models introduce dependency structures based on proximity or explicit network relations, restructuring the decomposition so that tokens or nodes are predicted from their spatial, temporal, or structural neighbors.

For example, in visual generation, NAR factorization is performed by grouping tokens according to ascending Manhattan distance from a seed, so that all tokens at distance $d$ are predicted in parallel, conditioned on those at distances $<d$ (He et al., 12 Mar 2025). In networked time series, NAR parameterizes the evolution of node $i$ as a function of its own past and pasts of its explicit neighbors, with flexible weighting (Yin et al., 2021, Chen et al., 2020).

2. Visual NAR: Locality-Preserving Outpainting

In visual autoregressive modeling, vanilla approaches flatten images or videos into 1D sequences and apply next-token prediction left-to-right, top-to-bottom, which ignores spatial locality. Neighboring Autoregressive Modeling (NAR) for vision instead frames generation as outpainting: generation progresses from a seed token, expanding in shells of constant spatial (or spatiotemporal) Manhattan distance (He et al., 12 Mar 2025). At each step $d$, the set $S_d$ of tokens at distance $d$ from the seed is generated in parallel, conditioned only on the tokens in $S_{<d}$.

This factorization preserves exact autoregressive semantics:

$p(x_1,\ldots,x_n)=\prod_{d=0}^{D} p(S_d \mid S_{<d})$

where $S_d$ are token sets equidistant from the initial seed and $D$ is the largest such distance. NAR's inference complexity reduces from $n^2$ (for $n \times n$ images) to $2n-1$ forward passes. This is achieved via dimension-oriented decoding heads (horizontal, vertical, temporal), each responsible for distinct directions of outgrowth. When a token can be predicted from multiple directions, the outputs are ensembled by mixing logits. Empirical results on ImageNet-256 and UCF-101 show NAR surpasses both raster-order AR and parallel block-based auto-regressors (PAR-4X) in throughput (2.4–8.6×) and in FID/FVD scores (He et al., 12 Mar 2025).
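The shell schedule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `manhattan_shells` and the default top-left seed are assumptions made for the example. It groups the positions of a token grid by Manhattan distance from the seed, so that each group corresponds to one parallel decoding step.

```python
from collections import defaultdict


def manhattan_shells(height, width, seed=(0, 0)):
    """Group grid positions into shells of equal Manhattan distance from `seed`.

    Returns a list where entry d holds all (row, col) positions at distance d.
    Positions in one shell would be predicted together in a single step.
    """
    seed_row, seed_col = seed
    shells = defaultdict(list)
    for row in range(height):
        for col in range(width):
            distance = abs(row - seed_row) + abs(col - seed_col)
            shells[distance].append((row, col))
    # Distances on a full grid are contiguous, so sorting yields shells 0..D.
    return [shells[d] for d in sorted(shells)]


if __name__ == "__main__":
    n = 4
    schedule = manhattan_shells(n, n)
    for step, positions in enumerate(schedule):
        print(f"step {step}: {positions}")
    print("raster-order steps:", n * n)
    print("shell steps:", len(schedule))
```

With a corner seed the farthest token of an $n \times n$ grid sits at distance $2n-2$, which gives $2n-1$ shells; a seed placed elsewhere (an assumption not covered by the text above) would give fewer.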

3. NAR in Structured Sequence and Network Models

In networked dynamic systems, the term "Neighboring Autoregressive" or "Network Autoregressive" (network NAR) refers to models where each node's value depends both on its own historical states and those of directly connected nodes (Yin et al., 2021, Yin et al., 2024, Chen et al., 2020). For node $i$ at time $t$, the basic first-order form is:

$x_{i,t} = \beta_0 + \beta_1 \sum_{j} w_{ij}\, x_{j,t-1} + \beta_2\, x_{i,t-1} + \varepsilon_{i,t}$

where $w_{ij}$ are elements of a normalized adjacency matrix, $\beta_1$ and $\beta_2$ weight the neighbor and own-history terms, and $\varepsilon_{i,t}$ is a noise term. This approach generalizes classical vector autoregression (VAR) by encoding explicit neighbor relations, facilitating the modeling of spatial/temporal spillovers and heterogeneity.
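A minimal sketch of one update step follows, assuming a first-order model in which a node's next value is an intercept, plus a coefficient `beta1` times the weighted average of its neighbors' previous values, plus a coefficient `beta2` times its own previous value (noise omitted). The function names and the coefficient values are illustrative assumptions and are not taken from the cited papers.

```python
def row_normalize(adjacency):
    """Divide each row by its sum so neighbor weights sum to 1.

    Rows of isolated nodes (sum 0) are left as all zeros.
    """
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([a / total if total else 0.0 for a in row])
    return normalized


def nar_step(prev, weights, beta0, beta1, beta2):
    """One noise-free step: intercept + neighbor average + own lag."""
    nxt = []
    for i, w_row in enumerate(weights):
        neighbor_avg = sum(w * x for w, x in zip(w_row, prev))
        nxt.append(beta0 + beta1 * neighbor_avg + beta2 * prev[i])
    return nxt


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 (hypothetical toy network).
    adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    weights = row_normalize(adjacency)
    state = [1.0, 2.0, 4.0]
    print(nar_step(state, weights, beta0=0.0, beta1=0.5, beta2=0.25))
```

Node 1 has two neighbors, so each contributes with weight 0.5; the end nodes have a single neighbor with weight 1.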

Extensions include Community NAR (CNAR), wherein block structures capture intra- and inter-community effects (Chen et al., 2020), and Functional Coefficient NAR (FCNAR), allowing coefficients to vary nonlinearly with regime variables (Yin et al., 2024). Stationarity is analyzed via spectral radius conditions on companion matrices; estimation frameworks include OLS, GLS, ridge, and two-stage weighted methods, with rigorous theoretical guarantees (Yin et al., 2021, Chen et al., 2020).
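For the simplest first-order case, the spectral radius condition reduces to a condition on two coefficients. The derivation below is a sketch under stated assumptions: the model is written in vector form with neighbor coefficient $\beta_1$, own-lag coefficient $\beta_2$, and a weight matrix $W$ with $w_{ij} \ge 0$ and every row summing to one. The extensions mentioned above need their own conditions.

```latex
\begin{aligned}
x_t &= \beta_0 \mathbf{1} + G\, x_{t-1} + \varepsilon_t,
  \qquad G = \beta_1 W + \beta_2 I, \\
\rho(G) &\le \lVert G \rVert_\infty
  = \max_i \sum_j \bigl\lvert \beta_1 w_{ij} + \beta_2 \delta_{ij} \bigr\rvert \\
  &\le \max_i \Bigl( \lvert \beta_1 \rvert \sum_j w_{ij} + \lvert \beta_2 \rvert \Bigr)
  = \lvert \beta_1 \rvert + \lvert \beta_2 \rvert .
\end{aligned}
```

Under these assumptions $\lvert\beta_1\rvert + \lvert\beta_2\rvert < 1$ is therefore sufficient for $\rho(G) < 1$, so shocks decay geometrically instead of accumulating.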

4. NAR for Efficient LLM Reasoning

In cognitive and language tasks requiring multi-step reasoning, a hybrid AR–NAR framework leverages the benefits of both paradigms. Specifically, "Parallel Thinking, Sequential Answering" (Ai et al., 25 Sep 2025) decouples the high-level plan (reasoning trace) from the final answer surface realization. The system operates as follows:

A Discrete Diffusion LLM (the NAR "Mercury Coder") generates an explicit intermediate trace $r$ in a sequence of $T$ parallel denoising steps:

$r_j^{(s-1)} \sim p_\theta\big(r_j \mid r^{(s)}, x\big)$

for all positions $j$ in parallel at each diffusion step $s = T, \ldots, 1$.

This plan $r$ is prepended to the problem prompt and provided to a powerful AR decoder (e.g., GPT-5), which then produces the final output $y$ token by token, sequentially maximizing $p(y_t \mid y_{<t}, x, r)$.
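The two-stage flow can be sketched as a small pipeline. This only illustrates the control flow: `plan_fn` and `answer_fn` are hypothetical callables standing in for the diffusion planner and the AR decoder, and the prompt layout (plan first, then the problem) is an assumption. The prompts and APIs of the cited work are not reproduced here.

```python
from typing import Callable


def think_then_answer(
    problem: str,
    plan_fn: Callable[[str], str],
    answer_fn: Callable[[str], str],
) -> str:
    """Generate a plan with one model, then condition a second model on it."""
    # Stage 1: the planner produces the whole reasoning trace.
    plan = plan_fn(problem)
    # The plan is prepended to the problem (layout is an assumption).
    prompt = f"Plan:\n{plan}\n\nProblem:\n{problem}"
    # Stage 2: the sequential decoder writes the final answer.
    return answer_fn(prompt)


# Toy stand-ins so the sketch runs without any model.
def toy_planner(problem: str) -> str:
    return "1. parse the numbers 2. add them"


def toy_decoder(prompt: str) -> str:
    return f"[answer conditioned on {len(prompt)} prompt characters]"


if __name__ == "__main__":
    print(think_then_answer("What is 2 + 3?", toy_planner, toy_decoder))
```

Keeping the two stages behind plain callables makes it easy to swap either model, or to time each stage separately when comparing against a pure AR baseline.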

The NAR-generated reasoning trace mitigates long-horizon planning errors by global parallel refinement, while the AR stage focuses on fine-grained correctness and fluency. This division yields a 26 percentage point lift in pass@1 on combined reasoning benchmarks and a 30–40% wall-clock reduction in generation time compared to pure AR (Ai et al., 25 Sep 2025).

5. NAR in Speech and Text: Parallel Sequence Generation

Neighboring and non-autoregressive paradigms dominate recent work in efficient sequence generation beyond vision and networks, notably in speech recognition and neural machine translation (NMT). Here, NAR factorization assumes conditional independence among (output) tokens given the input. Canonical approaches include:

CTC-based models (Connectionist Temporal Classification), which align input frames to target tokens under monotonicity, enabling parallel decoding (Higuchi et al., 2021, Gao et al., 2022).
Masked language model refinement (Mask-CTC, CMLM), where tokens are initially hypothesized in parallel (possibly with masks for uncertain positions), and filled via successive passes (Higuchi et al., 2021, Gong et al., 2022).
Iterative edit-based models (Levenshtein Transformer, NeighborEdit), where output hypotheses are refined in a small number of parallelized edit operations, often leveraging nearest-neighbor initialization in the latent space to guide NAR decoders and reduce iteration count without sacrificing output quality (Niwa et al., 2022).
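As a concrete instance of the CTC-based approach in the list above, greedy CTC decoding picks the most likely label for every frame independently (the parallel part) and then merges repeats and drops blanks. The sketch below is a plain-Python illustration; the blank index 0 and the toy scores are assumptions made for the example.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding: per-frame argmax, merge repeats, remove blanks."""
    # Each frame's argmax is independent of the others, so this step
    # could run for all frames at once.
    best = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for label in best:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


if __name__ == "__main__":
    # Seven frames over three classes; class 0 is the blank.
    toy_scores = [
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.6, 0.3],
        [0.1, 0.2, 0.7],
        [0.2, 0.1, 0.7],
        [0.8, 0.1, 0.1],
    ]
    print(ctc_greedy_decode(toy_scores))
```

The blank frame between the two runs of label 1 is what lets the decoder output the same label twice in a row.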

NAR's critical challenge is modeling target-side dependencies: substitution and deletion errors are common when dependencies are ignored. Techniques to address this include glancing language model (GLM) samplers, knowledge distillation from AR teachers, and auxiliary alignment constraints. For instance, the Paraformer architecture introduces a continuous integrate-and-fire predictor for length estimation and a GLM sampler for partial conditioning, attaining AR-level error rates at up to 12× speedup (Gao et al., 2022). Mask-CTC with AR-to-NAR knowledge distillation further reduces the gap, with sub-10× model size and moderate ∼1% absolute WER/CER loss (Gong et al., 2022).
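The length-estimation idea behind an integrate-and-fire predictor can be illustrated with scalars. The sketch below is a simplified stand-in and not Paraformer's implementation: it assumes per-frame weights `alphas` have already been predicted, accumulates them, and marks a token boundary whenever the running sum reaches a threshold. The function name and the default threshold of 1.0 are assumptions made for the example.

```python
def integrate_and_fire(alphas, threshold=1.0):
    """Return the frame indices at which a token boundary fires.

    The number of fired boundaries is the estimated output length.
    """
    fired = []
    accumulated = 0.0
    for frame, alpha in enumerate(alphas):
        accumulated += alpha
        # A loop (rather than a single check) also covers weights larger
        # than the threshold; any remainder carries over to later frames.
        while accumulated >= threshold:
            fired.append(frame)
            accumulated -= threshold
    return fired


if __name__ == "__main__":
    toy_alphas = [0.5, 0.5, 0.25, 0.75, 0.5]
    boundaries = integrate_and_fire(toy_alphas)
    print("boundaries:", boundaries)
    print("estimated length:", len(boundaries))
```

Because the output length is fixed before any token is decoded, all output positions can then be predicted in one parallel pass.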

6. Theoretical and Practical Properties, Limitations

NAR methods offer massive decoding speedup: single- or few-pass parallel decoders can outperform AR models in throughput by one or more orders of magnitude (He et al., 12 Mar 2025, Gao et al., 2022, Xiao et al., 2022). However, the independence assumption can degrade output quality when output tokens exhibit strong interdependence, as quantified by the attention density ratio $R$ in target-masked "CoMMA" analysis (Ren et al., 2020). In tasks with high token dependency (e.g., ASR), pure NAR models still lag AR models in accuracy unless supplemented with distillation, alignment, or auxiliary AR rescoring.

Structural NAR models (e.g., network time series) hinge on accurate specification of neighbor relations and careful control of stability (spectral radius). Empirically, these models yield improved predictive performance when local dependencies are strong and community or covariate heterogeneity is present (Yin et al., 2021, Chen et al., 2020).

In vision, locality-preserving NAR achieves near-optimal tradeoffs between efficiency and perceptual quality for image/video generation, whereas naive flattening-based AR or block-based NAR remains suboptimal (He et al., 12 Mar 2025).

7. Perspectives and Research Directions

Neighboring Autoregressive Modeling, broadly construed, continues to drive advances in high-efficiency sequence generation by bridging locality, dependency, and parallelism. Key frontiers include:

Extending NAR beyond strict chain- or raster-based orderings to arbitrary graphs and compositional structures, such as arbitrary attention masks and proximity constraints relevant to hierarchical or temporal data.
Further integrating NAR and AR paradigms, as in Mercury-style think–answer division, to exploit global parallel reasoning with local sequential fidelity (Ai et al., 25 Sep 2025).
Designing more expressive NAR variants capable of capturing target-side dependencies without iterative or AR components, possibly via richer latent variable formulations, alignment mechanisms, or diffusion processes (Xiao et al., 2022, Ai et al., 25 Sep 2025).
Formalizing information-theoretic and statistical tradeoffs in dependency, parallelism, and data regularity using frameworks like CoMMA (Ren et al., 2020).
Deploying NAR in domains beyond current focus areas, such as code, large-scale knowledge graphs, and symbolic reasoning, especially where neighborhood or graph structure is intrinsic.

A persistent challenge remains: achieving fully-parallel, high-quality NAR generation in settings with inherently strong output dependencies, without reliance on heavy AR teacher supervision or multi-stage inference (Xiao et al., 2022). This motivates continued research in model architecture, loss design, and hybrid inference strategies.

Key references: (He et al., 12 Mar 2025, Ai et al., 25 Sep 2025, Yin et al., 2021, Xiao et al., 2022, Ren et al., 2020, Gao et al., 2022, Niwa et al., 2022, Chen et al., 2020, Yin et al., 2024, Gong et al., 2022, Higuchi et al., 2021)