Parallel Plan-Aware AR (APAR)

Updated 23 March 2026

Parallel Plan-Aware AR (APAR) is a framework that fuses parallel plan generation with autoregressive decoding to accelerate LLM inference while preserving output coherence.
Depending on the variant, it uses discrete diffusion to draft a plan in parallel or hierarchical plan construction to segment outputs, reducing latency and resource usage (e.g., 27% KV-cache reduction, 2–4× speedup).
In the reasoning variant, reported chain-of-thought accuracy rises from 52% to 78%; the modular design is intended to support efficient and scalable LLM deployment.

Parallel Plan-Aware AR (APAR) refers to a suite of methodologies and architectures that integrate explicit parallel planning mechanisms with auto-regressive (AR) generation in LLMs, with the aim of achieving high-quality outputs at a fraction of the conventional AR decoding latency and resource cost. APAR emerges in two principal research lineages: (1) “Auto-diffusion + Autoregression” architectures for complex reasoning and chain-of-thought (CoT) tasks (Ai et al., 25 Sep 2025), and (2) “Auto-Parallel Auto-Regressive” decoding for general-purpose acceleration of LLM inference (Liu et al., 2024). Both paradigms exploit the observation that many textual outputs exhibit internal hierarchical or parallelizable structure that can be leveraged to accelerate generation without compromising output coherence or factuality.

1. Motivations, Bottlenecks, and Problem Formulation

Traditional AR decoding strictly generates tokens sequentially, with each token sampled conditioned on the full history of prior tokens. This design guarantees maximal output coherence but presents several bottlenecks:

Latency: Each generation step requires a full sweep through the model, with increasing attention span and KV-cache as the prefix grows. This is memory-bound when batch size is low and compute-bound when sequence length or batch size is high.
Inefficient use of parallel hardware: GPU/TPU parallelism is underutilized due to inherent sequential dependencies.

Non-autoregressive (NAR) decoders, such as those based on discrete diffusion models, can exploit positionwise independence to generate multiple tokens in parallel. However, these methods typically sacrifice output consistency, especially for tasks requiring long-range dependencies and reasoning, leading to factual and logical errors (Ai et al., 25 Sep 2025).

APAR frameworks address these limitations by synthesizing parallel planning (using either parallel NAR plans or explicit hierarchical segmentations) with conventional AR answer generation.

2. Architecture and Mechanisms: Auto-Diffusion + Autoregression (APAR for Reasoning)

The workflow in APAR for reasoning-intensive domains proceeds in two main phases (Ai et al., 25 Sep 2025):

2.1 Parallel Plan Generation via Discrete Diffusion

Forward process: Given a plan as a token sequence $x_0=(x_0^1,\ldots,x_0^L)$, introduce a $T$-step Markov noising process $q(x_t|x_{t-1})$ that applies position-wise randomization with schedule $\{\beta_t\}$, typically linear or cosine.
Reverse process: A learned NAR transformer with parameters $\theta$ models $p_\theta(x_{t-1}|x_t)$, denoising each token position independently via softmax over the vocabulary, conditioned on the global current noised sequence $x_t$.
Training: The Mercury NAR model is trained on (question, gold-plan) pairs, optimizing denoising cross-entropy at uniformly sampled $t$:

$L_{\text{NAR}}(\theta) = \mathbb{E}_{x_0\sim D,\,t\sim \mathcal{U}\{1,\ldots,T\},\,x_t\sim q(x_t|x_0)}\left[ -\log p_\theta(x_{t-1}=x_0\mid x_t, t) \right]$

This process encourages the model to reconstruct human-annotated or teacher-generated CoT plans from partial information, capturing global and local structure.
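The forward noising and the denoising objective can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: it assumes that "position-wise randomization" means replacing a token with a uniformly random vocabulary entry with probability $\beta_t$ at each step, for which the marginal $q(x_t|x_0)$ keeps each token with probability $\prod_{s\le t}(1-\beta_s)$. The function names, the schedule endpoints, and the choice to sum the loss over positions are all hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-3, beta_end=0.2):
    """Linearly spaced corruption probabilities beta_1..beta_T (endpoints are illustrative)."""
    if T == 1:
        return [beta_end]
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def keep_probability(betas, t):
    """Probability that a position is still uncorrupted after t steps: prod_{s<=t} (1 - beta_s)."""
    return math.prod(1.0 - b for b in betas[:t])

def sample_x_t(x0, t, betas, vocab_size, rng):
    """Sample x_t ~ q(x_t | x_0): each position is independently kept or
    replaced by a uniformly random token id (assumed noise model)."""
    keep = keep_probability(betas, t)
    return [tok if rng.random() < keep else rng.randrange(vocab_size) for tok in x0]

def denoising_cross_entropy(probs, x0):
    """-log p_theta(x_0 | x_t, t) under position-wise independence.
    probs[i] is the model's distribution over the vocabulary at position i."""
    return -sum(math.log(probs[i][tok]) for i, tok in enumerate(x0))

# One training example: draw t uniformly, corrupt the gold plan, score a prediction.
rng = random.Random(0)
betas = linear_beta_schedule(T=10)
gold_plan = [1, 2, 3, 4, 5]                 # token ids of a gold plan (made up)
t = rng.randint(1, len(betas))              # t ~ U{1, ..., T}
noised_plan = sample_x_t(gold_plan, t, betas, vocab_size=10, rng=rng)
uniform_prediction = [[0.1] * 10 for _ in gold_plan]   # stand-in for a model output
loss = denoising_cross_entropy(uniform_prediction, gold_plan)
```

A real model would replace `uniform_prediction` with the transformer's softmax outputs conditioned on `noised_plan` and `t`; the uniform stand-in simply gives the loss of an uninformed predictor, $L\log|V|$.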

2.2 Autoregressive Answer Generation

Contextualization: The denoised plan $\hat{x}_0$ is inserted into the prompt of a strong AR model (e.g., GPT-5) using the template $[\text{Question}]\,\langle\text{think}\rangle\,\hat{x}_0\,\langle\text{answer}\rangle$.
AR generation: The AR model then samples answer tokens $\mathbf{y}=(y^1,\ldots,y^K)$ sequentially:

$p(\mathbf{y}|\hat{x}_0,\,\text{Question}) = \prod_{j=1}^K p(y^j|y^{<j},\,\hat{x}_0,\,\text{Question})$

Answer decoding typically uses greedy or beam search.

Inference proceeds by first executing the full NAR diffusion in parallel (all positions updated per step), then invoking the AR answer decoder.
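The two phases can be strung together as in the following Python sketch. It is only a schematic of the control flow described above: `toy_denoise_step` and `toy_next_token` are hypothetical stand-ins for the NAR denoiser and the AR model, and the ASCII markers `<think>` and `<answer>` are assumed spellings of the template delimiters.

```python
import random

def generate_plan(denoise_step, length, num_steps, vocab, rng):
    """Reverse process: start from uniform noise, then update every position at each step."""
    x = [rng.choice(vocab) for _ in range(length)]
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)   # one parallel update of all positions
    return x

def build_prompt(question, plan_tokens):
    """Assemble [Question] <think> plan <answer> for the AR model."""
    return f"{question} <think> {' '.join(plan_tokens)} <answer>"

def greedy_answer(next_token, prompt, eos="<eos>", max_tokens=64):
    """Sequential greedy decoding: one call to the AR model per answer token."""
    answer = []
    while len(answer) < max_tokens:
        tok = next_token(prompt, answer)
        if tok == eos:
            break
        answer.append(tok)
    return answer

# --- hypothetical stand-ins for the two models --------------------------------
TARGET_PLAN = ["add", "the", "numbers"]

def toy_denoise_step(x, t):
    """Pretend denoiser: position i becomes correct once t <= i + 1."""
    return [TARGET_PLAN[i] if t <= i + 1 else tok for i, tok in enumerate(x)]

def toy_next_token(prompt, answer_so_far):
    """Pretend AR model: emits a canned two-token continuation."""
    return ["4", "<eos>"][len(answer_so_far)]

rng = random.Random(0)
plan = generate_plan(toy_denoise_step, length=3, num_steps=3, vocab=["a", "b", "c"], rng=rng)
prompt = build_prompt("What is 2+2?", plan)
answer = greedy_answer(toy_next_token, prompt)
```

Note that the plan costs `num_steps` model calls regardless of its length, while the answer costs one call per token; this asymmetry is where the latency benefit of drafting the plan non-autoregressively comes from.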

3. Architecture and Mechanisms: Auto-Parallel Auto-Regressive Decoding

This APAR variant integrates hierarchical planning and parallel AR decoding within a standard transformer, primarily to accelerate generation in general LLM serving contexts (Liu et al., 2024).

3.1 Hierarchical Plan Construction

Instruction tuning: Training data is converted to “paragraph trees,” where nodes correspond to independent segments (e.g., list items, subparagraphs). Two control tokens ([Fork], [Child]) are introduced to demarcate hierarchical/parallelizable structure.
Representational structure: At test time, outputs are segmented into $K$ units $\mathcal{S}=\{s_1,\ldots,s_K\}$, partitioned by index breakpoints ($0=p_0<p_1<\cdots<p_K=L$) and organized into an explicit tree $\mathcal{T}$, where each node has up to two pointers (first_child, next_sibling).
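A tree with `first_child` and `next_sibling` pointers can be represented as below. This is a sketch under assumptions: the `Node` class and the pre-order `linearize` function are illustrative names, and the traversal order (a node, then its child subtree, then its siblings) is one natural reading of how the segments map back to a linear text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """One segment of the output, linked in first-child / next-sibling form."""
    text: str
    first_child: Optional["Node"] = None
    next_sibling: Optional["Node"] = None

def linearize(node: Optional[Node]) -> List[str]:
    """Pre-order traversal: emit a node, then its child subtree, then its siblings."""
    out: List[str] = []
    while node is not None:
        out.append(node.text)
        out.extend(linearize(node.first_child))
        node = node.next_sibling
    return out

# A two-item list: each item is a sibling, and the detail text hangs off the first item.
item2 = Node("2. Oranges")
item1 = Node("1. Apples", first_child=Node("Apples are rich in fiber."), next_sibling=item2)
root = Node("Here are two fruits:", first_child=item1)
segments = linearize(root)
```

In this example the two list items are siblings, so they are the units that could be decoded concurrently, while the detail sentence depends only on its own parent.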

3.2 Parallel Decoding Process

Core algorithm: Maintain a group $G$ of active sequences (threads). For each unfinished sequence, sample the next token from the model. When a thread outputs [Fork], a new decoding thread is spawned that inherits the parent’s KV-cache and begins its continuation with [Child]. All threads then proceed independently and autoregressively, each restricted to its designated segment.
Local AR, global parallelism: Within each segment, strict AR order is preserved; parallelism is realized between sibling segments.
Resource and step reduction: Because threads decode in parallel, total wall-clock steps are reduced by up to $K\times$, where $K$ is the number of segments. Empirically, $K\approx5$ yields a practical $\sim2\times$ wall-clock reduction; the gap from the ideal $K\times$ is attributed to overheads and non-parallelizable subtrees.
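The relationship between segment count and step reduction can be made concrete with a small step-counting model. It is a deliberate simplification and not taken from the cited work: each segment is assumed to consist of a header (after which the next sibling is forked) and a body, control tokens and per-fork overheads are ignored, and the segment lengths are invented.

```python
def sequential_steps(segments):
    """Plain AR decoding: one step per token, segments decoded back to back."""
    return sum(header + body for header, body in segments)

def parallel_steps(segments):
    """Idealized parallel decoding: segment i starts once the headers of all
    earlier siblings have been emitted, then runs for header + body steps."""
    start, finish = 0, 0
    for header, body in segments:
        finish = max(finish, start + header + body)
        start += header   # the next sibling is forked after this header
    return finish

balanced = [(5, 45)] * 5                     # K = 5 equal segments
print(sequential_steps(balanced), parallel_steps(balanced))      # 250 70

imbalanced = [(5, 200), (5, 5), (5, 5)]      # one long segment dominates
print(sequential_steps(imbalanced), parallel_steps(imbalanced))  # 225 205
```

Even in this overhead-free model the balanced case reaches about 3.6× rather than the ideal 5×, because the forks are staggered, and the imbalanced case gains almost nothing because the longest segment sets the wall-clock time.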

4. Computational and Quality Implications

4.1 Resource Usage

KV-cache: In AR, max cached tokens = prompt length $+$ generated length. In APAR, each branch’s cache is released immediately after its segment completes, reducing peak KV-cache usage (e.g., $-27\%$ on Vicuna) (Liu et al., 2024).
Attention computation: AR requires $O(L)$ attention for each token; APAR restricts attention to depth-$d_i$ paths in the tree, with empirical savings ($-35\%$ on Vicuna, $-16\%$ on MT Bench).
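The effect of releasing a branch's cache on completion can be illustrated with a toy accounting model. The assumptions are mine rather than the source's: each segment has a shared header (kept for the whole generation) and a private body (released one step after its thread finishes), threads start as soon as earlier headers are emitted, and all lengths are invented. The sketch is not meant to reproduce the reported 27% figure.

```python
def peak_kv_tokens(prompt_len, segments, release_on_finish=True):
    """Peak number of cached tokens over the whole generation.
    segments: list of (header_len, body_len); header tokens stay cached,
    body tokens are dropped after their thread finishes if release_on_finish."""
    starts, s = [], 0
    for header, _ in segments:
        starts.append(s)
        s += header
    horizon = max((st + h + b for st, (h, b) in zip(starts, segments)), default=0)
    peak = prompt_len
    for step in range(1, horizon + 1):
        live = prompt_len
        for st, (h, b) in zip(starts, segments):
            produced = min(max(step - st, 0), h + b)   # tokens this thread has emitted
            header_done = min(produced, h)
            body_done = produced - header_done
            finished_earlier = step - st > h + b       # thread ended on a previous step
            live += header_done
            if not (release_on_finish and finished_earlier):
                live += body_done
        peak = max(peak, live)
    return peak

segments = [(5, 45)] * 5
ar_peak = peak_kv_tokens(20, segments, release_on_finish=False)   # prompt + all tokens
apar_peak = peak_kv_tokens(20, segments, release_on_finish=True)
print(ar_peak, apar_peak)   # 270 220
```

With `release_on_finish=False` the peak equals prompt length plus generated length, matching the AR case stated above; with release enabled, the peak occurs just before the first thread's body is freed.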

4.2 Speed and Latency

Empirical results (general APAR): Up to $2\times$ end-to-end speedup with vanilla APAR; combined with speculative decoding, up to $4\times$ acceleration is observed (Liu et al., 2024).
Empirical results (reasoning APAR): NAR→AR yields pass@1 scores of $78\%$ versus $52\%$ for AR→AR, a $+26$ percentage point improvement, while achieving a $2{-}4\times$ reduction in inference latency relative to AR-only baselines (Ai et al., 25 Sep 2025).

4.3 Quality Maintenance

Quality, as measured by GPT-4–graded correctness, is invariant within a $\pm2\%$ band for general APAR settings, indicating no loss from parallelization in suitable categories (Liu et al., 2024). In rigorous reasoning tasks, the parallel-NAR plan with AR answer achieves both higher accuracy and efficiency (Ai et al., 25 Sep 2025).

5. Limitations, Trade-offs, and Task Suitability

Structural precondition: APAR’s acceleration is contingent on the presence of clear, parallelizable structure (e.g., bullet lists, subtasks). In tasks such as code or math where step-wise AR coherence is mandatory, APAR does not accelerate and reverts to single-thread AR decoding (Liu et al., 2024).
Per-fork overhead: Small $K$ or imbalanced segment sizes diminish net gains due to KV-cache copy and thread management overheads.
Choice of planning steps / tree depth: Empirically, intermediate values of diffusion steps $T$ or moderate tree depths yield the best cost–accuracy trade-off; increasing either further leads to diminishing returns.

A plausible implication is that APAR selectively amplifies throughput for applications dominated by hierarchically or structurally decomposable outputs, but does not extend blanket acceleration to all LLM tasks.

6. Extensions and Future Directions

Integration with schedulers and decoding methods: APAR can be combined orthogonally with paged attention, dynamic batching, and speculative decoding (e.g., “Medusa-APAR”) for further acceleration (Liu et al., 2024).
Model compression: The approach is compatible with quantization and pruning techniques.
Application to new tasks: Document summarization, machine translation (with clause- or sentence-level forking), and similar structured NLG tasks are identified as potential domains for APAR acceleration.
Richer parallel planning: Extension of the planning vocabulary (e.g., supporting deeper trees or n-way forks) may unlock even greater concurrency.
No end-to-end gradients: Notably, the APAR reasoning pipeline preserves module independence—no joint training of NAR and AR is needed—and this modularity enables easier adaptation and reuse of constituent models (Ai et al., 25 Sep 2025).

7. Representative Algorithmic Schematics

APAR for Reasoning (Discrete Diffusion + AR)

Algorithm APAR-Inference(question q):
    // PLAN GENERATION (NAR diffusion; all L positions are updated at each step)
    1. Initialize x_T ∼ Uniform over V^L
    2. for t = T downto 1:
         x_{t-1} ← argmax p_θ(x_{t-1} | x_t, t)    // position-wise argmax
    3. \hat{x}_0 ← x_0
    // ANSWER GENERATION (AR)
    4. Prompt P ← “[q] 〈think〉 \hat{x}_0 〈answer〉”
    5. y* ← GreedyDecode_AR( p_φ(·|P) )
    6. Return y*

(Ai et al., 25 Sep 2025)

APAR Decoding for Hierarchical/Parallel Text Generation

function APAR_Decode(prompt p, Θ):
    G ← {p}               // active threads
    build root node r for p
    while ∃ s ∈ G not finished:
        for each s in G not finished:
            x ← Sample(Θ, s)
            s.append(x)
            if x == [Fork]:
                // spawn a sibling thread that inherits s's KV-cache prefix
                s' ← fork(s); s'.append([Child]); G ← G ∪ {s'}
                attach a node for s' to the tree rooted at r
            if x == [EOS]: free_KV(s); mark s finished
    return linearize_tree(r)

(Liu et al., 2024)
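A runnable Python rendering of the fork-and-decode loop, with a scripted toy model in place of a transformer, is given below. It is a sketch under assumptions and not the reference implementation: `toy_next_token` and its `SCRIPTS` table are hypothetical, threads copy their token prefix instead of sharing a KV-cache, and tree construction and cache release are omitted.

```python
FORK, CHILD, EOS = "[Fork]", "[Child]", "[EOS]"

def apar_decode(prompt, next_token, max_steps=100):
    """Decode all active threads one token per step; fork a new thread on [Fork]."""
    threads = [{"tokens": list(prompt), "done": False}]
    steps = 0
    while any(not th["done"] for th in threads) and steps < max_steps:
        steps += 1
        spawned = []
        for th in threads:
            if th["done"]:
                continue
            tok = next_token(th["tokens"])
            th["tokens"].append(tok)
            if tok == FORK:
                # The new thread inherits the prefix and starts with [Child].
                spawned.append({"tokens": th["tokens"] + [CHILD], "done": False})
            elif tok == EOS:
                th["done"] = True
        threads.extend(spawned)   # new threads take their first step next round
    return threads, steps

# --- hypothetical scripted "model": picks a continuation by thread type --------
SCRIPTS = {
    "root": ["1.", "Apples", FORK, "are", "red", EOS],
    "child": ["2.", "Pears", "are", "green", EOS],
}

def toy_next_token(tokens):
    """Return the next scripted token based on how far this thread has progressed."""
    if CHILD in tokens:
        last_child = len(tokens) - 1 - tokens[::-1].index(CHILD)
        return SCRIPTS["child"][len(tokens) - last_child - 1]
    return SCRIPTS["root"][len(tokens) - 1]   # 1 = prompt length in this toy

threads, steps = apar_decode(["Q"], toy_next_token)
print(steps)   # 8
```

The two threads together emit 11 sampled tokens in 8 decoding steps, since the second item is generated while the first item's detail text is still being written.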

Taken together, these two forms of parallel plan-aware AR offer a modular way to reduce the latency and scalability bottlenecks of LLM deployment, with documented gains in both efficiency and output quality when the data and task have suitable structure.


References

Ai et al. (2025). Parallel Thinking, Sequential Answering: Bridging NAR and AR for Efficient Reasoning.

Liu et al. (2024). APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding.
