Parallel Context-of-Experts Decoding (PCED)

Updated 17 January 2026

The paper introduces PCED, a decoding framework that aggregates evidence from parallel expert caches using a retrieval-aware contrastive fusion rule.
PCED achieves efficient multi-document reasoning with up to 180× latency improvements and competitive performance on benchmarks like NQ and HotpotQA.
The method bypasses full-context attention bottlenecks but relies on full logits access and high-quality retrieval scores to operate effectively.

Parallel Context-of-Experts Decoding (PCED) is a framework developed for efficient and effective retrieval-augmented text generation (RAG). It establishes a method for aggregating evidence across multiple retrieved documents without the computational or architectural overhead of creating a joint long-context attention map. PCED operates at decoding time through an ensemble of independent "expert" contexts, with next-token distributions synchronized using a retrieval-aware contrastive objective. This design achieves strong multi-document reasoning capabilities and substantial latency improvements over standard long-context approaches and prior ensemble methods (Corallo et al., 13 Jan 2026, Liu et al., 2021).

1. Motivation and Problem Formulation

In conventional retrieval-augmented generation, there is a fundamental trade-off between supporting multi-document reasoning and achieving inference efficiency. Concatenating all retrieved documents into a single prompt lets the model's self-attention integrate information across documents ("multi-hop" reasoning), which is essential for difficult QA, few-shot learning, and reasoning tasks. However, self-attention over a very long context ($L_{\mathrm{total}} \approx \sum_{k=1}^K L_k + L_{\mathrm{query}}$) incurs $O(L_{\mathrm{total}}^2)$ time and $O(L_{\mathrm{total}} \cdot d)$ memory during the prefill step, which becomes a major latency bottleneck as the number of documents ($K$) or their length increases. Additionally, models suffer from "lost-in-the-middle" effects, where crucial information buried in the middle of a very long context is overlooked.

An alternative approach encodes each retrieved document into a separate key/value (KV) cache, which can be precomputed offline. This allows rapid time-to-first-token (TTFT) at inference, as the model only needs to process the query and fetch relevant KV caches. However, cross-document attention is never constructed, breaking the model’s ability to aggregate information across documents and severely limiting multi-document and multi-hop capabilities (Corallo et al., 13 Jan 2026).

2. Framework Overview

PCED is a training-free decoding algorithm designed to recover multi-document reasoning without forming a global attention map. The method applies two key ideas:

Each retrieved document is treated as an isolated "expert," using its own frozen KV-cache.
Aggregation of information is performed during decoding by synchronizing the independent next-token proposals from all experts via a retrieval-aware, contrastive fusion rule.

High-Level Pipeline

Datastore Construction: Offline, for each document $d_i$ in the corpus, an embedding $e_i$ and precomputed KV-cache $K_i$ are created and stored.
Retrieval: At inference, the top-$N$ documents $d_1, \dots, d_N$ are retrieved with normalized relevance scores $r_1, \dots, r_N \in (0,1)$.
Expert Initialization: $N+1$ parallel experts are instantiated:
- Expert 0 ("amateur"): No context cache, representing the model prior.
- Experts $1, \dots, N$ ("contextual experts"): Each with its own $K_i$ cache.
Autoregressive Decoding: At every timestep $t$, each expert proposes a next-token distribution. These distributions are combined using the contrastive decoding rule to select the token for generation, and the same token is appended to all experts' histories. By synchronizing histories, generated tokens implicitly merge evidence across documents over time (Corallo et al., 13 Jan 2026).

3. Retrieval-Aware Contrastive Decoding Rule

The core of PCED is a decoding-time logit fusion mechanism, parameterized as follows:

Let the vocabulary be $\mathcal{V}$, with $|\mathcal{V}|$ tokens.
Each expert $i$ at time $t$ produces logits $z_i^{(t)} \in \mathbb{R}^{|\mathcal{V}|}$; the "amateur" gives $z_0^{(t)}$.
Each expert $i \in \{1, \dots, N\}$ has a normalized retrieval score $r_i \in (0,1)$.
$\beta$ controls the strength of the contrastive correction; $\gamma$ gates based on retrieval score.

For each expert $i$, the fused logits are:

$\tilde{z}_i^{(t)} = z_i^{(t)} + \beta \left( z_i^{(t)} - z_0^{(t)} \right) + \gamma \log r_i$

$\beta \left( z_i^{(t)} - z_0^{(t)} \right)$ is the contrastive term, up-ranking tokens favored by the contextual expert over the prior.
$\gamma \log r_i$ injects retrieval strength as a prior, suppressing irrelevant experts.

Token selection is performed by taking the maximum over all experts:

$y_t = \arg\max_{v \in \mathcal{V}} \max_{i \in \{1, \dots, N\}} \tilde{z}_i^{(t)}(v)$
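The fusion and selection steps can be illustrated with a small sketch. It assumes one reading of the rule described above: each expert's logits are shifted by a contrastive strength `beta` times their difference from the amateur logits, plus a retrieval gate `gamma` times the log of the expert's retrieval score. The function names, the three-token vocabulary, and all numeric values are hypothetical.

```python
import math


def fuse_logits(expert_logits, amateur_logits, retrieval_score, beta, gamma):
    """Assumed fusion form: z + beta * (z - z0) + gamma * log(r)."""
    prior = gamma * math.log(retrieval_score)
    return [z + beta * (z - z0) + prior
            for z, z0 in zip(expert_logits, amateur_logits)]


def select_token(all_expert_logits, amateur_logits, retrieval_scores, beta, gamma):
    """Return the token id with the highest fused logit across all experts."""
    best_token, best_score = -1, -math.inf
    for logits, r in zip(all_expert_logits, retrieval_scores):
        fused = fuse_logits(logits, amateur_logits, r, beta, gamma)
        for token_id, score in enumerate(fused):
            if score > best_score:
                best_token, best_score = token_id, score
    return best_token


# Toy setup: expert 1 is well retrieved and mildly prefers token 0,
# expert 2 is poorly retrieved and strongly prefers token 1.
amateur = [1.0, 1.0, 1.0]
experts = [[2.0, 1.0, 0.5], [0.0, 3.0, 0.0]]
scores = [0.9, 0.1]

with_gate = select_token(experts, amateur, scores, beta=1.0, gamma=2.0)
without_gate = select_token(experts, amateur, scores, beta=1.0, gamma=0.0)
print(with_gate, without_gate)
```

In this toy case the retrieval gate changes the outcome: with `gamma=0.0` the low-scoring expert wins on raw confidence, while `gamma=2.0` penalizes it enough for the better-retrieved expert's token to be chosen.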

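At the first decoding step, the contrastive strength $\beta$ is set adaptively from the Jensen–Shannon divergence between each contextual expert's next-token distribution and the amateur's, and is then held fixed. The empirical settings of $\beta$ and the retrieval gate $\gamma$ are robust across models and tasks (Corallo et al., 13 Jan 2026).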
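For reference, the Jensen–Shannon divergence between two next-token distributions can be computed as in the sketch below. How the divergence is mapped to the contrastive strength is not specified in this text, so only the divergence itself is shown; the helper names and the example logits are assumptions.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, bounded above by ln(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical first-step logits for the amateur and one contextual expert.
p_amateur = softmax([1.0, 1.0, 1.0])
p_expert = softmax([4.0, 1.0, 0.0])
divergence = js_divergence(p_expert, p_amateur)
print(f"JSD(expert, amateur) = {divergence:.4f} nats")
```

A divergence near zero means the retrieved context barely changes the model's prediction, while a value near $\ln 2$ means the expert and the amateur disagree almost completely.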

4. Algorithmic Steps and Complexity

The autoregressive PCED decoding process is structured as:

Input: Query $q$, frozen expert caches $K_1, \dots, K_N$, relevance scores $r_1, \dots, r_N$.
Initialization: Shared generation history $y_{<t}$, initially empty. Compute amateur logits $z_0$.
Decoding Loop (for each timestep $t$):
- At the first step, compute the contrastive strength $\beta$ from the Jensen–Shannon divergence between each expert's distribution and the amateur's.
- For each expert $i$ in parallel: compute logits $z_i^{(t)}$.
- Compute fused logits $\tilde{z}_i^{(t)}$.
- For each token $v$, take $\max_i \tilde{z}_i^{(t)}(v)$.
- Select and output $y_t$. Append $y_t$ to the shared history and all expert caches.
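The loop above can be sketched end to end with toy experts. This is a simplified illustration, not the authors' implementation: each expert is a hypothetical Python callable mapping the shared history to logits (a real system would run a forward pass against that expert's frozen KV cache), the contrastive strength `beta` is passed in as a constant rather than derived from a divergence, and the fusion form is the assumed one of expert logits plus `beta` times the difference from the amateur plus `gamma` times the log retrieval score.

```python
import math


def pced_decode(amateur, experts, scores, beta, gamma, max_steps, eos_id):
    """Greedy decoding with one shared history across all experts."""
    history = []
    for _ in range(max_steps):
        z0 = amateur(history)
        best_token, best_score = eos_id, -math.inf
        for expert, r in zip(experts, scores):
            z = expert(history)
            prior = gamma * math.log(r)
            for token_id, (zv, z0v) in enumerate(zip(z, z0)):
                fused = zv + beta * (zv - z0v) + prior
                if fused > best_score:
                    best_token, best_score = token_id, fused
        history.append(best_token)  # the same token is appended for every expert
        if best_token == eos_id:
            break
    return history


# Toy vocabulary: 0 = end of sequence, 1 = "A", 2 = "B".
def amateur(history):
    return [0.0, 0.0, 0.0]


def expert_a(history):
    # Holds the first hop: knows the answer starts with token 1.
    return [0.0, 4.0, 0.0] if not history else [1.0, 0.0, 0.0]


def expert_b(history):
    # Holds the second hop: knows token 2 follows token 1, then stops.
    if history == [1]:
        return [0.0, 0.0, 4.0]
    if len(history) >= 2:
        return [4.0, 0.0, 0.0]
    return [0.0, 0.0, 0.0]


output = pced_decode(amateur, [expert_a, expert_b], [0.8, 0.6],
                     beta=1.0, gamma=1.0, max_steps=8, eos_id=0)
print(output)
```

Neither toy expert could produce the full sequence alone. Because every expert sees the token chosen at the previous step, the second expert can continue from evidence contributed by the first.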

Complexity Comparison:

Approach	Per-token Complexity	Prefill/TTFT
Full-context	… + …	Extremely high (long context)
KV-cache only	…	Very low
PCED	…	Very low

Because the number of experts is moderate (…), PCED's online cost is competitive for large-scale RAG (Corallo et al., 13 Jan 2026).

5. Comparative Evaluation and Empirical Performance

PCED surpasses previous parallel and agentic approaches, and rivals or exceeds full-context attention in various RAG benchmarks:

LOFT RAG (Mistral-13B): PCED-Dense achieves 81 EM on NQ versus 76 for full-context, and 80 versus 38 for single-document configurations.
HotpotQA: PCED provides +9–19 EM over standard KV-merge (APE) and +8 over MapReduce, closely matching full-prompt performance.
LongBench (Qwen3-8B): Improves multi-doc QA by 5–8 absolute points over full context; code completion improves by +9 points.
Latency: Time-to-first-token (TTFT) is reduced by up to 180× (0.14s for PCED vs. 25.5s for full prompt at K=90); end-to-end latency for 65k tokens and 512 output tokens is 1.7× faster.

PCED's decode-time logit fusion enables multi-document "stitching" without joint attention, offering rapid latency and strong accuracy even as context size scales (Corallo et al., 13 Jan 2026).

6. Limitations and Operational Constraints

Three primary limitations are identified:

Access to Full Logits: PCED requires raw logits from each expert at every decoding step, excluding deployment via closed APIs or where only token samples or partial probabilities are available.
Dependence on Retrieval Quality: If the most relevant document is not retrieved (it is absent from $d_1, \dots, d_N$), PCED cannot recover its evidence. The retrieval-aware prior can suppress distractors, but it cannot compensate for evidence that was never retrieved.
FP16 Storage Overhead: Storing KV caches at scale can be expensive (e.g., 11GB for 1.2K documents with average length 74 tokens for an 8B-parameter model). PCED is best suited for relatively static, high-read-frequency corpora (Corallo et al., 13 Jan 2026).
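The quoted storage figure can be sanity-checked with a back-of-the-envelope calculation. The architecture values below (32 layers, 8 key/value heads, head dimension 128) are assumptions typical of some 8B-parameter models with grouped-query attention; they are not stated in the source, and "1.2K documents" is taken as 1,200.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes to store keys and values for one token (FP16 = 2 bytes per value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def datastore_bytes(num_docs, avg_tokens, **arch):
    """Total KV-cache storage for a corpus of precomputed document caches."""
    return num_docs * avg_tokens * kv_bytes_per_token(**arch)


# Assumed architecture, not taken from the paper.
arch = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**arch)
total = datastore_bytes(num_docs=1200, avg_tokens=74, **arch)
total_gb = total / 1e9

print(f"{per_token} bytes per token, {total_gb:.1f} GB total")
```

Under these assumptions the estimate is about 11.6 GB, in line with the quoted 11GB. A model with more layers or more key/value heads would need proportionally more storage.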

PCED draws conceptual inspiration from product-of-experts models used for controlled generation (DExperts) (Liu et al., 2021), but specializes the paradigm to retrieval-augmented settings with dynamic, retrieval-weighted, and contrastively calibrated expert mixing.

Potential future directions include:

End-to-end training to learn expert selection and dynamic weighting via a gating network, removing reliance on external retrieval scores.
Hybrid attention-decoding architectures that support sparse cross-document attention for tractable scaling.
Extending the framework to support parametric and non-parametric mixtures, and to integrate explicit chain-of-thought expert streams.
Application to multi-attribute and hierarchical expert settings, as well as style transfer domains (Corallo et al., 13 Jan 2026, Liu et al., 2021).

In summary, PCED defines a robust, efficient decoding-time algorithm for multi-document retrieval-augmented generation, combining per-document experts via a retrieval-aware, contrastive fusion rule that maintains accuracy and dramatically improves inference scalability.


References

Corallo et al. (2026). Parallel Context-of-Experts Decoding for Retrieval Augmented Generation.

Liu et al. (2021). DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts.
