Parallel Context-of-Experts Decoding (PCED)
- The paper introduces PCED, a decoding framework that aggregates evidence from parallel expert caches using a retrieval-aware contrastive fusion rule.
- PCED achieves efficient multi-document reasoning with up to 180× latency improvements and competitive performance on benchmarks like NQ and HotpotQA.
- The method bypasses full-context attention bottlenecks but relies on full logits access and high-quality retrieval scores to operate effectively.
Parallel Context-of-Experts Decoding (PCED) is a framework developed for efficient and effective retrieval-augmented text generation (RAG). It establishes a method for aggregating evidence across multiple retrieved documents without the computational or architectural overhead of creating a joint long-context attention map. PCED operates at decoding time through an ensemble of independent "expert" contexts, with next-token distributions synchronized using a retrieval-aware contrastive objective. This design achieves strong multi-document reasoning capabilities and substantial latency improvements over standard long-context approaches and prior ensemble methods (Corallo et al., 13 Jan 2026, Liu et al., 2021).
1. Motivation and Problem Formulation
In conventional retrieval-augmented generation, there is a fundamental trade-off between supporting multi-document reasoning and achieving inference efficiency. Concatenating all retrieved documents into a single prompt lets the model's self-attention freely integrate information across documents ("multi-hop" reasoning), which is essential for difficult QA, few-shot learning, and reasoning tasks. However, self-attention over a very long context incurs substantial time and memory costs during the prefill step (quadratic in context length for standard attention), creating a major latency bottleneck as the number of documents ($K$) or their length increases. Additionally, models suffer from "lost-in-the-middle" effects, where crucial information buried in a very long context is overlooked.
An alternative approach encodes each retrieved document into a separate key/value (KV) cache, which can be precomputed offline. This allows rapid time-to-first-token (TTFT) at inference, as the model only needs to process the query and fetch relevant KV caches. However, cross-document attention is never constructed, breaking the model’s ability to aggregate information across documents and severely limiting multi-document and multi-hop capabilities (Corallo et al., 13 Jan 2026).
2. Framework Overview
PCED is a training-free decoding algorithm designed to recover multi-document reasoning without forming a global attention map. The method applies two key ideas:
- Each retrieved document is treated as an isolated "expert," using its own frozen KV-cache.
- Aggregation of information is performed during decoding by synchronizing the independent next-token proposals from all experts via a retrieval-aware, contrastive fusion rule.
High-Level Pipeline
- Datastore Construction: Offline, for each document in the corpus, an embedding and precomputed KV-cache are created and stored.
- Retrieval: At inference, the top-$K$ documents are retrieved with normalized relevance scores $r_1, \dots, r_K$.
- Expert Initialization: $K+1$ parallel experts are instantiated:
  - Expert 0 ("amateur"): No context cache, representing the model prior.
  - Experts $1, \dots, K$ ("contextual experts"): Each with its own cache.
- Autoregressive Decoding: At every timestep $t$, each expert proposes a next-token distribution. These distributions are combined using the contrastive decoding rule to select the token for generation, and the same token is appended to all experts' histories. By synchronizing histories, generated tokens implicitly merge evidence across documents over time (Corallo et al., 13 Jan 2026).
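The datastore and retrieval steps of the pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `DatastoreEntry` class, the `retrieve` function, the use of cosine similarity, and the softmax normalization of scores are all assumptions made for the example, since the text above only states that scores are normalized. The `kv_cache` field is a placeholder for whatever object holds a document's precomputed keys and values.

```python
import math
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class DatastoreEntry:
    """Hypothetical record built offline for one document."""
    doc_id: str
    embedding: List[float]  # document embedding used for retrieval
    kv_cache: Any           # placeholder for the precomputed per-document KV cache


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb: List[float], datastore: List[DatastoreEntry],
             k: int) -> List[Tuple[DatastoreEntry, float]]:
    """Return the top-k entries paired with relevance scores.

    Assumption: scores are normalized with a softmax over the top-k
    similarities, so each score is positive and the scores sum to 1.
    """
    scored = [(cosine(query_emb, e.embedding), e) for e in datastore]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    weights = [math.exp(sim) for sim, _ in top]
    total = sum(weights)
    return [(entry, w / total) for (_, entry), w in zip(top, weights)]
```

Keeping the scores strictly positive matters if they are later passed through a logarithm, which is why a softmax is used here rather than a scheme that can produce zeros.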
3. Retrieval-Aware Contrastive Decoding Rule
The core of PCED is a decoding-time logit fusion mechanism, parameterized as follows:
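- Let the vocabulary be $\mathcal{V}$, with $|\mathcal{V}|$ tokens.
- Each expert $k$ at time $t$ produces logits $s_k^{(t)} \in \mathbb{R}^{|\mathcal{V}|}$; the "amateur" gives $s_0^{(t)}$.
- Each expert has a normalized retrieval score $r_k$.
- $\beta$ controls the strength of the contrastive correction; $\gamma$ gates based on retrieval score.
For each expert $k$, the fused logits are:
$$\tilde{s}_k^{(t)} = s_k^{(t)} + \beta \left( s_k^{(t)} - s_0^{(t)} \right) + \gamma \log r_k$$
- $\beta \left( s_k^{(t)} - s_0^{(t)} \right)$ is the contrastive term, up-ranking tokens favored by the contextual expert over the prior.
- $\gamma \log r_k$ injects retrieval strength as a prior, suppressing irrelevant experts.
Token selection is performed by taking the maximum over all experts:
$$y_t = \arg\max_{v \in \mathcal{V}} \; \max_{k} \; \tilde{s}_k^{(t)}(v)$$
At $t = 1$, $\beta$ is adaptively set via the Jensen–Shannon divergence between each expert's next-token distribution and the amateur's, then fixed. The empirically chosen values of $\beta$ and $\gamma$ are reported to be robust across models and tasks (Corallo et al., 13 Jan 2026).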
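The fusion rule described above can be sketched in a few lines of plain Python. This is an illustrative reading of the rule, assuming fused logits of the form "expert logits, plus a contrastive correction against the amateur, plus a log-retrieval prior"; the function names `fuse`, `select_token`, and `js_divergence` are hypothetical. The exact mapping from the Jensen–Shannon divergence to the contrastive strength is not given in this text, so `beta` is simply passed in as an argument and `js_divergence` is shown only as the standard definition.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Standard Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mix = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def fuse(expert_logits: Sequence[float], amateur_logits: Sequence[float],
         r: float, beta: float, gamma: float) -> List[float]:
    """Per-token fused logits: s_k + beta * (s_k - s_0) + gamma * log(r_k)."""
    prior = gamma * math.log(r)  # requires r > 0
    return [s + beta * (s - s0) + prior
            for s, s0 in zip(expert_logits, amateur_logits)]


def select_token(all_expert_logits: Sequence[Sequence[float]],
                 amateur_logits: Sequence[float],
                 scores: Sequence[float], beta: float, gamma: float) -> int:
    """Take the max fused logit over experts for each token, then the argmax token."""
    fused = [fuse(s, amateur_logits, r, beta, gamma)
             for s, r in zip(all_expert_logits, scores)]
    best_per_token = [max(column) for column in zip(*fused)]
    return max(range(len(best_per_token)), key=best_per_token.__getitem__)
```

With two experts that each strongly prefer a different token, setting `gamma` to zero lets the expert with the larger raw logit win, while a positive `gamma` shifts the decision toward the expert with the higher retrieval score. `softmax` is included only so that logits can be turned into distributions before calling `js_divergence`.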
4. Algorithmic Steps and Complexity
The autoregressive PCED decoding process is structured as:
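- Input: Query $q$, frozen expert caches $\{C_k\}_{k=1}^{K}$, relevance scores $\{r_k\}_{k=1}^{K}$.
- Initialization: Shared generation history $y_{<1} = \emptyset$. Compute amateur logits $s_0^{(1)}$.
- Decoding Loop (for $t = 1, 2, \dots$):
  - If $t = 1$, compute $\beta$ from the Jensen–Shannon divergence between each expert's distribution and the amateur's.
  - For each expert $k$ in parallel: compute logits $s_k^{(t)}$.
  - Compute fused logits $\tilde{s}_k^{(t)} = s_k^{(t)} + \beta \left( s_k^{(t)} - s_0^{(t)} \right) + \gamma \log r_k$.
  - For each token $v$, take $\max_k \tilde{s}_k^{(t)}(v)$.
  - Select and output $y_t = \arg\max_v \max_k \tilde{s}_k^{(t)}(v)$. Append $y_t$ to the shared history and all expert caches.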
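The loop above can be illustrated with toy experts. This is a sketch under stated assumptions, not the authors' code: each "expert" is modeled as a plain function from the shared history to a logit vector (standing in for a model forward pass over one document's cache), `make_toy_expert` is a hypothetical helper, experts are evaluated sequentially rather than in parallel, and `beta` is passed as a constant because the mapping from Jensen–Shannon divergence to contrastive strength is not specified here.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# history of token ids -> logits over the vocabulary
Expert = Callable[[Sequence[int]], List[float]]


def make_toy_expert(table: Dict[Tuple[int, ...], int], vocab_size: int) -> Expert:
    """Hypothetical expert: boosts one token for each history it 'knows' about."""
    def expert(history: Sequence[int]) -> List[float]:
        logits = [0.0] * vocab_size
        favored = table.get(tuple(history))
        if favored is not None:
            logits[favored] = 5.0
        return logits
    return expert


def pced_decode(amateur: Expert, experts: Sequence[Expert], scores: Sequence[float],
                beta: float, gamma: float, max_new_tokens: int, eos_id: int) -> List[int]:
    history: List[int] = []  # shared by the amateur and every contextual expert
    for _ in range(max_new_tokens):
        s0 = amateur(history)
        best = [-math.inf] * len(s0)  # running max over experts, per token
        for expert, r in zip(experts, scores):
            sk = expert(history)
            prior = gamma * math.log(r)
            for v, (a, b) in enumerate(zip(sk, s0)):
                fused = a + beta * (a - b) + prior
                if fused > best[v]:
                    best[v] = fused
        token = max(range(len(best)), key=best.__getitem__)
        history.append(token)  # the same token extends every expert's history
        if token == eos_id:
            break
    return history


# Toy run: expert A knows the first token, expert B knows the second,
# and both agree on the end-of-sequence token (id 3).
vocab_size = 4
expert_a = make_toy_expert({(): 1, (1, 2): 3}, vocab_size)
expert_b = make_toy_expert({(1,): 2, (1, 2): 3}, vocab_size)
amateur = make_toy_expert({}, vocab_size)  # no context: flat logits
output = pced_decode(amateur, [expert_a, expert_b], [0.6, 0.4],
                     beta=1.0, gamma=1.0, max_new_tokens=8, eos_id=3)
```

In this toy run neither expert alone can produce the full sequence, but because the selected token is appended to the shared history, the second expert picks up where the first left off. That is the "stitching" effect the synchronized histories are meant to provide.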
Complexity Comparison:
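| Approach | Per-token Complexity | Prefill/TTFT |
|---|---|---|
| Full-context | Attention over the full concatenated context plus generated tokens | Extremely high (long context) |
| KV-cache only | Attention over the fetched caches | Very low |
| PCED | $K+1$ parallel expert forward passes | Very low |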
Because $K$ is moderate (8–32), PCED's online cost is competitive for large-scale RAG (Corallo et al., 13 Jan 2026).
5. Comparative Evaluation and Empirical Performance
PCED surpasses previous parallel and agentic approaches, and rivals or exceeds full-context attention on several RAG benchmarks:
- LOFT RAG (Mistral-13B): PCED-Dense achieves 81 EM on NQ versus 76 for full-context, and 80 versus 38 for single-document configurations.
- HotpotQA: PCED provides +9–19 EM over standard KV-merge (APE) and +8 over MapReduce, closely matching full-prompt performance.
- LongBench (Qwen3-8B): Improves multi-doc QA by 5–8 absolute points over full context; code completion improves by +9 points.
- Latency: Time-to-first-token (TTFT) is reduced by up to 180× (0.14s for PCED vs. 25.5s for the full prompt at $K=90$); end-to-end generation with a 65k-token context and 512 output tokens is 1.7× faster.
PCED's decode-time logit fusion enables multi-document "stitching" without joint attention, offering rapid latency and strong accuracy even as context size scales (Corallo et al., 13 Jan 2026).
6. Limitations and Operational Constraints
Three primary limitations are identified:
- Access to Full Logits: PCED requires raw logits from each expert at every decoding step, excluding deployment via closed APIs or where only token samples or partial probabilities are available.
- Dependence on Retrieval Quality: If the most relevant document is not retrieved, PCED cannot recover its evidence. The retrieval-aware prior can suppress distractors, but it cannot supply information that is absent from the retrieved set.
- FP16 Storage Overhead: Storing KV caches at scale can be expensive (e.g., 11GB for 1.2K documents with average length 74 tokens for an 8B-parameter model). PCED is best suited for relatively static, high-read-frequency corpora (Corallo et al., 13 Jan 2026).
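The storage figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes a hypothetical 8B-class architecture with grouped-query attention (32 layers, 8 KV heads, head dimension 128) and 2 bytes per FP16 value; these architecture numbers are assumptions for illustration and are not stated in the text above.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for num_tokens tokens.

    The leading factor of 2 accounts for storing both a key and a value
    vector per KV head at every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed configuration (hypothetical 8B-class model, FP16 cache).
total_tokens = 1200 * 74  # 1.2K documents, 74 tokens each on average
size = kv_cache_bytes(total_tokens, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")
```

Under these assumptions each token costs 131,072 bytes of cache, so 88,800 tokens come to roughly 11.6 GB, which is in the same range as the figure quoted above. A model with more layers or more KV heads would scale the total proportionally.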
7. Extensions, Related Approaches, and Future Prospects
PCED draws conceptual inspiration from product-of-experts models used for controlled generation (DExperts) (Liu et al., 2021), but specializes the paradigm to retrieval-augmented settings with dynamic, retrieval-weighted, and contrastively calibrated expert mixing.
Potential future directions include:
- End-to-end training to learn expert selection and dynamic weighting via a gating network, removing reliance on external retrieval scores.
- Hybrid attention-decoding architectures that support sparse cross-document attention for tractable scaling.
- Extending the framework to support parametric and non-parametric mixtures, and to integrate explicit chain-of-thought expert streams.
- Application to multi-attribute and hierarchical expert settings, as well as style transfer domains (Corallo et al., 13 Jan 2026, Liu et al., 2021).
In summary, PCED defines a robust, efficient decoding-time algorithm for multi-document retrieval-augmented generation, combining per-document experts via a retrieval-aware, contrastive fusion rule that maintains accuracy and dramatically improves inference scalability.