Parallel Context-of-Experts Decoding (PCED)

Updated 17 January 2026
  • The paper introduces PCED, a decoding framework that aggregates evidence from parallel expert caches using a retrieval-aware contrastive fusion rule.
  • PCED achieves efficient multi-document reasoning with up to 180× latency improvements and competitive performance on benchmarks like NQ and HotpotQA.
  • The method bypasses full-context attention bottlenecks but relies on full logits access and high-quality retrieval scores to operate effectively.

Parallel Context-of-Experts Decoding (PCED) is a framework developed for efficient and effective retrieval-augmented text generation (RAG). It establishes a method for aggregating evidence across multiple retrieved documents without the computational or architectural overhead of creating a joint long-context attention map. PCED operates at decoding time through an ensemble of independent "expert" contexts, with next-token distributions synchronized using a retrieval-aware contrastive objective. This design achieves strong multi-document reasoning capabilities and substantial latency improvements over standard long-context approaches and prior ensemble methods (Corallo et al., 13 Jan 2026, Liu et al., 2021).

1. Motivation and Problem Formulation

In conventional retrieval-augmented generation, there is a fundamental trade-off between supporting multi-document reasoning and achieving inference efficiency. Concatenating all retrieved documents into a single prompt enables the model's self-attention to freely integrate information ("multi-hop") across documents, which is essential for difficult QA, few-shot learning, and reasoning tasks. However, self-attention over a very long context ($L_{\mathrm{total}} \approx \sum_{k=1}^K L_k + L_{\mathrm{query}}$) incurs $O(L_{\mathrm{total}}^2)$ time and $O(L_{\mathrm{total}} \cdot d)$ memory during the prefill step, presenting major latency bottlenecks as the number of documents ($K$) or their length increases. Additionally, models struggle with "lost-in-the-middle" effects, where crucial information is buried in very long contexts.

An alternative approach encodes each retrieved document into a separate key/value (KV) cache, which can be precomputed offline. This allows rapid time-to-first-token (TTFT) at inference, as the model only needs to process the query and fetch relevant KV caches. However, cross-document attention is never constructed, breaking the model’s ability to aggregate information across documents and severely limiting multi-document and multi-hop capabilities (Corallo et al., 13 Jan 2026).
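
As a rough illustration of the trade-off, assume (hypothetically) that all $K$ documents have the same length $L$ and that the query length is negligible. Attention cost over the concatenated prompt then scales with the square of the total length, whereas encoding each document in isolation scales with the sum of the per-document squares:

$$
\underbrace{(K L)^2}_{\text{joint prefill}} = K^2 L^2
\qquad \text{vs.} \qquad
\underbrace{K \cdot L^2}_{\text{per-document prefill}},
\qquad
\frac{K^2 L^2}{K L^2} = K .
$$

Under these simplifying assumptions, isolated encoding needs a factor of $K$ less attention work, before counting the benefit of doing it offline.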

2. Framework Overview

PCED is a training-free decoding algorithm designed to recover multi-document reasoning without forming a global attention map. The method applies two key ideas:

  • Each retrieved document is treated as an isolated "expert," using its own frozen KV-cache.
  • Aggregation of information is performed during decoding by synchronizing the independent next-token proposals from all experts via a retrieval-aware, contrastive fusion rule.

High-Level Pipeline

  1. Datastore Construction: Offline, for each document $d_i$ in the corpus, an embedding $e_i$ and precomputed KV-cache $K_i$ are created and stored.
  2. Retrieval: At inference, the top-$N$ documents $d_1, \dots, d_N$ are retrieved with normalized relevance scores $r_1, \dots, r_N \in (0,1)$.
  3. Expert Initialization: $N+1$ parallel experts are instantiated:
    • Expert 0 ("amateur"): No context cache, representing the model prior ($K_0 = \emptyset$).
    • Experts $k = 1, \dots, N$ ("contextual experts"): Each with its own $K_k$ cache.
  4. Autoregressive Decoding: At every timestep $t$, each expert proposes a next-token distribution. These distributions are combined using the contrastive decoding rule to select the token for generation, and the same token is appended to all experts’ histories. By synchronizing histories, generated tokens implicitly merge evidence across documents over time (Corallo et al., 13 Jan 2026).

3. Retrieval-Aware Contrastive Decoding Rule

The core of PCED is a decoding-time logit fusion mechanism, parameterized as follows:

  • Let the vocabulary be $V$, with $|V|$ tokens.
  • Each expert $k$ at time $t$ produces logits $s_k^t \in \mathbb{R}^{|V|}$; the "amateur" gives $s_0^t$.
  • Each expert $k$ has a normalized retrieval score $r_k \in (0,1)$.
  • $\beta_0 \geq 0$ controls the strength of the contrastive correction; $\gamma \geq 0$ gates based on retrieval score.

For each expert $k = 1, \dots, N$, the fused logits are:

$$\hat{s}_k^t = (1 + \beta_0)\, s_k^t - \beta_0\, s_0^t + \gamma \log r_k = s_k^t + \beta_0 (s_k^t - s_0^t) + \gamma \log r_k$$

  • $s_k^t - s_0^t$ is the contrastive term, up-ranking tokens favored by the contextual expert over the prior.
  • $\gamma \log r_k$ injects retrieval strength as a prior, suppressing irrelevant experts.
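
The fusion rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `fuse_logits`, the array layout (one row per contextual expert), and the default values are assumptions chosen for the example.

```python
import numpy as np


def fuse_logits(expert_logits, amateur_logits, retrieval_scores,
                beta0=1.0, gamma=2.5):
    """Retrieval-aware contrastive fusion (illustrative sketch).

    expert_logits:    shape (N, V), logits s_k^t of the N contextual experts
    amateur_logits:   shape (V,),   logits s_0^t of the context-free amateur
    retrieval_scores: shape (N,),   normalized scores r_k in (0, 1)
    """
    s_k = np.asarray(expert_logits, dtype=np.float64)
    s_0 = np.asarray(amateur_logits, dtype=np.float64)
    r = np.asarray(retrieval_scores, dtype=np.float64)

    contrast = s_k - s_0[None, :]        # s_k^t - s_0^t, per expert
    prior = gamma * np.log(r)[:, None]   # gamma * log r_k, one value per expert
    return s_k + beta0 * contrast + prior


# Toy usage: two experts, three-token vocabulary.
fused = fuse_logits(
    expert_logits=[[2.0, 0.0, 1.0], [0.0, 1.0, 0.5]],
    amateur_logits=[1.0, 0.0, 1.0],
    retrieval_scores=[0.5, 0.25],
)
print(fused.shape)  # (2, 3)
```

Because $\log r_k < 0$ for $r_k \in (0,1)$, the prior term only ever lowers an expert's logits, and it lowers them more for weakly retrieved documents.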

Token selection is performed by taking the maximum over all experts:

$$y_t = \arg\max_{v \in V} \; \max_{k = 1, \dots, N} \hat{s}_k^t(v)$$
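
A small sketch of this max-over-experts selection, assuming the fused logits are already available as an array with one row per expert. The helper name `select_token` is hypothetical, and returning the winning expert's row index is an addition for illustration only.

```python
import numpy as np


def select_token(fused_logits):
    """Pick the token whose best fused logit across experts is largest.

    fused_logits: shape (N, V), fused logits of the N contextual experts.
    Returns (token_id, expert_row), where expert_row is the 0-based row
    of the expert that produced the winning score.
    """
    fused = np.asarray(fused_logits, dtype=np.float64)
    per_token_best = fused.max(axis=0)           # S(v) = max_k fused[k, v]
    token_id = int(per_token_best.argmax())      # y_t = argmax_v S(v)
    expert_row = int(fused[:, token_id].argmax())
    return token_id, expert_row


print(select_token([[0.1, 2.0, -1.0],
                    [1.5, 0.3, 0.2]]))  # (1, 0)
```

Taking a maximum rather than an average means a single confident, well-retrieved expert can determine the token even when the other experts are uninformative.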

At $t=1$, $\beta_0$ is adaptively set via Jensen–Shannon divergence between each $s_k^1$ and $s_0^1$, then fixed. Empirical values of $\beta_0 \approx 0.5$–$1.0$ and $\gamma \approx 2.5$ are robust across models and tasks (Corallo et al., 13 Jan 2026).
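
For reference, the Jensen–Shannon divergence between two next-token distributions can be computed as below. This sketch only evaluates the divergence; how it is mapped to a value of $\beta_0$ is not specified in this summary, so no such mapping is shown. The function names and the `eps` smoothing constant are assumptions.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=np.float64)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def js_divergence(logits_p, logits_q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between softmax(logits_p)
    and softmax(logits_q). Symmetric and bounded above by log(2)."""
    p = softmax(logits_p)
    q = softmax(logits_q)
    m = 0.5 * (p + q)                      # mixture distribution
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)))
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)


# Identical logits give zero divergence; disjoint ones approach log(2).
print(js_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

A large divergence between an expert and the amateur indicates that the document changes the model's prediction substantially, which is the signal the adaptive setting of $\beta_0$ relies on.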

4. Algorithmic Steps and Complexity

The autoregressive PCED decoding process is structured as:

  1. Input: Query $q$, frozen expert caches $K_1, \dots, K_N$, relevance scores $\{r_k\}$.
  2. Initialization: Shared generation history $H = \langle\mathrm{system\_prompt}\rangle + q$. Compute amateur logits $s_0^1$.
  3. Decoding Loop (for $t = 1, \dots, T$):
    • If $t=1$, compute $\beta_0$ from Jensen–Shannon divergence between $s_k^1$ and $s_0^1$.
    • For each expert $k = 1, \dots, N$ in parallel: compute $s_k^t = M.\mathrm{forward}(K_k, H)$.
    • Compute fused logits $\hat{s}_k^t$.
    • For each token $v \in V$, take $S(v) = \max_{k} \hat{s}_k^t(v)$.
    • Select and output $y_t = \arg\max_v S(v)$. Append $y_t$ to $H$ and all expert caches.
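
The loop above can be exercised end to end with a toy stand-in for the model. Everything here is an illustrative assumption: `toy_forward` replaces $M.\mathrm{forward}$ with a lookup table, a "cache" is a plain dict, $\beta_0$ is passed in as a fixed value instead of being derived from the Jensen–Shannon divergence, and the full history is re-read each step rather than appended to a real KV cache.

```python
import numpy as np

VOCAB_SIZE = 5  # toy vocabulary; token ids 0..4


def toy_forward(cache, history):
    """Hypothetical stand-in for M.forward(K_k, H).

    A "cache" is a dict mapping the last token to the token this document
    supports next; None plays the role of the empty amateur cache.
    """
    logits = np.zeros(VOCAB_SIZE)
    if cache is not None and history[-1] in cache:
        logits[cache[history[-1]]] = 5.0
    return logits


def pced_decode(forward, expert_caches, retrieval_scores, history,
                max_new_tokens, beta0=1.0, gamma=2.5):
    """Greedy PCED-style decoding with a fixed beta0 (an assumption)."""
    history = list(history)  # shared generation history H
    log_r = np.log(np.asarray(retrieval_scores, dtype=np.float64))
    generated = []
    for _ in range(max_new_tokens):
        s_0 = forward(None, history)                        # amateur logits
        s_k = np.stack([forward(c, history) for c in expert_caches])
        fused = s_k + beta0 * (s_k - s_0) + gamma * log_r[:, None]
        y_t = int(fused.max(axis=0).argmax())               # max over experts
        generated.append(y_t)
        history.append(y_t)                                 # synchronize all experts
    return generated


# Each toy document knows only half of the chain 0 -> 1 -> 2 -> 3 -> 4.
doc_a = {0: 1, 1: 2}
doc_b = {2: 3, 3: 4}
print(pced_decode(toy_forward, [doc_a, doc_b], [0.9, 0.8], [0], 4))
# prints [1, 2, 3, 4]
```

Neither toy document can produce the full sequence alone. Because the chosen token is appended to the shared history at every step, the second document takes over once the first has nothing more to contribute, which mirrors the cross-document "stitching" behaviour described for PCED.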

Complexity Comparison:

| Approach | Per-token complexity | Prefill/TTFT |
| --- | --- | --- |
| Full-context | $O(L_{\text{total}}^2) + O(L_{\text{total}}\, d\, \lvert V \rvert)$ | Extremely high (long context) |
| KV-cache only | $O(N d^2)$ | Very low |
| PCED | $O((N+1) d^2) + O(N \lvert V \rvert)$ | Very low |

Because $N$ is moderate (8 to 32), PCED's online cost is competitive for large-scale RAG (Corallo et al., 13 Jan 2026).

5. Comparative Evaluation and Empirical Performance

PCED surpasses previous parallel and agentic approaches, and rivals or exceeds full-context attention in various RAG benchmarks:

  • LOFT RAG (Mistral-13B): PCED-Dense achieves 81 EM on NQ versus 76 for full-context, and 80 versus 38 for single-document configurations.
  • HotpotQA: PCED provides +9–19 EM over standard KV-merge (APE) and +8 over MapReduce, closely matching full-prompt performance.
  • LongBench (Qwen3-8B): Improves multi-doc QA by 5–8 absolute points over full context; code completion improves by +9 points.
  • Latency: Time-to-first-token (TTFT) is reduced by up to 180× (0.14s for PCED vs. 25.5s for full prompt at K=90); end-to-end latency for 65k tokens and 512 output tokens is 1.7× faster.
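
The headline speedup follows directly from the two TTFT figures quoted above:

$$
\frac{25.5\ \mathrm{s}}{0.14\ \mathrm{s}} \approx 182,
$$

which is consistent with the reported "up to 180×" after rounding.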

PCED's decode-time logit fusion enables multi-document "stitching" without joint attention, offering rapid latency and strong accuracy even as context size scales (Corallo et al., 13 Jan 2026).

6. Limitations and Operational Constraints

Three primary limitations are identified:

  1. Access to Full Logits: PCED requires raw logits from each expert at every decoding step, excluding deployment via closed APIs or where only token samples or partial probabilities are available.
  2. Dependence on Retrieval Quality: If the most relevant document is not retrieved ($r_k \approx 0$), PCED cannot recover its evidence. The retrieval-aware prior can suppress distractors, but cannot hallucinate absent information.
  3. FP16 Storage Overhead: Storing KV caches at scale can be expensive (e.g., 11GB for 1.2K documents with average length 74 tokens for an 8B-parameter model). PCED is best suited for relatively static, high-read-frequency corpora (Corallo et al., 13 Jan 2026).
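
The storage figure in the third limitation can be sanity-checked with a back-of-the-envelope estimate. The model configuration below (32 layers, 8 key/value heads, head dimension 128) is an assumption typical of some 8B-parameter models with grouped-query attention; the source does not state which architecture the 11GB figure refers to, and "1.2K documents" is taken as exactly 1,200.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to FP16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed (hypothetical) 8B-class configuration, not taken from the paper.
total = kv_cache_bytes(
    num_tokens=1200 * 74,  # 1.2K documents, 74 tokens each on average
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
print(f"{total / 1e9:.1f} GB")  # prints 11.6 GB
```

Under these assumptions each cached token costs 128 KiB, so the estimate lands close to the quoted 11GB. The cost grows linearly with corpus size, which is why the approach favors static, frequently read corpora.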

PCED draws conceptual inspiration from product-of-experts models used for controlled generation (DExperts) (Liu et al., 2021), but specializes the paradigm to retrieval-augmented settings with dynamic, retrieval-weighted, and contrastively calibrated expert mixing.

Potential future directions include:

  • End-to-end training to learn expert selection and dynamic weighting via a gating network, removing reliance on external retrieval scores.
  • Hybrid attention-decoding architectures that support sparse cross-document attention for tractable scaling.
  • Extending the framework to support parametric and non-parametric mixtures, and to integrate explicit chain-of-thought expert streams.
  • Application to multi-attribute and hierarchical expert settings, as well as style transfer domains (Corallo et al., 13 Jan 2026, Liu et al., 2021).

In summary, PCED defines a robust, efficient decoding-time algorithm for multi-document retrieval-augmented generation, combining per-document experts via a retrieval-aware, contrastive fusion rule that maintains accuracy and dramatically improves inference scalability.
