
Poly-PRAG: Latent Routing for Scalable LLM Adaptation

Updated 28 November 2025
  • Poly-PRAG is a latent routing encoding method that combines dynamic routing with parameter-efficient LoRA adapters to encode document representations.
  • It uses a many-to-few mapping strategy by employing a small, trainable pool of K latent experts and a routing function to significantly reduce storage and inference costs.
  • Extensive experiments on QA datasets demonstrate that Poly-PRAG achieves state-of-the-art F1 improvements while lowering the parameter footprint compared to traditional PRAG methods.

The latent routing encoding process, known as Poly-PRAG, is a framework for scalable parametric retrieval-augmented generation (RAG) that unifies innovations from both dynamic routing (as in capsule networks or sequence encoding) and parameter-efficient adaptation of LLMs via Low-Rank Adaptation (LoRA) modules. It addresses the scalability and efficiency limitations of previous retrieval-parametric systems by replacing the one-adapter-per-document regime with a small, trainable pool of latent experts, employing a routing function to encode and decode the entire document space with high parametric efficiency (Su et al., 21 Nov 2025).

1. Latent Routing in PRAG: Core Mechanism and Motivation

Poly-PRAG builds on the PRAG paradigm for LLMs, where external knowledge is injected via LoRA adapters directly into model weights. Traditional PRAG employs a one-to-one mapping: each document receives its own LoRA adapter. This approach is hindered by data scarcity (limited document-specific training examples) and prohibitive inference costs (requiring the constant loading and merging of distinct adapters per retrieval).

Poly-PRAG introduces a many-to-few mapping: instead of assigning each document its own adapter, it defines $K \ll |\mathcal{T}|$ latent experts $\{\Delta W_1,\ldots,\Delta W_K\}$ and a routing function that maps each document $t$ to a point on the simplex, $\alpha(t)\in\Delta^{K-1}$. Passage encoding is thus realized as a soft or sparse combination of these $K$ expert adapters, enabling order-of-magnitude reductions in both parameter count and inference cost.

2. Offline Encoding: Multi-Task Training and Adapter Construction

During the offline phase, Poly-PRAG treats each passage $t = 1, \ldots, |\mathcal{T}|$ as a unique task, learning:

  • A pool of $K$ LoRA expert adapters, each in factored form $\Delta W_i = A_i B_i^{\top}$ with $A_i, B_i \in \mathbb{R}^{d_{\text{out}}\times r}$.
  • Task-specific routing vectors $z_t\in\mathbb{R}^K$ for all passages, collected row-wise into a routing matrix $Z \in \mathbb{R}^{|\mathcal{T}| \times K}$.

For document $t$, the routing coefficients $\alpha(t)$ are computed from the logits $z_t$ via (Gumbel-)softmax:

$$\alpha_i(t) = \frac{\exp(z_{t,i}/\tau)}{\sum_{j=1}^{K}\exp(z_{t,j}/\tau)},$$

where $\tau$ controls the temperature. The document-specific LoRA perturbation is assembled as:

$$\Delta W(t) = \sum_{i=1}^{K} \alpha_i(t)\,\Delta W_i = \sum_{i=1}^{K} \alpha_i(t)\,A_i B_i^{\top}.$$

Training proceeds via multi-task optimization: for each document $t$ and each (input, target) pair from its synthetic QA-augmented dataset $\mathcal{D}_t$, gradients flow into both the adapters $\{A_i, B_i\}_{i=1}^{K}$ and the routing weights $z_t$ via the log-likelihood loss, optionally regularized with norm penalties on adapter parameters.
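To make the routing step concrete, here is a minimal pure-Python sketch of a temperature softmax over routing logits followed by a weighted sum of expert updates. It is an illustration under stated assumptions, not the authors' code: the function names, the toy 2x2 experts, and the logit values are hypothetical, and the Gumbel noise of the Gumbel-softmax variant is omitted.

```python
import math

def softmax_routing(logits, tau=1.0):
    """Temperature softmax: maps K routing logits to a point on the simplex."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def outer(a, b):
    """Outer product a b^T, i.e. a rank-1 factored update."""
    return [[x * y for y in b] for x in a]

def mix_experts(alpha, experts):
    """Weighted sum of expert updates: sum_i alpha_i * experts[i]."""
    rows, cols = len(experts[0]), len(experts[0][0])
    mixed = [[0.0] * cols for _ in range(rows)]
    for a_i, delta in zip(alpha, experts):
        for r in range(rows):
            for c in range(cols):
                mixed[r][c] += a_i * delta[r][c]
    return mixed

# Hypothetical toy setup: K = 3 rank-1 experts acting on a 2x2 weight matrix.
experts = [
    outer([1.0, 0.0], [1.0, 0.0]),
    outer([0.0, 1.0], [0.0, 1.0]),
    outer([1.0, 1.0], [1.0, 1.0]),
]
z_t = [2.0, 0.0, -1.0]                 # routing logits for one document (made up)
alpha = softmax_routing(z_t, tau=0.5)  # lower tau gives a sharper distribution
delta_w_t = mix_experts(alpha, experts)
```

Lowering `tau` concentrates the weight on the largest logit, which is how a soft mixture can be pushed toward a near-discrete expert choice.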

Training Pseudocode

[Pseudocode listing not recovered from the source page.] (Su et al., 21 Nov 2025)
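The following is a hedged sketch of the joint optimization described in this section, not the paper's listing. To stay dependency-free and short it makes two loud simplifications: each expert is a flat vector standing in for a flattened LoRA update, and a squared-error surrogate against a made-up per-document target replaces the log-likelihood loss on QA pairs. All names (`train`, `targets`, `history`) are hypothetical.

```python
import math
import random

def softmax(logits, tau=1.0):
    """Softmax of logits / tau, computed stably."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(targets, K=2, tau=1.0, lr=0.1, epochs=300, seed=0):
    """Jointly fit K shared experts and one routing-logit vector per document.

    Assumption: squared error to a per-document target vector stands in for
    the log-likelihood loss; experts are flat vectors, not factored A B^T.
    """
    rng = random.Random(seed)
    d = len(targets[0])
    experts = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(K)]
    logits = [[rng.gauss(0, 0.1) for _ in range(K)] for _ in targets]
    history = []
    for _ in range(epochs):
        total = 0.0
        for t, target in enumerate(targets):
            alpha = softmax(logits[t], tau)
            mixed = [sum(alpha[i] * experts[i][j] for i in range(K)) for j in range(d)]
            err = [mixed[j] - target[j] for j in range(d)]
            total += sum(e * e for e in err)
            # dL/dalpha_i = 2 * <err, expert_i>
            g_alpha = [2 * sum(err[j] * experts[i][j] for j in range(d)) for i in range(K)]
            # Chain rule through the softmax gives the gradient on the logits.
            dot = sum(a * g for a, g in zip(alpha, g_alpha))
            g_logits = [alpha[i] * (g_alpha[i] - dot) / tau for i in range(K)]
            # Gradients flow into both the shared experts and this document's logits.
            for i in range(K):
                for j in range(d):
                    experts[i][j] -= lr * 2 * alpha[i] * err[j]
                logits[t][i] -= lr * g_logits[i]
        history.append(total / len(targets))
    return experts, logits, history

# Toy corpus: six "documents" falling into two groups (made-up targets).
targets = [[1.0, 0.0, 0.0, 0.0]] * 3 + [[0.0, 0.0, 1.0, 0.0]] * 3
experts, logits, history = train(targets)
```

Only `K` expert vectors and a `len(targets) x K` table of logits are stored, which is the many-to-few structure the section describes; a real implementation would instead back-propagate the language-model loss through the merged LoRA weights.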

3. Online Inference: Efficient Routing and Query-Time Integration

At inference, Poly-PRAG avoids the repeated loading/unloading of myriad adapters:

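  1. Retrieve the top-$n$ relevant documents $\{t_1,\ldots,t_n\}$ for query $q$.
  2. For each document $t_j$, obtain routing logits $z_{t_j}$ (or precomputed coefficients $\alpha(t_j)$).
  3. Optionally refine routing based on the query representation $h_q$, giving query-conditioned coefficients $\tilde{\alpha}(t_j, q)$.
  4. Compute merged adapters per passage: $\Delta W(t_j) = \sum_{i=1}^{K} \alpha_i(t_j)\,\Delta W_i$.
  5. Aggregate across passages into a single update $\Delta W_q$.
  6. Update weights and decode: $W' = W + \Delta W_q$.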

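Two routing modes are supported: "soft" (a full mixture over all experts) and "hard" (keep the top-m experts and zero the others). The hard mode further reduces inference overhead because fewer experts need to be merged.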
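As an illustration of the hard mode, the sketch below keeps the `m` largest routing weights and zeroes the rest. The function name and the `renormalize` flag are hypothetical; whether the surviving weights are renormalized to sum to one is an assumption made here, since the text only says the others are zeroed.

```python
def top_m_routing(alpha, m, renormalize=True):
    """Hard routing: keep the m largest weights, set all others to zero."""
    order = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)
    keep = set(order[:m])
    sparse = [a if i in keep else 0.0 for i, a in enumerate(alpha)]
    if renormalize:
        # Assumption: rescale so the kept weights still lie on the simplex.
        total = sum(sparse)
        sparse = [a / total for a in sparse]
    return sparse

soft = [0.5, 0.3, 0.15, 0.05]    # made-up soft routing weights over K = 4 experts
hard = top_m_routing(soft, m=2)  # only two experts now contribute to the merge
```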

Online Routing and Generation

[Pseudocode listing not recovered from the source page.] (Su et al., 21 Nov 2025)
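A minimal sketch of the query-time path, under stated assumptions rather than as the paper's procedure: routing vectors are precomputed per document, the retriever result is a hard-coded list, and passages are aggregated by averaging (the aggregation rule is an assumption). The names `mix`, `answer_time_weights`, and the toy matrices are hypothetical.

```python
def mix(alpha, experts):
    """sum_i alpha_i * experts[i], with each expert given as a list of rows."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(a * e[r][c] for a, e in zip(alpha, experts)) for c in range(cols)]
        for r in range(rows)
    ]

def answer_time_weights(base, experts, routing, retrieved):
    """Build W' = W + Delta W_q from the retrieved passages.

    Assumption: passages are aggregated by averaging their routing vectors.
    Because mixing is linear, this equals averaging the per-passage updates.
    """
    K = len(experts)
    avg_alpha = [sum(routing[t][i] for t in retrieved) / len(retrieved) for i in range(K)]
    delta = mix(avg_alpha, experts)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(base, delta)]

# Toy data: a 2x2 base weight, K = 2 experts, three documents with stored routing.
base = [[1.0, 0.0], [0.0, 1.0]]
experts = [[[0.1, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.2]]]
routing = {"doc_a": [0.9, 0.1], "doc_b": [0.3, 0.7], "doc_c": [0.5, 0.5]}
retrieved = ["doc_a", "doc_b"]  # stand-in for the retriever's top-2 result
merged = answer_time_weights(base, experts, routing, retrieved)
```

The design point is that the `K` experts stay resident in memory and only small routing vectors change per query, so no per-document adapter has to be loaded at answer time.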

Efficiency Analysis

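Storage is reduced from $O(|\mathcal{T}| \cdot r \cdot d)$ (PRAG, one rank-$r$ adapter per document) to $O(K \cdot r \cdot d + |\mathcal{T}| \cdot K)$ (Poly-PRAG, $K$ shared adapters plus the routing matrix), where $d$ denotes the width of the adapted layer. At inference, Poly-PRAG keeps the $K$ experts resident and only mixes them with routing coefficients, rather than loading and merging a distinct adapter for each retrieved document.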
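The storage comparison can be checked with simple parameter counting. The sketch below counts parameters for a single adapted weight matrix, following the definitions earlier in this article (one factored LoRA pair per document for PRAG, versus a shared pool of experts plus a routing matrix for Poly-PRAG). The function names and all sizes in the example are hypothetical, chosen only to show the order of magnitude.

```python
def prag_params(num_docs, d_in, d_out, rank):
    """One rank-r LoRA pair (A: d_out x r, B: d_in x r) per document."""
    return num_docs * rank * (d_in + d_out)

def poly_prag_params(num_docs, d_in, d_out, rank, num_experts):
    """num_experts shared LoRA pairs plus a num_docs x num_experts routing matrix."""
    return num_experts * rank * (d_in + d_out) + num_docs * num_experts

# Made-up sizes: 10,000 passages, a 2048 x 2048 layer, rank 2, K = 16 experts.
one_to_one = prag_params(10_000, 2048, 2048, rank=2)
shared = poly_prag_params(10_000, 2048, 2048, rank=2, num_experts=16)
ratio = one_to_one / shared
```

With these illustrative numbers the per-document scheme stores 81,920,000 parameters for the layer against 291,072 for the shared pool, and the routing matrix, not the experts, becomes the larger of the two shared-pool terms.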

4. Key Experimental Findings and Ablations

Benchmarking across knowledge-intensive QA datasets (2WikiMultihopQA, HotpotQA, PopQA, ComplexWebQuestions) with LLaMA3 and Qwen2.5 models shows consistent F1 improvements over PRAG baselines and orders-of-magnitude reduction in storage:

| Base LLM | Method | Avg F1 (%) |
|---|---|---|
| LLaMA3.2-1B | Vanilla | 22.8 |
| | Standard RAG | 27.5 |
| | PRAG | 27.0 |
| | DyPRAG | 28.8 |
| | Poly-PRAG | 32.7 (+3.9) |
| Qwen2.5-1.5B | PRAG | 28.2 |
| | Poly-PRAG | 30.3 (+2.1) |
| LLaMA3-8B | PRAG | 41.6 |
| | Poly-PRAG | 42.7 (+1.1) |

Ablation studies indicate that increasing the number of latent experts $K$ improves F1 only up to a point, after which gains plateau or decrease slightly. In terms of storage, Poly-PRAG achieves comparable or superior F1 with a small fraction of PRAG's required offline parameter storage (roughly 1% to 4% in the configurations below):

  • PRAG rank-1: 1020 MB, F1=13.1
  • Poly-PRAG rank-1: 10 MB, F1=12.3
  • Poly-PRAG rank-4: 42 MB, F1=15.1

(Su et al., 21 Nov 2025)
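The storage shares implied by the three reported configurations can be worked out directly. The snippet below uses only the MB and F1 figures listed above; the helper name `storage_share` is hypothetical.

```python
def storage_share(size_mb, baseline_mb):
    """Storage as a percentage of the baseline's storage."""
    return 100.0 * size_mb / baseline_mb

# (storage in MB, F1) as listed above.
reported = {
    "PRAG rank-1": (1020, 13.1),
    "Poly-PRAG rank-1": (10, 12.3),
    "Poly-PRAG rank-4": (42, 15.1),
}
baseline_mb, baseline_f1 = reported["PRAG rank-1"]
shares = {name: storage_share(mb, baseline_mb) for name, (mb, _) in reported.items()}
f1_deltas = {name: f1 - baseline_f1 for name, (_, f1) in reported.items()}
```

On these figures, rank-1 Poly-PRAG uses about 0.98% of PRAG's storage at 0.8 F1 lower, while rank-4 Poly-PRAG uses about 4.12% at 2.0 F1 higher.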

5. Connections: Dynamic Routing, Capsule Networks, and Sequence Aggregation

The latent routing process in Poly-PRAG is conceptually related to dynamic routing mechanisms developed for capsule networks and advanced sequence encoding (Gong et al., 2018). In the latter, sequence elements encoded (via BiLSTM or CNN) as vectors $h_i$ are aggregated into a small set of higher-level "capsules" $v_j$ via an iterative routing-by-agreement process, parameterized by routing logits $b_{ij}$ and coupling coefficients $c_{ij}$. The core steps:

  1. Compute coupling coefficients $c_{ij}$ via softmax over $b_{ij}$.
  2. Each $h_i$ sends a linearly transformed "vote" $W_j h_i$ to capsule $v_j$.
  3. Capsules aggregate these messages and apply a nonlinear squash; routing logits are then updated via agreement.
  4. After a fixed number of iterations, the output capsules are concatenated for downstream use.
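The routing-by-agreement loop in the steps above can be sketched as follows. This follows the commonly used capsule-network formulation (softmax over output capsules for each input, squash nonlinearity, dot-product agreement) and is an assumption about the details rather than a reproduction of Gong et al.'s exact variant; the function names and toy inputs are hypothetical.

```python
import math

def squash(v):
    """Capsule nonlinearity: keeps direction, maps the norm into [0, 1)."""
    sq = sum(x * x for x in v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def dynamic_routing(inputs, transforms, iterations=3):
    """Route n input vectors into len(transforms) output capsules."""
    n, J = len(inputs), len(transforms)
    # votes[i][j] is the linearly transformed message from input i to capsule j.
    votes = [[matvec(transforms[j], h) for j in range(J)] for h in inputs]
    b = [[0.0] * J for _ in range(n)]  # routing logits, initialised to zero
    for _ in range(iterations):
        c = [softmax(b_i) for b_i in b]  # coupling coefficients per input
        capsules = []
        for j in range(J):
            dim = len(votes[0][j])
            s = [sum(c[i][j] * votes[i][j][k] for i in range(n)) for k in range(dim)]
            capsules.append(squash(s))
        # Agreement update: raise the logit where a vote aligns with the capsule.
        for i in range(n):
            for j in range(J):
                b[i][j] += sum(v * u for v, u in zip(capsules[j], votes[i][j]))
    return capsules, c

# Toy example: three 2-d inputs routed into two capsules.
inputs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
transforms = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
capsules, coupling = dynamic_routing(inputs, transforms, iterations=3)
```

The contrast with Poly-PRAG is visible in the loop: here the coupling coefficients are recomputed iteratively from agreement, whereas Poly-PRAG learns its routing logits by gradient descent and applies a single weighted sum.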

Analogously, Poly-PRAG routes passage-specific representations into K shared LoRA experts using a learned mixing function, performing a simplex-weighted sum rather than iterative agreement-maximization. Both frameworks replace static or naively pooled representations with a flexible, trainable routing process that enables parameter-sharing and semantic compositionality (Gong et al., 2018, Su et al., 21 Nov 2025).

6. Limitations and Future Directions

  • Corpus Dependence: Poly-PRAG requires enumerating and training over a fixed passage set. Introduction of new passages necessitates re-training of both routing weights and potentially the expert pool. Development of open-vocabulary or zero-shot capable routing remains open.
  • Lack of Semantic Interpretability: Latent expert adapters are not semantically labeled or constrained. Interpretability and topic-disentanglement in expert formation are avenues for future research.
  • Efficiency–Accuracy Tradeoff: Increasing $K$ yields diminishing F1 returns past a certain point. There exists an operational tradeoff between storage efficiency and model expressivity; this boundary is dataset-dependent.

A plausible implication is that further extensions could incorporate query-adaptive routing mechanisms or jointly-learned topic-aware expert decompositions, enhancing both flexibility and interpretability (Su et al., 21 Nov 2025).

7. Summary and Impact

Poly-PRAG's latent routing encoding compresses the representation of a vast document corpus into a compact set of shared parameter-efficient experts, with a learned routing map specifying document-to-expert allocation. This approach yields 100-fold reductions in storage and substantial improvements in both training and inference speed compared to one-to-one adapter methods, while achieving state-of-the-art results on multiple knowledge-intensive reasoning tasks (Su et al., 21 Nov 2025). Its theoretical underpinnings and design share key principles with dynamic routing approaches for sequence aggregation, highlighting a deepening unification between adaptive, agreement-driven representation learning and scalable, low-rank adaptation strategies for LLMs (Gong et al., 2018).
