Poly-PRAG: Latent Routing for Scalable LLM Adaptation
- Poly-PRAG is a latent routing encoding method that combines dynamic routing with parameter-efficient LoRA adapters to encode document representations.
- It uses a many-to-few mapping strategy by employing a small, trainable pool of K latent experts and a routing function to significantly reduce storage and inference costs.
- Extensive experiments on QA datasets demonstrate that Poly-PRAG achieves state-of-the-art F1 improvements while lowering the parameter footprint compared to traditional PRAG methods.
Latent routing encoding, realized in Poly-PRAG, is a framework for scalable parametric retrieval-augmented generation (RAG) that unifies innovations from dynamic routing (as in capsule networks and sequence encoding) and parameter-efficient adaptation of LLMs via Low-Rank Adaptation (LoRA) modules. It addresses the scalability and efficiency limitations of previous parametric RAG systems by replacing the one-adapter-per-document regime with a small, trainable pool of latent experts and a routing function that together encode and decode the entire document space with high parametric efficiency (Su et al., 21 Nov 2025).
1. Latent Routing in PRAG: Core Mechanism and Motivation
Poly-PRAG builds on the PRAG paradigm for LLMs, where external knowledge is injected via LoRA adapters directly into model weights. Traditional PRAG employs a one-to-one mapping: each document receives its own LoRA adapter. This approach is hindered by data scarcity (limited document-specific training examples) and prohibitive inference costs (requiring the constant loading and merging of distinct adapters per retrieval).
Poly-PRAG introduces a many-to-few mapping: instead of assigning each document its own adapter, it defines K latent expert adapters and a routing function r: 𝒯 → Δ^{K−1} that maps each document to a point on the (K−1)-simplex. Passage encoding is thus realized as the soft or sparse combination of these expert adapters, enabling order-of-magnitude reductions in both parameter count and inference cost.
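As a minimal sketch of this many-to-few mapping (toy dimensions and random values, not the paper's configuration): the routing logits for a document are softmax-normalized onto the simplex and used to mix the shared expert factors.

```python
import torch

K, d, r = 4, 8, 2                              # experts, hidden size, LoRA rank (toy values)
experts = [(torch.randn(d, r), torch.randn(d, r)) for _ in range(K)]   # (A_i, B_i) pairs

z_t = torch.randn(K)                           # routing logits for one document t
alpha = torch.softmax(z_t, dim=0)              # α(t) on the simplex: non-negative, sums to 1

# ΔW_t = (Σ α_i A_i)(Σ α_i B_i)^T: one small update mixed from the shared expert pool
A_t = sum(a * Ai for a, (Ai, _) in zip(alpha, experts))
B_t = sum(a * Bi for a, (_, Bi) in zip(alpha, experts))
delta_W_t = A_t @ B_t.T
print(alpha.sum().item(), delta_W_t.shape)     # ~1.0, torch.Size([8, 8])
```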
2. Offline Encoding: Multi-Task Training and Adapter Construction
During the offline phase, Poly-PRAG treats each passage as a unique task, learning:
- A pool of K LoRA expert adapters {(A_i, B_i)}_{i=1…K}, each stored in factored form ΔW_i = A_i B_i^⊺ with rank r ≪ d.
- Task-specific routing vectors z_t ∈ ℝ^K for all passages t ∈ 𝒯, collected row-wise into a routing matrix Z ∈ ℝ^{|𝒯|×K}.
For document t, the routing coefficients are computed via (Gumbel-)softmax, α(t) = softmax(z_t / τ), where τ controls the temperature. The document-specific LoRA perturbation is assembled as ΔW_t = A^t (B^t)^⊺ with A^t = ∑_{i=1}^K α_i(t) A_i and B^t = ∑_{i=1}^K α_i(t) B_i. Training proceeds via multi-task optimization: for each document t and each (input, target) pair (x, y) from its synthetic QA-augmented dataset 𝒟_t, gradients flow into both the expert adapters and the routing weights via the log-likelihood loss ∑_{(x,y)∈𝒟_t} −log P_{W₀+ΔW_t}(y|x), optionally regularized with penalties on adapter parameters.
Training Pseudocode
```
Input: Corpus passages {t=1…|𝓣|}, synthetic QA sets 𝒟_t
Initialize: {A_i,B_i}_i=1…K, Z∈ℝ^{|𝓣|×K} (routing logits)
For epoch = 1…N_epochs:
For t in 1…|𝓣|:
Sample minibatch M_t ⊂ 𝒟_t
z_t ← Z[t,:]
α(t) ← softmax(z_t / τ)
A^t ← ∑_{i=1}^K α_i(t) A_i; B^t ← ∑_{i=1}^K α_i(t) B_i
ΔW_t ← A^t(B^t)^⊺
Loss L = ∑_{(x,y)∈M_t} −log P_{W₀+ΔW_t}(y|x)
Backprop L (w.r.t. {A_i,B_i}, Z[t,:])
Return {A_i,B_i} and Z
```
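For concreteness, the following is a minimal PyTorch-style sketch of the training loop above applied to a single stand-in weight matrix; the dimensions, the MSE stand-in for the log-likelihood loss, and the random data are assumptions for illustration, not the reference implementation.

```python
import torch

# Toy dimensions (assumptions for illustration, not the paper's settings).
d, r, K, num_docs = 64, 4, 8, 100
tau = 1.0

# Shared pool of K LoRA experts in factored form, ΔW_i = A_i B_i^T.
A = torch.nn.Parameter(0.01 * torch.randn(K, d, r))
B = torch.nn.Parameter(0.01 * torch.randn(K, d, r))
# Routing logits Z, one row per document (task).
Z = torch.nn.Parameter(torch.zeros(num_docs, K))

W0 = torch.randn(d, d)                     # frozen base weight (stand-in for one LLM layer)
opt = torch.optim.Adam([A, B, Z], lr=1e-3)

def delta_w(doc_id: int) -> torch.Tensor:
    """Assemble the document-specific low-rank update ΔW_t from the shared experts."""
    alpha = torch.softmax(Z[doc_id] / tau, dim=-1)    # α(t) on the simplex
    A_t = torch.einsum("k,kdr->dr", alpha, A)         # A^t = Σ_i α_i(t) A_i
    B_t = torch.einsum("k,kdr->dr", alpha, B)         # B^t = Σ_i α_i(t) B_i
    return A_t @ B_t.T                                # ΔW_t = A^t (B^t)^T

# One multi-task pass: each document contributes a toy (x, y) batch standing in for 𝒟_t.
for doc_id in range(num_docs):
    x, y = torch.randn(16, d), torch.randn(16, d)     # stand-ins for encoded QA pairs
    W = W0 + delta_w(doc_id)
    loss = torch.nn.functional.mse_loss(x @ W, y)     # stand-in for −log P_{W0+ΔW_t}(y|x)
    opt.zero_grad()
    loss.backward()                                   # gradients reach A, B and Z[doc_id]
    opt.step()
```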
3. Online Inference: Efficient Routing and Query-Time Integration
At inference, Poly-PRAG avoids the repeated loading/unloading of myriad adapters:
- Retrieve the top-c relevant passages {t₁, …, t_c} for query q.
- For each retrieved passage t_j, obtain its routing logits z_{t_j} (or the precomputed α(t_j)).
- Optionally refine the routing using a representation of the query.
- Compute the merged adapters per passage: A^{t_j} = ∑_{i=1}^K α_i(t_j) A_i, B^{t_j} = ∑_{i=1}^K α_i(t_j) B_i, and ΔW^{t_j} = A^{t_j}(B^{t_j})^⊺.
- Aggregate across passages: ΔW_total = ∑_{j=1}^c ΔW^{t_j}.
- Update the weights and decode: W = W₀ + ΔW_total, y* = argmax_y P_W(y | q).
Two routing modes are supported: "soft" (a full mixture over all K experts) and "hard" (keep only the top-m routing coefficients and zero the others), the latter further reducing inference overhead; a minimal sketch of the hard mode follows.
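The sketch below illustrates the hard (top-m) routing mode under the same toy setup as above; renormalizing the surviving coefficients back onto the simplex is an implementation choice here, not something prescribed by the source.

```python
import torch

def hard_route(z_t: torch.Tensor, m: int, tau: float = 1.0) -> torch.Tensor:
    """Keep only the top-m routing coefficients and renormalize them."""
    alpha = torch.softmax(z_t / tau, dim=-1)
    topm = torch.topk(alpha, m)
    sparse = torch.zeros_like(alpha)
    sparse[topm.indices] = topm.values
    return sparse / sparse.sum()          # stays on the simplex

# Example: route a document to its 2 most relevant experts out of K = 8.
z = torch.randn(8)
print(hard_route(z, m=2))
```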
Online Routing and Generation
```
Input: query q
Retrieve top-c passages {t₁,…,t_c}
For each t_j:
get z_{t_j} from Z
α(t_j) ← softmax(z_{t_j}/τ)
A^{t_j} ← ∑_{i=1}^K α_i(t_j) A_i; B^{t_j} ← ∑_{i=1}^K α_i(t_j) B_i
ΔW^{t_j} ← A^{t_j}(B^{t_j})^⊺
ΔW_total ← ∑_{j=1}^c ΔW^{t_j}
W ← W₀ + ΔW_total
Generate answer y* = argmax_y P_W(y | q)
```
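Continuing the toy setup from the training sketch above, a sketch of the query-time merge; the retrieved passage ids and the decoding step are placeholders for a real retriever and base LLM.

```python
def merge_for_query(doc_ids):
    """Aggregate the per-passage low-rank updates for the retrieved passages (toy setup above)."""
    dW_total = sum(delta_w(t) for t in doc_ids)   # ΔW_total = Σ_j ΔW^{t_j}
    return W0 + dW_total                          # W = W0 + ΔW_total

retrieved = [3, 17, 42]            # stand-in for the ids returned by a top-c retriever
W_query = merge_for_query(retrieved)
# Decoding P_W(y | q) would then use W_query in place of the frozen layer weight W0.
```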
Efficiency Analysis
Storage is reduced from on the order of |𝒯| rank-r adapters under PRAG (one per document) to K shared experts plus a |𝒯|×K routing matrix under Poly-PRAG, with K ≪ |𝒯|. At inference, the merged update for a query is assembled directly from the small expert pool already resident in memory, so compute scales with K and the number of retrieved passages c rather than requiring c distinct per-document adapters to be loaded and merged.
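For intuition, a back-of-the-envelope storage comparison; the corpus size, hidden width, rank, expert count, and layer count below are all assumed for illustration.

```python
# Illustrative storage comparison between PRAG and Poly-PRAG (all settings assumed).
num_docs = 100_000        # |𝒯|
d, r, K = 4096, 4, 16     # hidden size, LoRA rank, number of latent experts
layers = 32               # number of adapted layers

params_per_adapter = 2 * d * r * layers            # A and B factors across adapted layers
prag = num_docs * params_per_adapter               # one adapter per document
poly = K * params_per_adapter + num_docs * K       # K shared experts + routing matrix Z

print(f"PRAG:      {prag / 1e9:.1f} billion parameters")
print(f"Poly-PRAG: {poly / 1e6:.1f} million parameters (~{prag / poly:,.0f}x smaller)")
```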
4. Key Experimental Findings and Ablations
Extensive benchmarking across knowledge-intensive QA datasets (2WikiMultihopQA, HotpotQA, PopQA, ComplexWebQuestions) with LLaMa3 and Qwen2.5 LLMs shows significant F1 improvements and an orders-of-magnitude reduction in storage:
| Base LLM | Method | Avg F1 (%) |
|---|---|---|
| LLaMa3.2-1B | Vanilla | 22.8 |
| | Standard RAG | 27.5 |
| | PRAG | 27.0 |
| | DyPRAG | 28.8 |
| | Poly-PRAG | 32.7 (+3.9) |
| Qwen2.5-1.5B | PRAG | 28.2 |
| | Poly-PRAG | 30.3 (+2.1) |
| LLaMa3-8B | PRAG | 41.6 |
| | Poly-PRAG | 42.7 (+1.1) |
Ablation studies indicate that increasing the number of latent experts K improves F1 up to a moderate pool size, after which gains plateau or decrease slightly. In terms of storage, Poly-PRAG achieves comparable or superior F1 with roughly 1–4% of PRAG's required offline parameter storage (the implied compression ratios are computed after the list below):
- PRAG rank-1: 1020 MB, F1=13.1
- Poly-PRAG rank-1: 10 MB, F1=12.3
- Poly-PRAG rank-4: 42 MB, F1=15.1
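The compression ratios implied by the figures above follow from simple arithmetic (rank-4 Poly-PRAG is compared against rank-1 PRAG, as in the list):

```python
# Compression ratios implied by the reported offline storage figures.
prag_rank1_mb = 1020
poly_rank1_mb, poly_rank4_mb = 10, 42

print(f"rank-1: {prag_rank1_mb / poly_rank1_mb:.0f}x smaller "
      f"({poly_rank1_mb / prag_rank1_mb:.1%} of PRAG rank-1 storage)")
print(f"rank-4: {prag_rank1_mb / poly_rank4_mb:.0f}x smaller "
      f"({poly_rank4_mb / prag_rank1_mb:.1%} of PRAG rank-1 storage)")
```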
5. Connections: Dynamic Routing, Capsule Networks, and Sequence Aggregation
The latent routing process in Poly-PRAG is conceptually related to dynamic routing mechanisms developed for capsule networks and advanced sequence encoding (Gong et al., 2018). In the latter, sequence elements, encoded into vectors u_i (via BiLSTM or CNN), are aggregated into a small set of higher-level "capsules" v_j via an iterative routing-by-agreement process, parameterized by routing logits b_{ij} and coupling coefficients c_{ij}. The core steps (a code sketch follows the list):
- Compute coupling coefficients c_{ij} via a softmax over the routing logits b_{ij}.
- Each input element u_i sends a linearly transformed "vote" û_{j|i} = W_{ij} u_i to capsule j.
- Capsules aggregate these messages as s_j = ∑_i c_{ij} û_{j|i}, apply a nonlinear squash v_j = squash(s_j), and the routing logits are updated via the agreement b_{ij} ← b_{ij} + û_{j|i} · v_j.
- After a fixed number of routing iterations, the output capsules are concatenated for downstream use.
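A compact sketch of routing-by-agreement in this style, with illustrative dimensions and iteration count:

```python
import torch

def squash(s: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Nonlinear squashing: shrinks short vectors toward 0 and long vectors toward unit length."""
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / (norm2.sqrt() + 1e-8)

def dynamic_routing(u_hat: torch.Tensor, num_iters: int = 3) -> torch.Tensor:
    """u_hat: (n_in, n_out, d) votes û_{j|i} = W_{ij} u_i; returns (n_out, d) output capsules."""
    n_in, n_out, _ = u_hat.shape
    b = torch.zeros(n_in, n_out)                         # routing logits b_{ij}
    for _ in range(num_iters):
        c = torch.softmax(b, dim=1)                      # coupling coefficients c_{ij}
        s = torch.einsum("ij,ijd->jd", c, u_hat)         # s_j = Σ_i c_{ij} û_{j|i}
        v = squash(s)                                    # v_j = squash(s_j)
        b = b + torch.einsum("ijd,jd->ij", u_hat, v)     # agreement update
    return v

# Example: aggregate 20 encoded sequence elements into 4 output capsules of size 16.
votes = torch.randn(20, 4, 16)
capsules = dynamic_routing(votes)
print(capsules.shape)   # torch.Size([4, 16])
```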
Analogously, Poly-PRAG routes passage-specific representations into K shared LoRA experts using a learned mixing function, performing a simplex-weighted sum rather than iterative agreement-maximization. Both frameworks replace static or naively pooled representations with a flexible, trainable routing process that enables parameter-sharing and semantic compositionality (Gong et al., 2018, Su et al., 21 Nov 2025).
6. Limitations and Future Directions
- Corpus Dependence: Poly-PRAG requires enumerating and training over a fixed passage set. Adding new passages necessitates re-training of the routing weights and potentially the expert pool; open-vocabulary or zero-shot-capable routing remains an open problem.
- Lack of Semantic Interpretability: Latent expert adapters are not semantically labeled or constrained. Interpretability and topic-disentanglement in expert formation are avenues for future research.
- Efficiency–Accuracy Tradeoff: Increasing K yields diminishing F1 returns past a certain point. There exists an operational tradeoff between storage efficiency and model expressivity; this boundary is dataset-dependent.
A plausible implication is that further extensions could incorporate query-adaptive routing mechanisms or jointly-learned topic-aware expert decompositions, enhancing both flexibility and interpretability (Su et al., 21 Nov 2025).
7. Summary and Impact
Poly-PRAG's latent routing encoding compresses the representation of a vast document corpus into a compact set of shared parameter-efficient experts, with a learned routing map specifying document-to-expert allocation. This approach yields 100-fold reductions in storage and substantial improvements in both training and inference speed compared to one-to-one adapter methods, while achieving state-of-the-art results on multiple knowledge-intensive reasoning tasks (Su et al., 21 Nov 2025). Its theoretical underpinnings and design share key principles with dynamic routing approaches for sequence aggregation, highlighting a deepening unification between adaptive, agreement-driven representation learning and scalable, low-rank adaptation strategies for LLMs (Gong et al., 2018).