Poly-PRAG: Latent Routing for Scalable LLM Adaptation
- Poly-PRAG is a latent routing encoding method that combines dynamic routing with parameter-efficient LoRA adapters to encode document representations.
- It uses a many-to-few mapping strategy by employing a small, trainable pool of K latent experts and a routing function to significantly reduce storage and inference costs.
- Extensive experiments on QA datasets demonstrate that Poly-PRAG achieves state-of-the-art F1 improvements while lowering the parameter footprint compared to traditional PRAG methods.
Latent routing encoding, realized in Poly-PRAG, is a framework for scalable parametric retrieval-augmented generation (RAG) that unifies innovations from dynamic routing (as in capsule networks and sequence encoding) and parameter-efficient adaptation of LLMs via Low-Rank Adaptation (LoRA) modules. It addresses the scalability and efficiency limitations of previous parametric RAG systems by replacing the one-adapter-per-document regime with a small, trainable pool of latent experts and a routing function that together encode and decode the entire document space with high parametric efficiency (Su et al., 21 Nov 2025).
1. Latent Routing in PRAG: Core Mechanism and Motivation
Poly-PRAG builds on the PRAG paradigm for LLMs, where external knowledge is injected via LoRA adapters directly into model weights. Traditional PRAG employs a one-to-one mapping: each document receives its own LoRA adapter. This approach is hindered by data scarcity (limited document-specific training examples) and prohibitive inference costs (requiring the constant loading and merging of distinct adapters per retrieval).
Poly-PRAG introduces a many-to-few mapping: instead of assigning each document its own adapter, it defines K latent expert adapters and a routing function r: 𝒯 → Δ^{K−1} that maps each document to a point on the (K−1)-simplex. Passage encoding is thus realized as the soft or sparse combination of these expert adapters, enabling order-of-magnitude reductions in both parameter count and inference cost.
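As a minimal sketch of this many-to-few mapping (toy dimensions and random values, not the paper's configuration): the routing logits for a document are softmax-normalized onto the simplex and used to mix the shared expert factors.

```python
import torch

K, d, r = 4, 8, 2                              # experts, hidden size, LoRA rank (toy values)
experts = [(torch.randn(d, r), torch.randn(d, r)) for _ in range(K)]   # (A_i, B_i) pairs

z_t = torch.randn(K)                           # routing logits for one document t
alpha = torch.softmax(z_t, dim=0)              # α(t) on the simplex: non-negative, sums to 1

# ΔW_t = (Σ α_i A_i)(Σ α_i B_i)^T: one small update mixed from the shared expert pool
A_t = sum(a * Ai for a, (Ai, _) in zip(alpha, experts))
B_t = sum(a * Bi for a, (_, Bi) in zip(alpha, experts))
delta_W_t = A_t @ B_t.T
print(alpha.sum().item(), delta_W_t.shape)     # ~1.0, torch.Size([8, 8])
```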
2. Offline Encoding: Multi-Task Training and Adapter Construction
During the offline phase, Poly-PRAG treats each passage as a unique task, learning:
- A pool of K LoRA expert adapters {(A_i, B_i)}_{i=1…K}, each stored in factored form ΔW_i = A_i B_i^⊺ with rank r ≪ d.
- Task-specific routing vectors z_t ∈ ℝ^K for all passages t ∈ 𝒯, collected row-wise into a routing matrix Z ∈ ℝ^{|𝒯|×K}.
For document t, the routing coefficients are computed via (Gumbel-)softmax, α(t) = softmax(z_t / τ), where τ controls the temperature. The document-specific LoRA perturbation is assembled as ΔW_t = A^t (B^t)^⊺ with A^t = ∑_{i=1}^K α_i(t) A_i and B^t = ∑_{i=1}^K α_i(t) B_i. Training proceeds via multi-task optimization: for each document t and each (input, target) pair (x, y) from its synthetic QA-augmented dataset 𝒟_t, gradients flow into both the expert adapters and the routing weights via the log-likelihood loss ∑_{(x,y)∈𝒟_t} −log P_{W₀+ΔW_t}(y|x), optionally regularized with penalties on adapter parameters.
Training Pseudocode
```
Input: Corpus passages {t=1…|𝓣|}, synthetic QA sets 𝒟_t
Initialize: {A_i,B_i}_i=1…K, Z∈ℝ^{|𝓣|×K} (routing logits)
For epoch = 1…N_epochs:
For t in 1…|𝓣|:
Sample minibatch M_t ⊂ 𝒟_t
z_t ← Z[t,:]
α(t) ← softmax(z_t / τ)
A^t ← ∑_{i=1}^K α_i(t) A_i; B^t ← ∑_{i=1}^K α_i(t) B_i
ΔW_t ← A^t(B^t)^⊺
Loss L = ∑_{(x,y)∈M_t} −log P_{W₀+ΔW_t}(y|x)
Backprop L (w.r.t. {A_i,B_i}, Z[t,:])
Return {A_i,B_i} and Z
```
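For concreteness, the following is a minimal PyTorch-style sketch of the training loop above applied to a single stand-in weight matrix; the dimensions, the MSE stand-in for the log-likelihood loss, and the random data are assumptions for illustration, not the reference implementation.

```python
import torch

# Toy dimensions (assumptions for illustration, not the paper's settings).
d, r, K, num_docs = 64, 4, 8, 100
tau = 1.0

# Shared pool of K LoRA experts in factored form, ΔW_i = A_i B_i^T.
A = torch.nn.Parameter(0.01 * torch.randn(K, d, r))
B = torch.nn.Parameter(0.01 * torch.randn(K, d, r))
# Routing logits Z, one row per document (task).
Z = torch.nn.Parameter(torch.zeros(num_docs, K))

W0 = torch.randn(d, d)                     # frozen base weight (stand-in for one LLM layer)
opt = torch.optim.Adam([A, B, Z], lr=1e-3)

def delta_w(doc_id: int) -> torch.Tensor:
    """Assemble the document-specific low-rank update ΔW_t from the shared experts."""
    alpha = torch.softmax(Z[doc_id] / tau, dim=-1)    # α(t) on the simplex
    A_t = torch.einsum("k,kdr->dr", alpha, A)         # A^t = Σ_i α_i(t) A_i
    B_t = torch.einsum("k,kdr->dr", alpha, B)         # B^t = Σ_i α_i(t) B_i
    return A_t @ B_t.T                                # ΔW_t = A^t (B^t)^T

# One multi-task pass: each document contributes a toy (x, y) batch standing in for 𝒟_t.
for doc_id in range(num_docs):
    x, y = torch.randn(16, d), torch.randn(16, d)     # stand-ins for encoded QA pairs
    W = W0 + delta_w(doc_id)
    loss = torch.nn.functional.mse_loss(x @ W, y)     # stand-in for −log P_{W0+ΔW_t}(y|x)
    opt.zero_grad()
    loss.backward()                                   # gradients reach A, B and Z[doc_id]
    opt.step()
```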
3. Online Inference: Efficient Routing and Query-Time Integration
At inference, Poly-PRAG avoids the repeated loading/unloading of myriad adapters:
- Retrieve the top-c relevant passages {t₁, …, t_c} for query q.
- For each retrieved passage t_j, obtain its routing logits z_{t_j} (or the precomputed α(t_j)).
- Optionally refine the routing using a representation of the query.
- Compute the merged adapters per passage: A^{t_j} = ∑_{i=1}^K α_i(t_j) A_i, B^{t_j} = ∑_{i=1}^K α_i(t_j) B_i, and ΔW^{t_j} = A^{t_j}(B^{t_j})^⊺.
- Aggregate across passages: ΔW_total = ∑_{j=1}^c ΔW^{t_j}.
- Update the weights and decode: W = W₀ + ΔW_total, y* = argmax_y P_W(y | q).
Two routing modes are supported: "soft" (a full mixture over all K experts) and "hard" (keep only the top-m routing coefficients and zero the others), the latter further reducing inference overhead; a minimal sketch of the hard mode follows.
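The sketch below illustrates the hard (top-m) routing mode under the same toy setup as above; renormalizing the surviving coefficients back onto the simplex is an implementation choice here, not something prescribed by the source.

```python
import torch

def hard_route(z_t: torch.Tensor, m: int, tau: float = 1.0) -> torch.Tensor:
    """Keep only the top-m routing coefficients and renormalize them."""
    alpha = torch.softmax(z_t / tau, dim=-1)
    topm = torch.topk(alpha, m)
    sparse = torch.zeros_like(alpha)
    sparse[topm.indices] = topm.values
    return sparse / sparse.sum()          # stays on the simplex

# Example: route a document to its 2 most relevant experts out of K = 8.
z = torch.randn(8)
print(hard_route(z, m=2))
```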
Online Routing and Generation
```
Input: query q
Retrieve top-c passages {t₁,…,t_c}
For each t_j:
get z_{t_j} from Z
α(t_j) ← softmax(z_{t_j}/τ)
A^{t_j} ← ∑_{i=1}^K α_i(t_j) A_i; B^{t_j} ← ∑_{i=1}^K α_i(t_j) B_i
ΔW^{t_j} ← A^{t_j}(B^{t_j})^⊺
ΔW_total ← ∑_{j=1}^c ΔW^{t_j}
W ← W₀ + ΔW_total
Generate answer y* = argmax_y P_W(y | q)
```
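Continuing the toy setup from the training sketch above, a sketch of the query-time merge; the retrieved passage ids and the decoding step are placeholders for a real retriever and base LLM.

```python
def merge_for_query(doc_ids):
    """Aggregate the per-passage low-rank updates for the retrieved passages (toy setup above)."""
    dW_total = sum(delta_w(t) for t in doc_ids)   # ΔW_total = Σ_j ΔW^{t_j}
    return W0 + dW_total                          # W = W0 + ΔW_total

retrieved = [3, 17, 42]            # stand-in for the ids returned by a top-c retriever
W_query = merge_for_query(retrieved)
# Decoding P_W(y | q) would then use W_query in place of the frozen layer weight W0.
```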
Efficiency Analysis
Storage is reduced from on the order of |𝒯| rank-r adapters under PRAG (one per document) to K shared experts plus a |𝒯|×K routing matrix under Poly-PRAG, with K ≪ |𝒯|. At inference, the merged update for a query is assembled directly from the small expert pool already resident in memory, so compute scales with K and the number of retrieved passages c rather than requiring c distinct per-document adapters to be loaded and merged.
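For intuition, a back-of-the-envelope storage comparison; the corpus size, hidden width, rank, expert count, and layer count below are all assumed for illustration.

```python
# Illustrative storage comparison between PRAG and Poly-PRAG (all settings assumed).
num_docs = 100_000        # |𝒯|
d, r, K = 4096, 4, 16     # hidden size, LoRA rank, number of latent experts
layers = 32               # number of adapted layers

params_per_adapter = 2 * d * r * layers            # A and B factors across adapted layers
prag = num_docs * params_per_adapter               # one adapter per document
poly = K * params_per_adapter + num_docs * K       # K shared experts + routing matrix Z

print(f"PRAG:      {prag / 1e9:.1f} billion parameters")
print(f"Poly-PRAG: {poly / 1e6:.1f} million parameters (~{prag / poly:,.0f}x smaller)")
```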
4. Key Experimental Findings and Ablations
Extensive benchmarking across knowledge-intensive QA datasets (2WikiMultihopQA, HotpotQA, PopQA, ComplexWebQuestions) with LLaMa3 and Qwen2.5 LLMs shows significant F1 improvements and an orders-of-magnitude reduction in storage:
| Base LLM | Method | Avg F1 (%) |
|---|---|---|
| LLaMa3.2-1B | Vanilla | 22.8 |
| | Standard RAG | 27.5 |
| | PRAG | 27.0 |
| | DyPRAG | 28.8 |
| | Poly-PRAG | 32.7 (+3.9) |
| Qwen2.5-1.5B | PRAG | 28.2 |
| | Poly-PRAG | 30.3 (+2.1) |
| LLaMa3-8B | PRAG | 41.6 |
| | Poly-PRAG | 42.7 (+1.1) |
Ablation studies indicate that increasing the number of latent experts K improves F1 up to a moderate pool size, after which gains plateau or decrease slightly. In terms of storage, Poly-PRAG achieves comparable or superior F1 with roughly 1–4% of PRAG's required offline parameter storage (the implied compression ratios are computed after the list below):
- PRAG rank-1: 1020 MB, F1=13.1
- Poly-PRAG rank-1: 10 MB, F1=12.3
- Poly-PRAG rank-4: 42 MB, F1=15.1
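The compression ratios implied by the figures above follow from simple arithmetic (rank-4 Poly-PRAG is compared against rank-1 PRAG, as in the list):

```python
# Compression ratios implied by the reported offline storage figures.
prag_rank1_mb = 1020
poly_rank1_mb, poly_rank4_mb = 10, 42

print(f"rank-1: {prag_rank1_mb / poly_rank1_mb:.0f}x smaller "
      f"({poly_rank1_mb / prag_rank1_mb:.1%} of PRAG rank-1 storage)")
print(f"rank-4: {prag_rank1_mb / poly_rank4_mb:.0f}x smaller "
      f"({poly_rank4_mb / prag_rank1_mb:.1%} of PRAG rank-1 storage)")
```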
5. Connections: Dynamic Routing, Capsule Networks, and Sequence Aggregation
The latent routing process in Poly-PRAG is conceptually related to dynamic routing mechanisms developed for capsule networks and advanced sequence encoding (Gong et al., 2018). In the latter, sequence elements, encoded into vectors u_i (via BiLSTM or CNN), are aggregated into a small set of higher-level "capsules" v_j via an iterative routing-by-agreement process, parameterized by routing logits b_{ij} and coupling coefficients c_{ij}. The core steps (a code sketch follows the list):
- Compute coupling coefficients c_{ij} via a softmax over the routing logits b_{ij}.
- Each input element u_i sends a linearly transformed "vote" û_{j|i} = W_{ij} u_i to capsule j.
- Capsules aggregate these messages as s_j = ∑_i c_{ij} û_{j|i}, apply a nonlinear squash v_j = squash(s_j), and the routing logits are updated via the agreement b_{ij} ← b_{ij} + û_{j|i} · v_j.
- After a fixed number of routing iterations, the output capsules are concatenated for downstream use.
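A compact sketch of routing-by-agreement in this style, with illustrative dimensions and iteration count:

```python
import torch

def squash(s: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Nonlinear squashing: shrinks short vectors toward 0 and long vectors toward unit length."""
    norm2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / (norm2.sqrt() + 1e-8)

def dynamic_routing(u_hat: torch.Tensor, num_iters: int = 3) -> torch.Tensor:
    """u_hat: (n_in, n_out, d) votes û_{j|i} = W_{ij} u_i; returns (n_out, d) output capsules."""
    n_in, n_out, _ = u_hat.shape
    b = torch.zeros(n_in, n_out)                         # routing logits b_{ij}
    for _ in range(num_iters):
        c = torch.softmax(b, dim=1)                      # coupling coefficients c_{ij}
        s = torch.einsum("ij,ijd->jd", c, u_hat)         # s_j = Σ_i c_{ij} û_{j|i}
        v = squash(s)                                    # v_j = squash(s_j)
        b = b + torch.einsum("ijd,jd->ij", u_hat, v)     # agreement update
    return v

# Example: aggregate 20 encoded sequence elements into 4 output capsules of size 16.
votes = torch.randn(20, 4, 16)
capsules = dynamic_routing(votes)
print(capsules.shape)   # torch.Size([4, 16])
```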
Analogously, Poly-PRAG routes passage-specific representations into K shared LoRA experts using a learned mixing function, performing a simplex-weighted sum rather than iterative agreement-maximization. Both frameworks replace static or naively pooled representations with a flexible, trainable routing process that enables parameter-sharing and semantic compositionality (Gong et al., 2018, Su et al., 21 Nov 2025).
6. Limitations and Future Directions
- Corpus Dependence: Poly-PRAG requires enumerating and training over a fixed passage set. Adding new passages necessitates re-training of the routing weights and potentially the expert pool; open-vocabulary or zero-shot-capable routing remains an open problem.
- Lack of Semantic Interpretability: Latent expert adapters are not semantically labeled or constrained. Interpretability and topic-disentanglement in expert formation are avenues for future research.
- Efficiency–Accuracy Tradeoff: Increasing K yields diminishing F1 returns past a certain point. There exists an operational tradeoff between storage efficiency and model expressivity; this boundary is dataset-dependent.
A plausible implication is that further extensions could incorporate query-adaptive routing mechanisms or jointly-learned topic-aware expert decompositions, enhancing both flexibility and interpretability (Su et al., 21 Nov 2025).
7. Summary and Impact
Poly-PRAG's latent routing encoding compresses the representation of a vast document corpus into a compact set of shared parameter-efficient experts, with a learned routing map specifying document-to-expert allocation. This approach yields 100-fold reductions in storage and substantial improvements in both training and inference speed compared to one-to-one adapter methods, while achieving state-of-the-art results on multiple knowledge-intensive reasoning tasks (Su et al., 21 Nov 2025). Its theoretical underpinnings and design share key principles with dynamic routing approaches for sequence aggregation, highlighting a deepening unification between adaptive, agreement-driven representation learning and scalable, low-rank adaptation strategies for LLMs (Gong et al., 2018).