MoPo: Momentum Posterior Regularization
- MoPo is a multi-hop dense retrieval framework that uses query-focused hop-wise summaries and momentum updates to stabilize knowledge distillation.
- It replaces traditional answer-based supervision with intermediate, query-specific summaries, reducing semantic drift during multi-hop retrieval.
- Empirical evaluations on HotpotQA and StrategyQA demonstrate substantial gains in recall and exact match without increasing inference costs.
Momentum Posterior Regularization (MoPo) is a framework for multi-hop dense retrieval in open-domain question answering (QA) that enables stable, effective knowledge distillation from a posterior retriever possessing oracle information into a practical prior retriever used at inference time. MoPo addresses the key challenges of posterior regularization in multi-hop settings by (1) replacing answer-based supervision with hop-wise, query-focused summaries and (2) introducing a momentum-based parameter update that constrains the teacher–student knowledge gap during optimization. Extensive empirical results on HotpotQA and StrategyQA demonstrate substantial performance improvements over competitive baselines without increasing inference cost (Xia et al., 2024).
1. Problem Setting and Model Architecture
In multi-hop dense retrieval, given an open-domain question $q$ and a large corpus $\mathcal{C}$, the task is to retrieve a sequence of passages $p_1, \dots, p_n$ such that all requisite knowledge for answering $q$ is gathered. The joint retrieval probability is factorized hop by hop as
$$P(p_1, \dots, p_n \mid q) = \prod_{t=1}^{n} P(p_t \mid q_t),$$
where $q_t$ denotes the reformulated query at hop $t$.
At each hop $t$, the query is reformulated as $q_t = g(q, s_{<t})$, where $g$ is a fixed function that concatenates the original question $q$ with a short summary $s_{<t}$ describing the passages retrieved up to hop $t$. The prior retriever (parameters $\theta$) is deployed at inference with access only to $q_t$, whereas the posterior retriever (parameters $\phi$) is used in training and is privileged with access to a query-focused summary reflecting gold knowledge of both previous and current hops.
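A minimal sketch of the query-reformulation step described above. The function name, the `[SEP]` delimiter, and plain string concatenation are illustrative assumptions, not the paper's exact implementation:

```python
def reformulate_query(question: str, summaries: list[str]) -> str:
    """Build the hop-t query by concatenating the original question
    with the running hop-wise summaries of previously retrieved passages."""
    if not summaries:
        return question  # hop 1: no prior context yet
    return question + " [SEP] " + " ".join(summaries)
```

The prior retriever would call this with predicted summaries at inference, while the posterior retriever receives summaries grounded in gold passages during training.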
2. Innovations in Posterior Regularization
2.1 Hop-wise Query-Focused Summaries
Conventional one-hop posterior regularization uses the final answer as posterior information for the teacher model. In multi-hop retrieval, the final answer is often decoupled from intermediate hops, rendering this approach ineffective. MoPo introduces hop-wise query-focused summaries that (i) fuse the current gold passage with preceding contextual summaries, and (ii) explicitly condition on the original question $q$. These summaries function as compact, semantically anchored proxies for the gold context required to inform each retrieval step, and help avoid the semantic drift typical of vanilla document concatenation.
2.2 Momentum-Based Posterior Updates
Standard knowledge distillation schedules pretrain the posterior model to convergence and then distill into the prior. This produces a significant divergence (the "knowledge gap") between $\phi$ and $\theta$, causing unstable KL gradients or even performance degradation. MoPo instead introduces a momentum update for the posterior parameters:
$$\phi \leftarrow m\,\phi + (1 - m)\,\theta,$$
where $0 < m < 1$ is the momentum coefficient. This exponential moving average maintains proximity between posterior and prior parameters throughout training, enabling a stable, smoothly varying KL regularization signal.
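The EMA update above can be sketched parameter-by-parameter; plain float dicts stand in for model weight tensors, and the default $m$ is illustrative:

```python
def momentum_update(phi: dict, theta: dict, m: float = 0.99) -> dict:
    """EMA update phi <- m * phi + (1 - m) * theta, applied per parameter.
    High m keeps the posterior (teacher) moving slowly toward the prior."""
    return {name: m * phi[name] + (1.0 - m) * theta[name] for name in phi}
```

Because the update mixes only a small fraction of $\theta$ into $\phi$ at each step, the teacher's retrieval distribution changes smoothly rather than jumping away from the student's.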
3. Mathematical Formalization
Both retrievers are dual-encoder models mapping queries (the prior's $q_t$ or the posterior's summary-augmented $\tilde{q}_t$) and documents $d$ into vector embeddings scored by dot product $s(\cdot, d) = E_Q(\cdot)^{\top} E_D(d)$. At each hop $t$:
- The prior defines $P_\theta(d \mid q_t) \propto \exp\big(s_\theta(q_t, d)\big)$.
- The posterior analogously defines $P_\phi(d \mid \tilde{q}_t) \propto \exp\big(s_\phi(\tilde{q}_t, d)\big)$.
The overall objective for a batch of question–passage–summary triples is
$$\mathcal{L} = \mathcal{L}_{\mathrm{NCE}}(\theta) + \lambda \sum_{t} \mathrm{KL}\!\big(P_\phi(\cdot \mid \tilde{q}_t) \,\big\|\, P_\theta(\cdot \mid q_t)\big),$$
where $\mathcal{L}_{\mathrm{NCE}}$ is the standard multi-hop contrastive loss using in-batch negatives and $\lambda$ weights the KL regularizer. Only $\theta$ receives direct gradient updates; $\phi$ is updated via the momentum rule above.
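The hop-wise distributions and the KL term can be made concrete with a small self-contained sketch over a toy candidate set (the score values are illustrative, not from the paper):

```python
import math

def softmax(scores):
    """Convert dot-product scores into a retrieval distribution."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The posterior (teacher) query sees the gold summary, so its scores are
# sharper; the prior (student) distribution is regularized toward it.
posterior = softmax([4.0, 1.0, 0.5])
prior = softmax([2.5, 1.5, 1.0])
kl_term = kl_divergence(posterior, prior)
```

Minimizing `kl_term` with respect to the prior's parameters pulls the student's ranking toward the summary-informed teacher's at every hop.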
4. Training Procedure and Implementation
Training proceeds as follows: initialize $\phi \leftarrow \theta$; for each batch, at each hop, (a) build the prior query $q_t$ and the summary-augmented posterior query $\tilde{q}_t$; (b) compute prior and posterior logits over candidate passages; (c) evaluate $\mathcal{L}_{\mathrm{NCE}}$ and the hop-wise KL term; (d) update $\theta$ via Adam and $\phi$ via the momentum rule. Passage negatives are sampled in-batch from other questions' positives or randomly from the corpus. The loss is summed over all hops, and experiments show that relatively high momentum values are optimal; lower $m$ increases the risk of destabilizing the exponential averaging.
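The single-hop inner step of this procedure can be sketched end to end with a toy linear scorer. This is a deliberately simplified stand-in: scores are dot products of a parameter vector with joint query–passage features, vanilla gradient descent replaces Adam, and all feature values are invented for illustration:

```python
import math

def softmax(scores):
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_step(theta, phi, prior_feats, post_feats, gold_idx,
               lam=0.5, lr=0.1, m=0.99):
    """One hop of the one-stage MoPo objective on a toy linear scorer.
    prior_feats / post_feats: per-candidate feature vectors for the
    prior query q_t and the summary-augmented posterior query."""
    # (b) student and teacher retrieval distributions over candidates.
    p_s = softmax([dot(theta, x) for x in prior_feats])
    p_t = softmax([dot(phi, x) for x in post_feats])

    n, dim = len(prior_feats), len(theta)
    # (c) gradient of the contrastive loss -log p_s[gold_idx] w.r.t. theta.
    g_nce = [sum(p_s[i] * prior_feats[i][k] for i in range(n))
             - prior_feats[gold_idx][k] for k in range(dim)]
    # Gradient of KL(p_t || p_s) w.r.t. theta (only the student moves).
    mean_s = [sum(p_s[i] * prior_feats[i][k] for i in range(n)) for k in range(dim)]
    mean_t = [sum(p_t[i] * prior_feats[i][k] for i in range(n)) for k in range(dim)]
    g_kl = [ms - mt for ms, mt in zip(mean_s, mean_t)]

    # (d) gradient step on theta; EMA momentum step on phi.
    theta = [w - lr * (gn + lam * gk) for w, gn, gk in zip(theta, g_nce, g_kl)]
    phi = [m * wp + (1 - m) * wt for wp, wt in zip(phi, theta)]
    return theta, phi
```

Iterating this step, the student's score for the gold candidate rises while the teacher drifts only slowly, mirroring the stability argument in Section 2.2.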
5. Empirical Evaluation
MoPo is evaluated on HotpotQA and StrategyQA for multi-hop retrieval and downstream QA performance.
- Retrieval: On HotpotQA, MoPo outperforms iterative DPR (MDR) on both recall@2 and EM@2; on StrategyQA, MoPo likewise surpasses MDR on both metrics.
- Posterior regularization baselines: Two-stage PR_fixed and PR_dyn underperform both the MDR_sum (concatenation+summarization) baseline and MoPo, confirming that naïve KL regularization is unstable for multi-hop QA.
- Ablations: MoPo’s performance is robust to the choice of KL weight (a far smaller performance gap across settings than PR_fixed) and is best with a high momentum coefficient.
- Downstream QA and reranking: With a lightweight cross-encoder reranker, MoPo achieves strong EM@2 on HotpotQA full-wiki, outperforming BeamDR and Chain-of-Skills. In generation (retrieval→rerank→Flan-T5), MoPo matches the prior state-of-the-art joint EM and exceeds the original MDR. On a 100-sample HotpotQA subset, MoPo's answer F1 surpasses that of BeamAggR (Xia et al., 2024).
6. Discussion, Insights, and Limitations
The use of momentum to couple the posterior model to a moving average of the prior prevents divergence between the models (a "runaway" teacher), keeps the KL teacher signal consistent, and yields smooth, rapid convergence. Query-focused summaries as posterior information provide strong semantic anchors, reducing drift and improving retrieval accuracy at each hop. MoPo generalizes effectively: training on HotpotQA summaries alone substantially increases StrategyQA EM over the MDR baseline. A plausible implication is that this hop-wise distillation may facilitate transfer to new domains with similar reasoning structures.
Persisting limitations include the absence of a complete theoretical analysis of how momentum regularization constrains the teacher–student divergence, and incomplete benchmarking against LLM-based multi-hop retrievers at inference. Further work could clarify the relationship between the smoothing hyperparameter and convergence stability, and evaluate potential integration of MoPo with retrieval-augmented generation frameworks.
7. Summary Table: MoPo Algorithm Components
| Component | Role in MoPo | Notes on Contrast/Baselines |
|---|---|---|
| Query-focused summaries | Compact, hop-wise posterior info at train time | Avoids limitations of answer-based supervision |
| Momentum posterior update | Couples posterior (φ) to prior (θ) via EMA | Prevents “student-teacher drift” |
| Dual-encoder scoring | Dot-product retrieval over queries and documents | Standard in dense retrieval |
| InfoNCE + KL regularization | One-stage loss with hop-aggregated KL | Unstable in two-stage (non-MoPo) PR |
The MoPo framework provides a robust and scalable recipe for regularized multi-hop retrieval and QA through hop-wise posterior summarization and momentum-constrained distillation, offering measurable benefits on multi-hop benchmarks without added inference overhead (Xia et al., 2024).