
MoPo: Momentum Posterior Regularization

Updated 10 February 2026
  • MoPo is a multi-hop dense retrieval framework that uses query-focused hop-wise summaries and momentum updates to stabilize knowledge distillation.
  • It replaces traditional answer-based supervision with intermediate, query-specific summaries, reducing semantic drift during multi-hop retrieval.
  • Empirical evaluations on HotpotQA and StrategyQA demonstrate substantial gains in recall and exact match without increasing inference costs.

Momentum Posterior Regularization (MoPo) is a framework for multi-hop dense retrieval in open-domain question answering (QA) that enables stable, effective knowledge distillation from a posterior retriever possessing oracle information into a practical prior retriever used at inference time. MoPo addresses the key challenges of posterior regularization in multi-hop settings by (1) replacing answer-based supervision with hop-wise, query-focused summaries and (2) introducing a momentum-based parameter update that constrains the teacher–student knowledge gap during optimization. Extensive empirical results on HotpotQA and StrategyQA demonstrate substantial performance improvements over competitive baselines without increasing inference cost (Xia et al., 2024).

1. Problem Setting and Model Architecture

In multi-hop dense retrieval, given an open-domain question $q$ and a large corpus $\mathcal{D}$, the task is to retrieve a sequence of $L$ passages $D_\text{seq} = \{d_1, \ldots, d_L\}$ such that all requisite knowledge for answering $q$ is gathered. The joint probability is factorized as

$$P_\theta(D_\text{seq} \mid q) = \prod_{t=1}^{L} P_\theta(d_t \mid q, d_1, \ldots, d_{t-1})$$

At each hop $t$, the query is reformulated as $q_t = G_s(q_{t-1}, d_{t-1})$, where $G_s$ is a fixed function that concatenates the original $q$ with a short summary $s_{t-1}$ describing the passages retrieved up to hop $t-1$. The prior retriever $M_\theta$ (parameters $\theta$) is deployed at inference with access only to $q_t$, whereas the posterior retriever $M_\varphi$ (parameters $\varphi$) is used during training and is privileged with access to the query-focused summary $s_t$, which reflects gold knowledge of both previous and current hops.
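The hop-wise reformulation can be sketched as a toy retrieval loop. The encoder, summarizer, and corpus below are illustrative stand-ins, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = ["passage about A", "passage about B", "passage about C"]
doc_emb = {d: rng.standard_normal(8) for d in corpus}  # toy passage embeddings

def encode(text):
    # Stand-in query encoder: average embedding of corpus passages
    # mentioned verbatim in the text (purely illustrative).
    hits = [doc_emb[d] for d in corpus if d in text]
    return np.mean(hits, axis=0) if hits else rng.standard_normal(8)

def summarize(q, retrieved):
    # Stand-in for the query-focused summarizer G_s: plain concatenation.
    return " ".join(retrieved)

def multi_hop_retrieve(q, L=2):
    retrieved, s_prev = [], ""
    for t in range(L):
        # q_t = q  concatenated with the running summary s_{t-1}
        q_t = q if not s_prev else f"{q} [SEP] {s_prev}"
        scores = {d: encode(q_t) @ e for d, e in doc_emb.items() if d not in retrieved}
        retrieved.append(max(scores, key=scores.get))
        s_prev = summarize(q, retrieved)
    return retrieved

hops = multi_hop_retrieve("question mentioning passage about A")
```

Each hop conditions the query on everything retrieved so far through the summary, which is the point of the reformulation: the retriever never sees raw concatenated documents, only $q$ plus a compact summary.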

2. Innovations in Posterior Regularization

2.1 Hop-wise Query-Focused Summaries

Conventional one-hop posterior regularization uses the final answer as posterior information for the teacher model. In multi-hop retrieval, the final answer is often decoupled from intermediate hops, rendering this approach ineffective. MoPo introduces hop-wise query-focused summaries $s_t$ that (i) fuse the current gold passage $d_t$ with preceding contextual summaries, and (ii) explicitly condition on the original question $q$. These summaries function as compact, semantically anchored proxies for the gold context required to inform each retrieval step, and help avoid the semantic drift typical of vanilla document concatenation.

2.2 Momentum-Based Posterior Updates

Standard knowledge distillation schedules pretrain the posterior model to convergence, then distill to the prior. This results in significant divergence ("knowledge gap") between $M_\theta$ and $M_\varphi$, causing unstable KL gradients or even performance degradation. MoPo introduces a momentum update for the posterior model:

$$\varphi^{(t)} \leftarrow m \cdot \varphi^{(t-1)} + (1-m) \cdot \theta^{(t-1)}$$

where $0 < m < 1$ is the momentum coefficient. This exponential moving average maintains proximity between posterior and prior parameters throughout training, enabling a stable, smoothly varying KL regularization signal.
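As a minimal numerical sketch (assuming a toy parameter vector and replacing the Adam step with a simple pull toward a fixed target), the update keeps $\varphi$ close behind $\theta$ throughout training:

```python
import numpy as np

m = 0.95                 # momentum coefficient, 0 < m < 1
theta = np.zeros(4)      # prior (student) parameters, gradient-updated
phi = theta.copy()       # posterior (teacher) initialized to the prior
target = np.ones(4)      # stand-in optimum that theta moves toward

gaps = []
for step in range(200):
    theta = theta + 0.1 * (target - theta)   # stand-in for an Adam step on theta
    phi = m * phi + (1 - m) * theta          # EMA: phi <- m*phi + (1-m)*theta
    gaps.append(float(np.linalg.norm(theta - phi)))
```

Because $\varphi$ is an exponential moving average of $\theta$, the teacher–student gap rises briefly and then shrinks toward zero, so the KL teacher signal stays bounded instead of diverging.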

3. Mathematical Formalization

Both $M_\theta$ and $M_\varphi$ are dual-encoder models mapping queries $x$ (either $q_t$ or $s_t$) and documents $d$ into vector embeddings $(e_x, e_d)$, scored by the dot product $f_\theta(x, d) = e_x^\top e_d$. At each hop $t$:

  • The prior defines $p_\theta(d_t \mid q_t) = \frac{\exp(f_\theta(q_t, d_t))}{\sum_{d \in \{d_t^+\} \cup \{d_t^-\}} \exp(f_\theta(q_t, d))}$
  • The posterior analogously defines $p_\varphi(d_t \mid s_t)$
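A toy instantiation of this scoring rule (embedding dimension, candidate count, and values are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over candidate scores.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
e_q = rng.standard_normal(8)            # query embedding e_x (toy)
docs = rng.standard_normal((4, 8))      # row 0: gold d_t^+, rows 1-3: negatives d_t^-

scores = docs @ e_q                     # f_theta(q_t, d) = e_x . e_d per candidate
p_theta = softmax(scores)               # p_theta(d_t | q_t) over the candidate set
```

The posterior $p_\varphi(d_t \mid s_t)$ would be computed identically, swapping in the summary encoder's embedding of $s_t$ for `e_q`.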

The overall objective for a batch $\mathcal{B}'$ of $(q, S_\text{seq}, D_\text{seq})$ triples is

$$\mathcal{L}(\theta) = \mathcal{L}_\text{InfoNCE}(\theta; \mathcal{B}') + \lambda\, \mathbb{E}_{(q, S_\text{seq}, D_\text{seq}) \in \mathcal{B}'} \Big[ \sum_{t=1}^{L} D_\text{KL}\big( p_\varphi(\cdot \mid s_t) \,\Vert\, p_\theta(\cdot \mid q_t) \big) \Big]$$

where $\mathcal{L}_\text{InfoNCE}$ is the standard multi-hop contrastive loss using in-batch negatives. Only $\theta$ receives direct gradient updates; $\varphi$ is updated via the momentum rule above.
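A worked toy computation of the two per-hop loss terms, under the assumption that candidate index 0 is the gold passage and the scores are illustrative:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # D_KL(p || q) for discrete distributions with full support.
    return float(np.sum(p * (np.log(p) - np.log(q))))

prior_scores = np.array([2.0, 0.5, -1.0, 0.3])      # f_theta(q_t, d), toy values
posterior_scores = np.array([3.5, 0.1, -1.2, 0.0])  # f_phi(s_t, d), toy values

p_theta = softmax(prior_scores)
p_phi = softmax(posterior_scores)

lam = 1.0
info_nce = -float(np.log(p_theta[0]))        # contrastive loss on the gold passage
hop_loss = info_nce + lam * kl(p_phi, p_theta)   # KL(p_phi || p_theta) term added
```

Note the KL direction: the (privileged) posterior is the reference distribution, so the prior is pulled toward what the teacher believes given $s_t$.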

4. Training Procedure and Implementation

Training proceeds as follows: initialize $\varphi = \theta$; for each batch, at each hop, (a) build $q_t$ as $q \oplus s_{t-1}$; (b) compute prior and posterior logits over candidate passages; (c) evaluate $\mathcal{L}_\text{InfoNCE}$ and the hop-wise KL term; (d) update $\theta$ via Adam and $\varphi$ via the momentum rule. Passage negatives $d_t^-$ are sampled in-batch from other examples' positives or randomly from the corpus. The loss is summed over all hops, and experiments show that momentum values $m \in [0.9, 0.99]$ are optimal; lower $m$ increases the risk of destabilizing the exponential averaging.
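Steps (a)–(d) can be combined into a runnable toy loop (linear query encoders, one hop and one example per step, plain SGD in place of Adam; all names, shapes, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_cand, m, lam, lr = 6, 4, 0.9, 1.0, 0.05

theta = rng.standard_normal((dim, dim)) * 0.1   # prior query-encoder weights
phi = theta.copy()                              # posterior initialized to prior

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(200):
    # One toy example per step; candidate 0 plays the gold passage d_t^+.
    q_feat = rng.standard_normal(dim)                  # features of q_t
    s_feat = q_feat + 0.1 * rng.standard_normal(dim)   # privileged s_t features
    docs = rng.standard_normal((n_cand, dim))          # candidate embeddings

    p_theta = softmax(docs @ (theta @ q_feat))         # prior distribution
    p_phi = softmax(docs @ (phi @ s_feat))             # posterior distribution

    # Gradient of InfoNCE + lam*KL(p_phi || p_theta) w.r.t. the prior logits:
    onehot = np.zeros(n_cand); onehot[0] = 1.0
    g_logits = (p_theta - onehot) + lam * (p_theta - p_phi)
    theta -= lr * np.outer(docs.T @ g_logits, q_feat)  # (d) SGD stand-in for Adam
    phi = m * phi + (1 - m) * theta                    # momentum update on phi

final_gap = float(np.linalg.norm(theta - phi))
```

Only `theta` receives gradient updates, exactly as in the objective; `phi` trails it through the EMA, which is what keeps the KL term well-behaved in one-stage training.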

5. Empirical Evaluation

MoPo is evaluated on HotpotQA and StrategyQA for multi-hop retrieval and downstream QA performance.

  • Retrieval: On HotpotQA, MoPo achieves recall@2 of 94.77% and EM@2 of 63.03%, outperforming iterative DPR (MDR) (recall@2 94.34%, EM@2 55.96%). On StrategyQA, MoPo attains recall@2 of 43.36% and EM@2 of 31.91%, compared to MDR's 42.64% and 25.31%.
  • Posterior regularization baselines: Two-stage PR_fixed and PR_dyn underperform both the MDR_sum (concatenation+summarization) baseline and MoPo, confirming that naïve KL regularization is unstable for multi-hop QA.
  • Ablations: MoPo’s performance is robust to $\lambda$ (performance gap ≤3‰ vs. >7‰ for PR_fixed) and is best with momentum $m \in [0.9, 0.99]$.
  • Downstream QA and reranking: With a lightweight cross-encoder reranker, MoPo achieves EM@2 of 89.4% on HotpotQA full-wiki, outperforming BeamDR and Chain-of-Skills. In generation (retrieval → rerank → Flan-T5), MoPo ties the prior SOTA joint EM of 45.7% and exceeds original MDR (41.8%). On a 100-sample HotpotQA subset, MoPo (F1 64.3%) outperforms BeamAggR (F1 62.9%) (Xia et al., 2024).

6. Discussion, Insights, and Limitations

The use of momentum to couple the posterior model to a moving average of the prior prevents divergence between the models (“runaway” teacher), keeps the KL teacher signal consistent, and yields smooth, rapid convergence. Query-focused summaries as posterior information provide strong semantic anchors, reducing drift and improving retrieval accuracy at each hop. MoPo generalizes effectively: training on HotpotQA summaries alone increases StrategyQA EM by more than 6% over the MDR baseline. A plausible implication is that this hop-wise distillation may facilitate transfer to new domains with similar reasoning structures.

Persisting limitations include the absence of a complete theoretical analysis of how momentum regularization constrains the teacher–student divergence, and incomplete benchmarking against LLM-based multi-hop retrievers at inference. Further work could clarify the relationship between the smoothing hyperparameter mm and convergence stability, and evaluate potential integration of MoPo with retrieval-augmented generation frameworks.

7. Summary Table: MoPo Algorithm Components

Component | Role in MoPo | Notes on Contrast/Baselines
Query-focused summaries | Compact, hop-wise posterior info at train time | Avoids limitations of answer-based supervision
Momentum posterior update | Couples posterior (φ) to prior (θ) via EMA | Prevents “student-teacher drift”
Dual-encoder scoring | Dot-product retrieval over queries and documents | Standard in dense retrieval
InfoNCE + KL regularization | One-stage loss with hop-aggregated KL | Unstable in two-stage (non-MoPo) PR

The MoPo framework provides a robust and scalable recipe for regularized multi-hop retrieval and QA through hop-wise posterior summarization and momentum-constrained distillation, offering measurable benefits on multi-hop benchmarks without added inference overhead (Xia et al., 2024).
