
MMOA-RAG: Joint Optimization for RAG Pipelines

Updated 5 October 2025
  • MMOA-RAG is a multi-module framework that leverages multi-agent reinforcement learning to treat each RAG component as an individual agent.
  • It employs MAPPO to coordinate query rewriting, document selection, and answer generation, addressing misaligned module objectives with a unified F1-based reward.
  • Empirical results show significant improvements in F1 scores on datasets like HotpotQA, demonstrating robust adaptability and modular extension capabilities.

A Multi-Module joint Optimization Algorithm for Retrieval-Augmented Generation (MMOA-RAG) is a framework that conceptualizes the standard RAG pipeline as a cooperative multi-agent system, with each key module—query rewriting, retrieval, document selection, and answer generation—treated as an individual reinforcement learning agent. The design addresses the central challenge of misaligned module-level objectives in traditional pipelines, where supervised fine-tuning of isolated components often fails to maximize end-to-end answer accuracy. Instead, MMOA-RAG uses multi-agent reinforcement learning (MARL), specifically multi-agent proximal policy optimization (MAPPO), to harmonize all agent objectives under a unified reward tied directly to the QA target metric (F1 score). This paradigm enables robust, adaptive optimization across the retrieval-augmented generation workflow.

1. Architectural Formulation and Agent Interaction

MMOA-RAG implements a full RAG stack as a sequence of interacting agents: the Query Rewriter, Selector, and Generator are treated as parameterized RL agents, while the Retriever is regarded as part of the static environment (i.e., a non-learning dense retriever using vector search). The workflow is defined as follows:

  • The Query Rewriter processes the initial question $q$ (given prompt $Prompt_{\text{QR}}$) and produces sub-questions $subq = \text{QR}(q, Prompt_{\text{QR}})$ to enhance retrievability and disambiguate complex queries.
  • The Retriever receives $subq$ and performs dense retrieval over the external corpus, returning $K$ candidate documents.
  • The Selector, based on the original question $q$, the prompt $Prompt_{\text{S}}$, and the candidate set $D$, outputs a refined subset $D_{\text{selected}} = \text{S}(q, Prompt_{\text{S}}, D)$, focusing the context for generation.
  • The Generator then produces the final answer from $D_{\text{selected}}$, $q$, and $Prompt_{\text{G}}$; a minimal code sketch of this workflow follows the list.
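
As a concrete illustration, the sketch below wires these four modules together as a sequential pipeline. The `Agent` interface, prompt placeholders, and `search` method are hypothetical assumptions used for exposition, not the actual MMOA-RAG code.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Agent(Protocol):
    """Hypothetical interface: each learnable module maps an observation to an action."""
    def act(self, observation: dict) -> object: ...


@dataclass
class RAGPipeline:
    """Sketch of the MMOA-RAG module chain.

    query_rewriter, selector, and generator are learnable RL agents;
    retriever is a fixed (non-learning) dense retriever, i.e. part of the environment.
    """
    query_rewriter: Agent
    retriever: object          # not trained; assumed to expose a search(queries, k) method
    selector: Agent
    generator: Agent
    top_k: int = 10

    def answer(self, question: str) -> str:
        # Query Rewriter: decompose the question into sub-questions.
        sub_questions: List[str] = self.query_rewriter.act(
            {"question": question, "prompt": "PROMPT_QR"}
        )
        # Retriever: dense retrieval of K candidate documents for the sub-questions.
        candidate_docs: List[str] = self.retriever.search(sub_questions, k=self.top_k)
        # Selector: keep only the documents useful for answering the question.
        selected_docs: List[str] = self.selector.act(
            {"question": question, "prompt": "PROMPT_S", "docs": candidate_docs}
        )
        # Generator: produce the final answer from the selected context.
        return self.generator.act(
            {"question": question, "prompt": "PROMPT_G", "docs": selected_docs}
        )
```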

Each agent’s observation $O_i$, action space $A_i$, and reward function $R_i$ are defined for the MARL setup, with explicit structure ensuring that the output of one module is the input to the next.

2. Joint Multi-Agent Reinforcement Learning Optimization

Optimization uses MAPPO, with all agents sharing a global reward that directly reflects the answer quality (e.g., final F1 score) augmented by local penalty regularizers:

$$R_i = R_{\text{shared}} + P_i$$

where $R_{\text{shared}}$ is the task's overall F1 score and $P_i$ penalizes undesirable behaviors such as excessive sub-question production, redundant document selection, or overly verbose answers.
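
The following minimal sketch shows one way this reward decomposition could be computed, using a token-level F1 as the shared reward; the helper names and the assumption that penalties are non-positive are illustrative, not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (shared reward R_shared)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def agent_reward(prediction: str, gold: str, penalty: float) -> float:
    """R_i = R_shared + P_i, where penalty (P_i) is assumed non-positive."""
    return token_f1(prediction, gold) + penalty
```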

The training loss combines actor and critic losses:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{Actor}}(\theta) + \alpha \cdot \mathcal{L}_{\text{Critic}}(\phi)$$

The PPO actor loss is defined as a sum over all agents and time steps:

$$\mathcal{L}_{\text{Actor}}(\theta) = \sum_i \sum_t \min \Big[ r_t^i \hat{A}_t^i,\; \text{clip}\big(r_t^i, 1-\epsilon, 1+\epsilon\big) \hat{A}_t^i \Big]$$

with the importance sampling ratio $r_t^i$:

$$r_t^i = \frac{\pi_{\theta}(a_t^i \mid s_t^i)}{\pi_{\theta_{\text{old}}}(a_t^i \mid s_t^i)}$$

and generalized advantage estimation (GAE) for $\hat{A}_t^i$:

$$\hat{A}_t^i = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^i, \qquad \delta_t^i = R(s_t^i, a_t^i) + \gamma V_{\phi}(s_{t+1}^i) - V_{\phi}(s_t^i)$$

Terminal step rewards further include a divergence penalty ensuring the RL policy does not drift excessively from warm-start behavior initialized by supervised fine-tuning.
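
For clarity, the sketch below reproduces the GAE recursion and the clipped surrogate term for a single agent's trajectory in plain NumPy; it is a minimal illustration of the equations above, not the MAPPO training code shipped with MMOA-RAG.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one agent's trajectory.

    rewards has length T; values has length T + 1 (includes the bootstrap V(s_{T+1}))."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t^i
        gae = delta + gamma * lam * gae                          # discounted sum of deltas
        advantages[t] = gae
    return advantages


def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate term summed over time steps for one agent.

    Note: PPO maximizes this quantity; implementations typically minimize its negative."""
    ratio = np.exp(logp_new - logp_old)                          # r_t^i
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.sum(np.minimum(ratio * advantages, clipped * advantages))
```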

3. Empirical Evaluation and Ablation Studies

MMOA-RAG was evaluated on several QA datasets (HotpotQA, 2WikiMultiHopQA, AmbigQA) against a comprehensive battery of baselines: vanilla LLM, standard RAG (with and without SFT), and recent multi-component alternatives (SELF-RAG, RetRobust, Rewrite-Retrieve-Read, and BGM). Across all benchmarks, MMOA-RAG achieved superior F1, EM, and accuracy:

  • On HotpotQA: F1 improved by approximately 1.80 points over the best baseline.
  • On 2WikiMultiHopQA and other multi-hop settings: gains were sustained or increased, especially where multi-hop reasoning amplifies the benefits of module coordination.

Ablation experiments confirm that full three-agent optimization (QR+S+G) yields the best results; removing any module degrades system performance, and two-agent (e.g., S+G) variants provide intermediate improvement. Additionally, generalization to new domain/architectural configurations was validated.

4. Adaptability and Modular Extension

The MMOA-RAG design supports adaptation to pipelines with non-standard components. By modeling each additional component as another RL agent, e.g., a reranker, a more granular selector, or a multi-hop expansion module, the system can be extended without changes to the central MARL mechanism. The framework accommodates removal or substitution of individual modules (e.g., two-agent versions) and demonstrates consistent gains even when the Retriever backend is swapped or the Generator is replaced.
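
As an illustration of this extensibility, the hypothetical `RerankerAgent` below exposes the same `act(observation)` interface assumed in the earlier pipeline sketch, so it could be slotted in between retrieval and selection without touching the MARL machinery; it is not a component of the released code.

```python
class RerankerAgent:
    """Hypothetical extra agent: reorders candidate documents before selection.

    Because it exposes the same act(observation) interface as the other agents,
    it can join the shared-reward MAPPO training loop unchanged."""

    def __init__(self, scoring_model):
        self.scoring_model = scoring_model   # assumed to expose score(question, doc)

    def act(self, observation: dict) -> list:
        question, docs = observation["question"], observation["docs"]
        scores = [self.scoring_model.score(question, d) for d in docs]
        ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
        return [doc for _, doc in ranked]
```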

Penalties are modular: for example, $P_{\text{QR}}$ for the number of sub-questions, $P_{\text{S}}$ for duplicate or irrelevant selection, and $P_{\text{G}}$ for generation verbosity. This allows for system tuning under varying operational constraints or dataset idiosyncrasies.
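
A minimal sketch of such modular penalty terms follows; the thresholds and weights are hypothetical tuning knobs, not values from the paper.

```python
def penalty_qr(sub_questions, max_subq=4, weight=0.1):
    """P_QR: penalize producing more sub-questions than a budget allows."""
    return -weight * max(0, len(sub_questions) - max_subq)


def penalty_s(selected_docs, weight=0.1):
    """P_S: penalize duplicate selections (irrelevance would need an extra check)."""
    duplicates = len(selected_docs) - len(set(selected_docs))
    return -weight * duplicates


def penalty_g(answer_tokens, max_len=64, weight=0.01):
    """P_G: penalize overly verbose answers."""
    return -weight * max(0, len(answer_tokens) - max_len)
```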

5. Implementation Details and Resource Requirements

The implementation is available at https://github.com/chenyiqun/MMOA-RAG. The code includes:

  • Supervised warm-start procedures for agent initialization.
  • Contriever retrieval model configuration.
  • Shared parameter mechanisms to efficiently run multiple agents on a single LLM instance (critical for resource-limited scenarios).
  • MAPPO training routines with user-settable hyperparameters (learning rate, batch size, clip range, buffer size, etc.); an example configuration is sketched after this list.
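
For orientation, a hypothetical configuration in the style such a routine might accept is sketched below; the parameter names and values are illustrative assumptions, not the repository's defaults.

```python
# Hypothetical MAPPO training configuration; names and values are
# illustrative, not the defaults shipped with the MMOA-RAG repository.
mappo_config = {
    "actor_lr": 1e-6,          # learning rate for the shared policy LLM
    "critic_lr": 5e-6,         # learning rate for the value head
    "batch_size": 64,          # trajectories per update
    "ppo_clip_range": 0.2,     # epsilon in the clipped surrogate objective
    "gamma": 0.99,             # discount factor
    "gae_lambda": 0.95,        # GAE lambda
    "rollout_buffer_size": 512,
    "critic_loss_coef": 0.5,   # alpha weighting L_Critic in the joint loss
    "kl_penalty_coef": 0.05,   # divergence penalty w.r.t. the SFT warm start
}
```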

Computational requirements are in line with typical RL-based LLM pipelines—single or distributed GPU nodes, with agent parameter sharing for reduced memory overhead. Training stability benefits from actor-critic separation and reward clipping.

6. Comparative Advantages and State-of-the-Art Claims

MMOA-RAG establishes a new state-of-the-art among modular RAG optimization strategies. By addressing inter-module misalignment and enforcing cooperation toward the final answer, it achieves not only absolute gains in answer F1 but also superior robustness, generalizability to novel queries or domains, and performance in out-of-domain settings—outperforming both established supervised and modern RL-based baselines.

A notable practical implication is that the MARL formulation enables systematic troubleshooting: contributions of individual modules are directly measurable, and the framework supports granular hyperparameter and penalty optimization depending on desired trade-offs between answer quality, computational budget, and response latency.

7. Summary Table: MMOA-RAG Functional Modules and Optimization Strategy

| Agent/Module | Input | Output | RL Optimization | Penalty Term |
|---|---|---|---|---|
| Query Rewriter | $q$, $Prompt_{\text{QR}}$ | sub-questions | Yes | Excessive sub-questions |
| Retriever | sub-questions | candidate docs | No | n/a |
| Selector | $q$, $Prompt_{\text{S}}$, docs | selected doc subset | Yes | Duplicates, verbosity |
| Generator | $q$, $Prompt_{\text{G}}$, docs | predicted answer | Yes | Output length, format |

All RL agents are optimized with a global F1-reward and local penalties.


MMOA-RAG establishes a principled, end-to-end learning mechanism for complex RAG pipelines by leveraging multi-agent RL to coordinate all major modules. This MARL-driven joint optimization directly aligns the objective of every agent with the ultimate quality of model outputs, thus robustly boosting retrieval-augmented QA performance across a range of benchmarks and configurations (Chen et al., 25 Jan 2025).
