
MMOA-RAG: Joint Optimization for RAG Pipelines

Updated 5 October 2025
  • MMOA-RAG is a multi-module framework that leverages multi-agent reinforcement learning to treat each RAG component as an individual agent.
  • It employs MAPPO to coordinate query rewriting, document selection, and answer generation, addressing misaligned module objectives with a unified F1-based reward.
  • Empirical results show significant improvements in F1 scores on datasets like HotpotQA, demonstrating robust adaptability and modular extension capabilities.

A Multi-Module joint Optimization Algorithm for Retrieval-Augmented Generation (MMOA-RAG) is a framework that conceptualizes the standard RAG pipeline as a cooperative multi-agent system, with each key module—query rewriting, retrieval, document selection, and answer generation—treated as an individual reinforcement learning agent. The design addresses the central challenge of misaligned module-level objectives in traditional pipelines, where supervised fine-tuning of isolated components often fails to maximize end-to-end answer accuracy. Instead, MMOA-RAG uses multi-agent reinforcement learning (MARL), specifically multi-agent proximal policy optimization (MAPPO), to harmonize all agent objectives under a unified reward tied directly to the QA target metric (F1 score). This paradigm enables robust, adaptive optimization across the retrieval-augmented generation workflow.

1. Architectural Formulation and Agent Interaction

MMOA-RAG implements a full RAG stack as a sequence of interacting agents: the Query Rewriter, Selector, and Generator are treated as parameterized RL agents, while the Retriever is regarded as part of the static environment (i.e., a non-learning dense retriever using vector search). The workflow is defined as follows:

  • The Query Rewriter processes the initial question $q$ (given prompt $Prompt_{\text{QR}}$) and produces sub-questions $subq = \text{QR}(q, Prompt_{\text{QR}})$ to enhance retrievability and disambiguate complex queries.
  • The Retriever receives $subq$ and performs dense retrieval over the external corpus, returning $K$ candidate documents.
  • The Selector, based on the original question $q$, the prompt $Prompt_{\text{S}}$, and the candidate set $D$, outputs a refined subset $D_{\text{selected}} = \text{S}(q, Prompt_{\text{S}}, D)$, focusing the context for generation.
  • The Generator then produces the final answer from $D_{\text{selected}}$, $q$, and $Prompt_{\text{G}}$; a minimal code sketch of this workflow follows the list.
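
As a concrete illustration, the sketch below wires these four modules together as a sequential pipeline. The `Agent` interface, prompt placeholders, and `search` method are hypothetical assumptions used for exposition, not the actual MMOA-RAG code.

```python
from dataclasses import dataclass
from typing import List, Protocol


class Agent(Protocol):
    """Hypothetical interface: each learnable module maps an observation to an action."""
    def act(self, observation: dict) -> object: ...


@dataclass
class RAGPipeline:
    """Sketch of the MMOA-RAG module chain.

    query_rewriter, selector, and generator are learnable RL agents;
    retriever is a fixed (non-learning) dense retriever, i.e. part of the environment.
    """
    query_rewriter: Agent
    retriever: object          # not trained; assumed to expose a search(queries, k) method
    selector: Agent
    generator: Agent
    top_k: int = 10

    def answer(self, question: str) -> str:
        # Query Rewriter: decompose the question into sub-questions.
        sub_questions: List[str] = self.query_rewriter.act(
            {"question": question, "prompt": "PROMPT_QR"}
        )
        # Retriever: dense retrieval of K candidate documents for the sub-questions.
        candidate_docs: List[str] = self.retriever.search(sub_questions, k=self.top_k)
        # Selector: keep only the documents useful for answering the question.
        selected_docs: List[str] = self.selector.act(
            {"question": question, "prompt": "PROMPT_S", "docs": candidate_docs}
        )
        # Generator: produce the final answer from the selected context.
        return self.generator.act(
            {"question": question, "prompt": "PROMPT_G", "docs": selected_docs}
        )
```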

Each agent’s observation $O_i$, action space $A_i$, and reward function $R_i$ are defined for the MARL setup, with explicit structure ensuring that the output of one module is the input to the next.

2. Joint Multi-Agent Reinforcement Learning Optimization

Optimization uses MAPPO, with all agents sharing a global reward that directly reflects the answer quality (e.g., final F1 score) augmented by local penalty regularizers:

$$R_i = R_{\text{shared}} + P_i$$

where $R_{\text{shared}}$ is the task's overall F1 score and $P_i$ penalizes undesirable behaviors such as excessive sub-question production, redundant document selection, or overly verbose answers.
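
The following minimal sketch shows one way this reward decomposition could be computed, using a token-level F1 as the shared reward; the helper names and the assumption that penalties are non-positive are illustrative, not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (shared reward R_shared)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def agent_reward(prediction: str, gold: str, penalty: float) -> float:
    """R_i = R_shared + P_i, where penalty (P_i) is assumed non-positive."""
    return token_f1(prediction, gold) + penalty
```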

The training loss combines actor and critic losses:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{Actor}}(\theta) + \alpha \cdot \mathcal{L}_{\text{Critic}}(\phi)$$

The PPO actor loss is defined as a sum over all agents and time steps:

$$\mathcal{L}_{\text{Actor}}(\theta) = \sum_i \sum_t \min \Big[ r_t^i \hat{A}_t^i,\; \text{clip}\big(r_t^i, 1-\epsilon, 1+\epsilon\big) \hat{A}_t^i \Big]$$

with the importance sampling ratio $r_t^i$:

$$r_t^i = \frac{\pi_{\theta}(a_t^i \mid s_t^i)}{\pi_{\theta_{\text{old}}}(a_t^i \mid s_t^i)}$$

and generalized advantage estimation (GAE) for $\hat{A}_t^i$:

$$\hat{A}_t^i = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^i, \qquad \delta_t^i = R(s_t^i, a_t^i) + \gamma V_{\phi}(s_{t+1}^i) - V_{\phi}(s_t^i)$$

Terminal step rewards further include a divergence penalty ensuring the RL policy does not drift excessively from warm-start behavior initialized by supervised fine-tuning.
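
For clarity, the sketch below reproduces the GAE recursion and the clipped surrogate term for a single agent's trajectory in plain NumPy; it is a minimal illustration of the equations above, not the MAPPO training code shipped with MMOA-RAG.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one agent's trajectory.

    rewards has length T; values has length T + 1 (includes the bootstrap V(s_{T+1}))."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t^i
        gae = delta + gamma * lam * gae                          # discounted sum of deltas
        advantages[t] = gae
    return advantages


def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate term summed over time steps for one agent.

    Note: PPO maximizes this quantity; implementations typically minimize its negative."""
    ratio = np.exp(logp_new - logp_old)                          # r_t^i
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.sum(np.minimum(ratio * advantages, clipped * advantages))
```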

3. Empirical Evaluation and Ablation Studies

MMOA-RAG was evaluated on several QA datasets (HotpotQA, 2WikiMultiHopQA, AmbigQA) against a comprehensive battery of baselines: vanilla LLM, standard RAG (with and without SFT), and recent multi-component alternatives (SELF-RAG, RetRobust, Rewrite-Retrieve-Read, and BGM). Across all benchmarks, MMOA-RAG achieved superior F1, EM, and accuracy:

  • On HotpotQA: F1 improved by approximately 1.80 points over the best baseline.
  • On 2WikiMultiHopQA and other multi-hop settings: gains were sustained or increased, especially where multi-hop reasoning amplifies the benefits of module coordination.

Ablation experiments confirm that full three-agent optimization (QR+S+G) yields the best results; removing any module degrades system performance, and two-agent (e.g., S+G) variants provide intermediate improvement. Additionally, generalization to new domain/architectural configurations was validated.

4. Adaptability and Modular Extension

The MMOA-RAG design supports adaptation to pipelines with non-standard components. By modeling each additional component as another RL agent, e.g., a reranker, a more granular selector, or a multi-hop expansion module, the system can be extended without changes to the central MARL mechanism. The framework accommodates removal or substitution of individual modules (e.g., two-agent versions) and demonstrates consistent gains even when the Retriever backend is swapped or the Generator is replaced.
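
As an illustration of this extensibility, the hypothetical `RerankerAgent` below exposes the same `act(observation)` interface assumed in the earlier pipeline sketch, so it could be slotted in between retrieval and selection without touching the MARL machinery; it is not a component of the released code.

```python
class RerankerAgent:
    """Hypothetical extra agent: reorders candidate documents before selection.

    Because it exposes the same act(observation) interface as the other agents,
    it can join the shared-reward MAPPO training loop unchanged."""

    def __init__(self, scoring_model):
        self.scoring_model = scoring_model   # assumed to expose score(question, doc)

    def act(self, observation: dict) -> list:
        question, docs = observation["question"], observation["docs"]
        scores = [self.scoring_model.score(question, d) for d in docs]
        ranked = sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)
        return [doc for _, doc in ranked]
```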

Penalties are modular: for example, $P_{\text{QR}}$ for the number of sub-questions, $P_{\text{S}}$ for duplicate or irrelevant selection, and $P_{\text{G}}$ for generation verbosity. This allows for system tuning under varying operational constraints or dataset idiosyncrasies.
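
A minimal sketch of such modular penalty terms follows; the thresholds and weights are hypothetical tuning knobs, not values from the paper.

```python
def penalty_qr(sub_questions, max_subq=4, weight=0.1):
    """P_QR: penalize producing more sub-questions than a budget allows."""
    return -weight * max(0, len(sub_questions) - max_subq)


def penalty_s(selected_docs, weight=0.1):
    """P_S: penalize duplicate selections (irrelevance would need an extra check)."""
    duplicates = len(selected_docs) - len(set(selected_docs))
    return -weight * duplicates


def penalty_g(answer_tokens, max_len=64, weight=0.01):
    """P_G: penalize overly verbose answers."""
    return -weight * max(0, len(answer_tokens) - max_len)
```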

5. Implementation Details and Resource Requirements

The implementation is available at https://github.com/chenyiqun/MMOA-RAG. The code includes:

  • Supervised warm-start procedures for agent initialization.
  • Contriever retrieval model configuration.
  • Shared parameter mechanisms to efficiently run multiple agents on a single LLM instance (critical for resource-limited scenarios).
  • MAPPO training routines with user-settable hyperparameters (learning rate, batch size, clip range, buffer size, etc.); an example configuration is sketched after this list.
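
For orientation, a hypothetical configuration in the style such a routine might accept is sketched below; the parameter names and values are illustrative assumptions, not the repository's defaults.

```python
# Hypothetical MAPPO training configuration; names and values are
# illustrative, not the defaults shipped with the MMOA-RAG repository.
mappo_config = {
    "actor_lr": 1e-6,          # learning rate for the shared policy LLM
    "critic_lr": 5e-6,         # learning rate for the value head
    "batch_size": 64,          # trajectories per update
    "ppo_clip_range": 0.2,     # epsilon in the clipped surrogate objective
    "gamma": 0.99,             # discount factor
    "gae_lambda": 0.95,        # GAE lambda
    "rollout_buffer_size": 512,
    "critic_loss_coef": 0.5,   # alpha weighting L_Critic in the joint loss
    "kl_penalty_coef": 0.05,   # divergence penalty w.r.t. the SFT warm start
}
```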

Computational requirements are in line with typical RL-based LLM pipelines—single or distributed GPU nodes, with agent parameter sharing for reduced memory overhead. Training stability benefits from actor-critic separation and reward clipping.

6. Comparative Advantages and State-of-the-Art Claims

MMOA-RAG establishes a new state-of-the-art among modular RAG optimization strategies. By addressing inter-module misalignment and enforcing cooperation toward the final answer, it achieves not only absolute gains in answer F1 but also superior robustness, generalizability to novel queries or domains, and performance in out-of-domain settings—outperforming both established supervised and modern RL-based baselines.

A notable practical implication is that the MARL formulation enables systematic troubleshooting: contributions of individual modules are directly measurable, and the framework supports granular hyperparameter and penalty optimization depending on desired trade-offs between answer quality, computational budget, and response latency.

7. Summary Table: MMOA-RAG Functional Modules and Optimization Strategy

| Agent/Module | Input | Output | RL Optimization | Penalty Term |
|---|---|---|---|---|
| Query Rewriter | $q$, $Prompt_{\text{QR}}$ | sub-questions | Yes | Excessive sub-questions |
| Retriever | sub-questions | candidate docs | No | n/a |
| Selector | $q$, $Prompt_{\text{S}}$, docs | selected doc subset | Yes | Duplicates, verbosity |
| Generator | $q$, $Prompt_{\text{G}}$, docs | predicted answer | Yes | Output length, format |

All RL agents are optimized with a global F1-reward and local penalties.


MMOA-RAG establishes a principled, end-to-end learning mechanism for complex RAG pipelines by leveraging multi-agent RL to coordinate all major modules. This MARL-driven joint optimization directly aligns the objective of every agent with the ultimate quality of model outputs, thus robustly boosting retrieval-augmented QA performance across a range of benchmarks and configurations (Chen et al., 25 Jan 2025).
