Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs (2508.04660v1)
Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training LMs. However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
Summary
- The paper introduces mmGRPO, a multi-module extension of GRPO that addresses the credit assignment challenge in modular LM programs.
- It combines prompt optimization with policy gradient weight tuning in a staged approach, achieving notable performance gains.
- Experimental results demonstrate average gains of 11 points over the unadapted baseline across diverse tasks, validating the method's effectiveness.
Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
Introduction
The paper presents mmGRPO, a generalization of Group Relative Policy Optimization (GRPO) to modular language model (LM) programs. Modern NLP systems increasingly employ modular architectures in which multiple LM modules, each with distinct prompt templates and control flows, are orchestrated to solve complex tasks. While GRPO has demonstrated efficacy in single-stage LM fine-tuning, its extension to multi-module programs is non-trivial due to variable-length trajectories, interrupted executions, and disjoint intermediate states. The authors address these challenges by introducing mmGRPO, which enables policy gradient optimization at the module level, and demonstrate its integration with prompt optimization techniques for improved performance.
GRPO and Its Multi-module Extension
GRPO is an online policy gradient method that operates over groups of trajectories sharing the same input prompt. The objective upweights high-reward completions within a group, with PPO-style clipping and KL regularization to stabilize updates. In the single-module setting, each group consists of completions from a single LM call. The extension to multi-module programs requires grouping rollouts at the module level, aligning structurally comparable module calls across different trajectories, and handling variable invocation counts and control flow divergences.
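For reference, the single-module GRPO objective described above can be written in its standard form (generic notation from the GRPO literature, not symbols taken from this paper):

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\Big)\right],
$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio and $\hat{A}_i = (R_i - \mathrm{mean}(R_1,\dots,R_G)) / \mathrm{std}(R_1,\dots,R_G)$ is the group-relative advantage of completion $o_i$ among the $G$ completions sampled for the same prompt $q$.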
mmGRPO samples full program trajectories, aligns module calls across these trajectories, and forms GRPO groups for each module and relative invocation order. Each group contains input-output-reward triples for a specific module, with rewards propagated from the final program output. The GRPO loss is applied independently to each group, updating only the LM weights of the corresponding module. This design accommodates shared LM weights across modules and supports flexible training setups, including warm-starting from prompt-optimized programs or learning from teacher policies.
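The grouping step can be pictured with a short sketch; the rollout fields and the `(module name, invocation index)` keying below are illustrative data structures, not the paper's exact implementation:

```python
from collections import defaultdict

def form_module_groups(rollouts):
    """Group LM calls by (module name, relative invocation order) across rollouts.

    Each rollout is assumed to be a dict with:
      - "calls": an ordered list of (module_name, prompt, completion) triples
      - "reward": the scalar reward assigned to the final program output
    The final reward is propagated to every module call in the rollout.
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        seen = defaultdict(int)  # how often each module has fired in this rollout
        for module_name, prompt, completion in rollout["calls"]:
            key = (module_name, seen[module_name])  # align the k-th call of a module across rollouts
            groups[key].append(
                {"prompt": prompt, "completion": completion, "reward": rollout["reward"]}
            )
            seen[module_name] += 1
    return groups
```

Each resulting group then receives its own GRPO update, touching only the weights behind that module's LM.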
Composing Policy Gradients with Prompt Optimization
The paper leverages the BetterTogether framework to combine prompt optimization (PO) and online RL via mmGRPO. Specifically, prompt templates are first optimized using MIPROv2, a state-of-the-art prompt optimizer, and then mmGRPO is applied to the prompt-optimized program for weight tuning. This staged approach is shown to yield higher performance than either method alone, indicating complementary benefits of prompt and weight optimization in modular LM programs.
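A minimal DSPy-style sketch of this staged recipe is shown below. It assumes a configured LM, a `metric` callable, and a small `trainset`; the retrieval helper and the `dspy.GRPO` constructor arguments are placeholders rather than the paper's exact setup:

```python
import dspy

# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # an LM must be configured before compiling

def search(query: str) -> list[str]:
    """Hypothetical retrieval helper; replace with a real retriever."""
    return [f"context for: {query}"]

def metric(example, prediction, trace=None) -> float:
    """Hypothetical metric; replace with task-specific scoring."""
    return float(example.verdict == prediction.verdict)

class HopProgram(dspy.Module):
    """A two-module program: generate a search query, then judge the claim."""
    def __init__(self):
        super().__init__()
        self.generate_query = dspy.ChainOfThought("claim -> query")
        self.judge = dspy.ChainOfThought("claim, context -> verdict")

    def forward(self, claim):
        query = self.generate_query(claim=claim).query
        return self.judge(claim=claim, context=search(query))

# Placeholder training data: dspy.Example objects with `claim` and `verdict` fields.
trainset = [dspy.Example(claim="...", verdict="...").with_inputs("claim")]

# Stage 1: prompt optimization with MIPROv2.
program_po = dspy.MIPROv2(metric=metric).compile(HopProgram(), trainset=trainset)

# Stage 2: mmGRPO weight tuning, warm-started from the prompt-optimized program.
# Constructor and compile arguments are placeholders; consult the dspy.GRPO documentation.
program_rl = dspy.GRPO(metric=metric).compile(program_po, trainset=trainset)
```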
Experimental Evaluation
Experiments are conducted on three diverse LM program tasks: intent classification (Banking77), multi-hop claim verification (HoVer), and privacy-conscious delegation (PAPILLON). Two open-source LMs, llama3.1-8b-instruct and qwen3-8b, are used for evaluation. The DSPy framework and its RL training library, Arbor, are employed for implementation.
Key findings include:
- mmGRPO improves accuracy by 7 points on average over vanilla Chain-of-Thought (CoT) prompting.
- BetterTogether(PO, mmGRPO) yields an 11-point improvement over the unadapted baseline, 5 points over MIPROv2 alone, and 3 points over mmGRPO alone.
- MIPROv2 achieves competitive results with significantly lower computational cost (1.4 GPU-hours vs. 18.7 GPU-hours for mmGRPO).
- Prompt optimization is preferable for low-resource settings, while the combination with mmGRPO is optimal for maximal performance.
The experiments validate that mmGRPO effectively propagates final rewards across disjoint modules, enabling robust credit assignment without intermediate supervision. The staged integration with prompt optimization further enhances training signal quality and downstream performance.
Implementation Details
The mmGRPO algorithm is implemented as an optimizer in the DSPy library. Training employs LoRA for efficient adaptation, with module-level GRPO groups formed by aligning module invocations across sampled trajectories. Padding and diversity-selection strategies keep group sizes uniform and preserve within-group reward variance, improving generalization; a sketch of this step appears below. The approach supports sampling from teacher programs for off-policy training and is compatible with both single-module and multi-module LM programs.
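The exact padding and selection heuristics are not spelled out in this summary, but the intent can be illustrated with a small, assumption-laden sketch (group entries as produced by the grouping step above; `group_size >= 2` assumed):

```python
import random

def pad_and_select(group, group_size):
    """Illustrative padding / diversity selection for one module-level group.

    If the group is too large, keep a subset that retains reward spread
    (identical rewards make the group-relative advantage uninformative);
    if it is too small, pad by resampling so all groups have a uniform size.
    """
    if len(group) > group_size:
        by_reward = sorted(group, key=lambda ex: ex["reward"])
        selected = [by_reward[0], by_reward[-1]]  # keep lowest- and highest-reward examples
        rest = by_reward[1:-1]
        random.shuffle(rest)
        return selected + rest[: group_size - 2]
    padded = list(group)
    while len(padded) < group_size:
        padded.append(random.choice(group))
    return padded
```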
Pseudocode for mmGRPO is provided, detailing the sampling of rollouts, formation of module-level groups, and independent application of the GRPO loss to each group. Hyperparameters such as learning rate, batch size, and group size are configurable, and the algorithm is designed for extensibility to alternative multi-module RL formulations.
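For concreteness, the two per-group ingredients of that loss, group-relative advantages and a PPO-style clipped surrogate, look roughly as follows in a numpy sketch; the KL term and the token-level bookkeeping of the released dspy.GRPO optimizer are omitted:

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one module-level group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one completion's tokens.

    logp_new / logp_old: per-token log-probabilities under the current and
    sampling policies; advantage: the completion's scalar group-relative
    advantage, broadcast over its tokens.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped).mean()  # maximize this term per group member

# Example: one group of four completions sharing a final-output reward signal.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])            # approx. [1, -1, 1, -1]
term = clipped_surrogate([-1.2, -0.8], [-1.1, -0.9], adv[0])
```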
Limitations
The paper is limited to 8-billion-parameter LMs and LoRA-based fine-tuning, which may not generalize to larger models or full-parameter updates. Only one mmGRPO implementation is evaluated, and the classification task is studied in a limited-feedback setting without supervised labels. The results indicate that reward-based training does not yet match supervised performance on certain tasks.
Implications and Future Directions
The introduction of mmGRPO enables scalable online RL for modular LM programs, addressing the credit assignment problem in complex pipelines. The demonstrated complementarity of prompt and weight optimization suggests that future work should explore integrated optimization strategies in both offline and online settings. Potential directions include extending mmGRPO to larger models, investigating alternative module grouping strategies, and developing more efficient training algorithms for resource-constrained environments. The framework also opens avenues for privacy-preserving delegation and multi-hop reasoning in real-world AI systems.
Conclusion
mmGRPO generalizes GRPO to multi-module LM programs, enabling effective online weight optimization and robust credit assignment. Its integration with prompt optimization via BetterTogether yields superior performance across diverse tasks and models. The approach is open-sourced in DSPy, providing a practical tool for optimizing modular AI systems. The results underscore the value of combining policy gradient RL and prompt optimization, and motivate further research into scalable, integrated training methods for complex LM programs.
Follow-up Questions
- How does mmGRPO align variable-length module trajectories during training?
- What specific advantages does prompt optimization add when combined with policy gradients?
- How are rewards effectively propagated across disjoint module executions in the proposed framework?
- In what ways does the use of LoRA-based fine-tuning impact the scalability of mmGRPO?
Related Papers
- Group Robust Preference Optimization in Reward-free RLHF (2024)
- Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs (2024)
- Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together (2024)
- Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2025)
- CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models (2025)
- Bridging Offline and Online Reinforcement Learning for LLMs (2025)
- MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs (2025)
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning (2025)
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025)
- Geometric-Mean Policy Optimization (2025)