Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs (2508.04660v1)

Published 6 Aug 2025 in cs.CL

Abstract: Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training LMs. However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems. We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own. We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.


Summary

  • The paper presents mmGRPO, a generalization of GRPO for multi-module LM programs that integrates policy gradient updates with prompt optimization.
  • It achieves significant gains over baselines, improving by 11 points on average over the unadapted baseline when combined with prompt optimization.
  • The modular design enables independent module updates, with uniform credit assignment from the program-level reward and diversity-aware group formation, facilitating scalable deployment.

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Introduction

The paper presents mmGRPO, a generalization of Group Relative Policy Optimization (GRPO) for modular language model (LM) programs. Modern NLP systems increasingly employ modular architectures in which multiple LM calls, each with its own prompt template and control logic, are orchestrated to solve complex tasks. While GRPO has shown efficacy in single-stage LM fine-tuning, its extension to multi-module programs is non-trivial due to variable-length trajectories, interrupted executions, and disjoint intermediate states. The authors address these challenges by introducing mmGRPO, which enables policy gradient updates at the module level, and demonstrate its effectiveness both standalone and in combination with prompt optimization (PO) via the BetterTogether framework.

GRPO and Its Multi-module Extension

GRPO is an online policy gradient method that operates over groups of trajectories sharing the same input prompt. The objective upweights high-reward completions within a group, with PPO-style clipping and KL regularization for stability. In the multi-module setting, mmGRPO relaxes the requirement for shared inputs by grouping rollouts at the module level, aligning structurally comparable module calls across different trajectories. This allows for independent policy updates for each module, even when module-level inputs differ due to upstream context or control-flow divergence.
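For reference, the single-module GRPO objective takes roughly the following form; the notation follows common presentations of GRPO and may differ in minor details (e.g., how the KL penalty is estimated) from the paper's exact formulation:

```latex
% Group of G completions o_1..o_G sampled for prompt q, with rewards r_1..r_G.
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\!\left[
      \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
      \min\!\Big( \rho_{i,t} A_i,\;
                  \operatorname{clip}\big(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_i \Big)
      \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big]
    \right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}
                  {\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\, o_{i,<t})},
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}
           {\operatorname{std}(r_1,\dots,r_G)}.
```

mmGRPO applies this same objective per module-level group, where each group is built from structurally aligned calls to one module across rollouts rather than from completions of a single shared prompt.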

The mmGRPO loss is applied to each module-level group, updating only the LM weights of the corresponding module. Credit assignment is handled by propagating the final program-level reward to all module invocations within a trajectory, enabling learning without intermediate supervision. The algorithm supports sampling from both student and teacher programs, facilitating warm starts and off-policy training.
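A minimal sketch of this grouping and credit-assignment step, using hypothetical rollout records rather than the actual DSPy data structures:

```python
from collections import defaultdict

# Hypothetical rollout record (not the DSPy internals):
# {"reward": float, "calls": [{"module": str, "prompt": str, "completion": str}, ...]}

def build_module_groups(rollouts):
    """Group LM calls by module across rollouts; every call inherits the final
    program-level reward of its trajectory (uniform credit assignment)."""
    groups = defaultdict(list)
    for rollout in rollouts:
        for call in rollout["calls"]:
            groups[call["module"]].append({
                "prompt": call["prompt"],          # module-level input; may differ across rollouts
                "completion": call["completion"],  # module-level output to be scored
                "reward": rollout["reward"],       # trajectory-level reward shared by all calls
            })
    return groups

def group_relative_advantages(group, eps=1e-6):
    """Group-relative advantages for one module-level group."""
    rewards = [ex["reward"] for ex in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Each module's group then feeds the GRPO loss, and only that module's LM weights are updated.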

Experimental Setup

Experiments are conducted on three LM program tasks: intent classification (Banking77), multi-hop claim verification (HoVer), and privacy-conscious delegation (PAPILLON). Each task involves distinct reasoning styles and control flows, with programs implemented in DSPy and evaluated using open-source LMs: llama3.1-8b-instruct and qwen3-8b. Baselines include vanilla Chain-of-Thought (CoT) prompting and MIPROv2, a state-of-the-art prompt optimizer. mmGRPO is evaluated both standalone and in combination with PO via BetterTogether, where prompt templates are optimized first and weights are fine-tuned afterward.
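To make the setup concrete, a toy two-module DSPy program in the spirit of the many-hop search pipeline might look as follows; the signatures and the search helper are illustrative, not the programs used in the paper:

```python
import dspy

class ManyHopSearch(dspy.Module):
    """Toy multi-module program: one module writes search queries, another
    condenses retrieved passages into running notes. Illustrative only."""

    def __init__(self, search):
        super().__init__()
        self.search = search  # assumed callable: query string -> list of passages
        self.gen_query = dspy.ChainOfThought("claim, notes -> search_query")
        self.summarize = dspy.ChainOfThought("claim, passages -> notes")

    def forward(self, claim, hops=2):
        notes = ""
        for _ in range(hops):
            query = self.gen_query(claim=claim, notes=notes).search_query
            passages = self.search(query)
            notes = self.summarize(claim=claim, passages=passages).notes
        return dspy.Prediction(notes=notes)
```

Each dspy.ChainOfThought call is a distinct module with its own prompt template, so mmGRPO forms one group per module when aligning calls across rollouts.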

Training employs LoRA for efficient adaptation, with context lengths up to 8,192 tokens and batch sizes tailored to each LM. Rollouts are dynamically generated, and module-level GRPO groups are constructed by aligning module invocations across trajectories. Diversity within groups is maximized to improve generalization, following recent findings on variance-based selection.
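One way to realize variance-based selection is to rank candidate module-level groups by within-group reward variance and keep the most diverse ones; this is an illustrative heuristic, not necessarily the paper's exact procedure:

```python
def select_diverse_groups(groups, num_groups):
    """Keep the module-level groups with the highest reward variance; a group
    whose rewards are all equal yields zero advantages and thus no gradient signal."""
    def reward_variance(group):
        rewards = [ex["reward"] for ex in group]
        mean = sum(rewards) / len(rewards)
        return sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return sorted(groups, key=reward_variance, reverse=True)[:num_groups]
```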

Results and Analysis

mmGRPO consistently improves over vanilla CoT by an average of 7 points across tasks and models. BetterTogether(PO, mmGRPO) further improves by 11 points over the unadapted baseline, 5 points over MIPROv2, and 3 points over standalone mmGRPO. Notably, MIPROv2 achieves competitive gains at significantly lower computational cost (1.4 GPU-hours versus 18.7 for mmGRPO), highlighting the practicality of PO for resource-constrained settings.

The combination of PO and policy gradient RL yields the strongest overall performance, validating the complementary nature of prompt and weight optimization in modular LM programs. The results demonstrate that high-quality rollouts from PO provide a robust training signal for subsequent RL-based weight tuning, especially in online settings.

Implementation Considerations

  • Modularity: mmGRPO is designed to be modular, allowing independent updates to each LM module. In practice, the same LM weights are often shared for deployment efficiency.
  • Credit Assignment: Uniform credit assignment via final program-level reward is effective, but more granular reward propagation could further improve learning in complex programs.
  • Diversity in Groups: Maximizing reward variance within GRPO groups enhances generalization, suggesting that future implementations should incorporate diversity-promoting sampling strategies.
  • Resource Requirements: LoRA-based fine-tuning is efficient but may limit performance compared to full-parameter updates. Scaling to larger models and longer contexts remains an open challenge.
  • Deployment: The open-source DSPy implementation (dspy.GRPO) facilitates integration into arbitrary compound AI systems, supporting both online and offline training regimes; a minimal usage sketch follows this list.
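A minimal sketch of wiring the optimizer into a program is shown below; the constructor and compile arguments are assumptions for illustration, and the actual dspy.GRPO signature should be taken from the DSPy documentation:

```python
import dspy

# Any LM supported by dspy.LM would do; the model identifier is a placeholder.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

def toy_search(query):
    # Stand-in retriever so the sketch is self-contained.
    return [f"(retrieved passage for: {query})"]

def reward(example, pred, trace=None):
    # Program-level reward used for uniform credit assignment across module calls.
    return float(example.answer.lower() in pred.notes.lower())

trainset = [dspy.Example(claim="...", answer="...").with_inputs("claim")]

program = ManyHopSearch(search=toy_search)   # program from the earlier sketch
optimizer = dspy.GRPO(metric=reward)         # hypothetical arguments
trained_program = optimizer.compile(program, trainset=trainset)
```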

Limitations

The paper is limited to 8B-parameter LMs and LoRA-based adaptation. Only one mmGRPO formulation is evaluated, and the classification task is studied in a limited-feedback setting without supervised labels. The generalizability to larger models, alternative RL formulations, and tasks with richer feedback remains to be explored.

Implications and Future Directions

The work establishes a strong baseline for online RL in modular LM programs, demonstrating that policy gradient methods can be effectively composed with prompt optimization. The findings suggest that future research should explore more sophisticated credit assignment mechanisms, scaling to larger models, and integration with offline RL techniques. The modularity of mmGRPO opens avenues for privacy-preserving delegation, context management in RAG pipelines, and interpretable program optimization.

Further investigation into diversity-promoting group formation, adaptive reward propagation, and hybrid optimization strategies could yield additional gains. The open-source release in DSPy provides a foundation for community-driven experimentation and deployment in real-world AI systems.

Conclusion

mmGRPO extends GRPO to multi-module LM programs, enabling effective online weight optimization via module-level policy gradients. The approach consistently outperforms standard baselines and, when combined with prompt optimization, achieves the best overall results. The complementary relationship between prompt and weight optimization is validated in the online RL setting, and the modular design facilitates practical deployment in complex AI pipelines. Future work should address scaling, alternative RL formulations, and richer feedback mechanisms to further advance the optimization of modular LM programs.
