- The paper presents mmGRPO, a generalization of GRPO for multi-module LM programs that enables module-level policy gradient updates and composes with prompt optimization.
- It achieves significant performance gains over baselines, improving by 11 points over unadapted programs when combined with prompt optimization.
- The modular design enables independent module updates via uniform credit assignment and diversity-promoting group formation, supporting practical deployment in compound AI systems.
Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for LLM Programs
Introduction
The paper presents mmGRPO, a generalization of Group Relative Policy Optimization (GRPO) for modular language model (LM) programs. Modern NLP systems increasingly employ modular architectures in which multiple LM calls, each with its own prompt template and control logic, are orchestrated to solve complex tasks. While GRPO has shown efficacy in single-stage LM fine-tuning, extending it to multi-module programs is non-trivial due to variable-length trajectories, interrupted executions, and disjoint intermediate states. The authors address these challenges by introducing mmGRPO, which enables policy gradient updates at the module level, and demonstrate its effectiveness both standalone and in combination with prompt optimization (PO) via the BetterTogether framework.
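To make "multi-module LM program" concrete, below is a minimal sketch of a two-module DSPy program in the spirit of the HoVer multi-hop pipeline. The `search` helper, the module names, and the signatures are illustrative assumptions for this summary, not code from the paper.

```python
import dspy

def search(query: str, k: int = 3) -> list[str]:
    """Placeholder retriever (assumed here); any real search backend could be swapped in."""
    return []

class MultiHopVerifier(dspy.Module):
    """Two LM modules with distinct prompt templates, composed by explicit control flow."""

    def __init__(self, hops: int = 2):
        super().__init__()
        self.hops = hops
        # Module 1: turn the claim plus accumulated context into the next search query.
        self.gen_query = dspy.ChainOfThought("claim, context -> query")
        # Module 2: judge the claim against all retrieved evidence.
        self.verify = dspy.ChainOfThought("claim, context -> verdict")

    def forward(self, claim: str):
        context: list[str] = []
        for _ in range(self.hops):
            query = self.gen_query(claim=claim, context=context).query
            context.extend(search(query))
        return self.verify(claim=claim, context=context)
```

Each call to `gen_query` and `verify` is a separate LM invocation with its own template; mmGRPO's goal is to update the weights behind each such module using only the final program-level reward.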
GRPO and Its Multi-module Extension
GRPO is an online policy gradient method that operates over groups of trajectories sharing the same input prompt. The objective upweights high-reward completions within a group, with PPO-style clipping and KL regularization for stability. In the multi-module setting, mmGRPO relaxes the requirement for shared inputs by grouping rollouts at the module level, aligning structurally comparable module calls across different trajectories. This allows for independent policy updates for each module, even when module-level inputs differ due to upstream context or control flow divergence.
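For reference, the standard single-module GRPO objective that the paper generalizes (written here in the usual DeepSeekMath-style notation, which may differ cosmetically from the paper's) normalizes rewards within a group of $G$ completions $o_1, \dots, o_G$ sampled for the same prompt $q$:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},
$$

$$
\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( \rho_{i,t} \hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_i \big) - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big] \Big) \right].
$$

In mmGRPO, the same loss form is applied, but the group index $i$ runs over structurally aligned calls to a single module across different trajectories rather than over completions of one shared prompt.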
The mmGRPO loss is applied to each module-level group, updating only the LM weights of the corresponding module. Credit assignment is handled by propagating the final program-level reward to all module invocations within a trajectory, enabling learning without intermediate supervision. The algorithm supports sampling from both student and teacher programs, facilitating warm-starts and off-policy training.
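As a rough illustration of module-level grouping and uniform credit assignment, the sketch below uses data structures of my own invention (not the paper's code): every module call in a trajectory inherits that trajectory's final reward, and advantages are normalized within each module's group.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class ModuleCall:
    module: str       # which LM module produced this call
    prompt: str       # module-level input (may differ across trajectories)
    completion: str   # sampled output

@dataclass
class Trajectory:
    calls: list[ModuleCall]
    reward: float     # final program-level reward; no intermediate supervision

def form_module_groups(trajectories: list[Trajectory]):
    """Group module calls across trajectories; each call inherits its trajectory's reward."""
    groups: dict[str, list[tuple[ModuleCall, float]]] = defaultdict(list)
    for traj in trajectories:
        for call in traj.calls:
            groups[call.module].append((call, traj.reward))
    return groups

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization of rewards within one module-level group."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

The GRPO-style loss for a given module would then be computed only over that module's group, touching only that module's LM weights.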
Experimental Setup
Experiments are conducted on three LM program tasks: intent classification (Banking77), multi-hop claim verification (HoVer), and privacy-conscious delegation (PAPILLON). Each task involves distinct reasoning styles and control flows, with programs implemented in DSPy and evaluated using open-source LMs: llama3.1-8b-instruct and qwen3-8b. Baselines include vanilla Chain-of-Thought (CoT) prompting and MIPROv2, a state-of-the-art prompt optimizer. mmGRPO is evaluated both standalone and in combination with PO via BetterTogether, where prompt templates are first optimized and then weights are fine-tuned.
Training employs LoRA for efficient adaptation, with context lengths up to 8,192 tokens and batch sizes tailored to each LM. Rollouts are dynamically generated, and module-level GRPO groups are constructed by aligning module invocations across trajectories. Diversity within groups is maximized to improve generalization, following recent findings on variance-based selection.
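One simple way to realize the reward-variance criterion, reusing the `ModuleCall` tuples from the earlier sketch, is to pick group members alternately from the top and bottom of the reward ranking so the group spans the observed reward range. This is an illustrative heuristic, not necessarily the paper's exact selection rule.

```python
def select_diverse_group(candidates: list[tuple[ModuleCall, float]], group_size: int):
    """Pick a fixed-size group whose rewards span the observed range,
    a simple stand-in for variance-maximizing group selection."""
    ranked = sorted(candidates, key=lambda item: item[1])  # sort by reward
    target = min(group_size, len(ranked))
    group, lo, hi = [], 0, len(ranked) - 1
    while len(group) < target:
        group.append(ranked[hi]); hi -= 1        # highest remaining reward
        if len(group) < target:
            group.append(ranked[lo]); lo += 1    # lowest remaining reward
    return group
```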
Results and Analysis
mmGRPO consistently improves over vanilla CoT by an average of 7 points across tasks and models. BetterTogether(PO, mmGRPO) further improves by 11 points over the unadapted baseline, 5 over MIPROv2, and 3 over standalone mmGRPO. Notably, MIPROv2 achieves competitive gains at significantly lower computational cost (1.4 GPU-hours versus 18.7 for mmGRPO), highlighting the practicality of PO for resource-constrained settings.
The combination of PO and policy gradient RL yields the strongest overall performance, validating the complementary nature of prompt and weight optimization in modular LM programs. The results demonstrate that high-quality rollouts from PO provide a robust training signal for subsequent RL-based weight tuning, especially in online settings.
Implementation Considerations
- Modularity: mmGRPO is designed to be modular, allowing independent updates to each LM module. In practice, the same LM weights are often shared for deployment efficiency.
- Credit Assignment: Uniform credit assignment via the final program-level reward is effective, but more granular reward propagation could further improve learning in complex programs (see the sketch after this list).
- Diversity in Groups: Maximizing reward variance within GRPO groups enhances generalization, suggesting that future implementations should incorporate diversity-promoting sampling strategies.
- Resource Requirements: LoRA-based fine-tuning is efficient but may limit performance compared to full-parameter updates. Scaling to larger models and longer contexts remains an open challenge.
- Deployment: The open-source DSPy implementation (dspy.GRPO) facilitates integration into arbitrary compound AI systems, supporting both online and offline training regimes.
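As a hedged illustration of the "more granular reward propagation" idea from the credit-assignment bullet above (this is not part of mmGRPO; it reuses the `Trajectory` type from the earlier sketch), one could discount the final reward backwards so that later modules receive slightly more credit than earlier ones:

```python
def discounted_module_rewards(trajectory: Trajectory, gamma: float = 0.9) -> list[float]:
    """Hypothetical alternative to uniform credit assignment: discount the final
    program-level reward backwards through the trajectory's module calls."""
    n = len(trajectory.calls)
    return [trajectory.reward * (gamma ** (n - 1 - i)) for i in range(n)]
```

Whether such schemes actually help in practice is an open question the paper leaves for future work.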
Limitations
The paper is limited to 8B-parameter LMs and LoRA-based adaptation. Only one mmGRPO formulation is evaluated, and the classification task is studied in a limited-feedback setting without supervised labels. The generalizability to larger models, alternative RL formulations, and tasks with richer feedback remains to be explored.
Implications and Future Directions
The work establishes a strong baseline for online RL in modular LM programs, demonstrating that policy gradient methods can be effectively composed with prompt optimization. The findings suggest that future research should explore more sophisticated credit assignment mechanisms, scaling to larger models, and integration with offline RL techniques. The modularity of mmGRPO opens avenues for privacy-preserving delegation, context management in RAG pipelines, and interpretable program optimization.
Further investigation into diversity-promoting group formation, adaptive reward propagation, and hybrid optimization strategies could yield additional gains. The open-source release in DSPy provides a foundation for community-driven experimentation and deployment in real-world AI systems.
Conclusion
mmGRPO extends GRPO to multi-module LM programs, enabling effective online weight optimization via module-level policy gradients. The approach consistently outperforms standard baselines and, when combined with prompt optimization, achieves the best overall results. The complementary relationship between prompt and weight optimization is validated in the online RL setting, and the modular design facilitates practical deployment in complex AI pipelines. Future work should address scaling, alternative RL formulations, and richer feedback mechanisms to further advance the optimization of modular LM programs.