- The paper proposes an ensemble-based framework leveraging diverse reward shaping to enhance zero-shot coordination in multi-agent reinforcement learning.
- It evaluates four selection strategies, including LLM-based and surrogate network approaches, to optimize policy diversity in the Overcooked environment.
- Empirical results demonstrate 60–100% performance improvements over baselines, highlighting the impact of reward diversity on coordination efficiency.
Zero-Shot Coordination for Sparse Reward Tasks with Diverse Reward Shapings
Introduction
This paper addresses the limitations of traditional Zero-Shot Coordination (ZSC) in Multi-Agent Reinforcement Learning (MARL), particularly the assumption that cooperating agents share identical reward functions, including both sparse objectives and dense reward shaping. ZSC conventionally focuses on learning robust agents that generalize to unknown partners trained under the same reward structure. However, in practical deployments, reward shaping schemes are almost always diverse across developers, training sites, or even runs. This diversity is rarely considered, despite its potential to induce incompatible conventions. The authors propose and systematically evaluate methods for training agents capable of effective zero-shot coordination in the presence of diverse reward shapings, using the Overcooked environment as a benchmark.

Figure 1: Three Overcooked environments of varying coordination demands, illustrating the complexity of evaluating agent generalization under varied reward shapings.
The work formalizes the problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) and adapts the classic cross-play (XP) loss as the ZSC performance metric. The key challenge is to produce policy populations that coordinate effectively with policies derived from differing reward shapings—even if the underlying sparse objective remains the same. In Overcooked, reward shaping is parameterized as a vector of weights on key environmental features (e.g., placements, pickups, proximity distances).
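To make the parameterization concrete, the following minimal sketch computes a shaped reward as the sparse task reward plus a weighted sum of per-step feature values. The feature names, weight values, and the additive form are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict


def shaped_reward(sparse_reward: float,
                  feature_values: Dict[str, float],
                  shaping_weights: Dict[str, float]) -> float:
    """Return the sparse reward plus a weighted sum of shaping features.

    Features without a weight contribute nothing (weight 0.0).
    """
    bonus = sum(shaping_weights.get(name, 0.0) * value
                for name, value in feature_values.items())
    return sparse_reward + bonus


# Hypothetical shaping vector: one weight per environmental feature.
weights = {"onion_pickup": 0.5, "pot_placement": 1.0, "dist_to_pot": -0.25}

# Hypothetical feature values observed on a single step.
step_features = {"onion_pickup": 1.0, "pot_placement": 0.0, "dist_to_pot": 1.0}

r = shaped_reward(0.0, step_features, weights)
```

Under this view, two developers who share the sparse objective but choose different `weights` train against different dense signals, which is the heterogeneity the paper targets.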
Reward Shaping Diversity: Selection Methods
The primary contribution is an ensemble-based framework for learning ZSC-capable agents using populations trained under parametrically diverse reward shapings. Four selection strategies for shaping vectors are analyzed:
- LLM-Based Selection: A prompt-driven approach using an LLM (Claude Sonnet 4.5) to propose reward shaping sets, conditioned on past training outcomes and a diversity requirement.
- Surrogate Network: A supervised MLP trained on prior shaping-performance pairs to predict high-performing shapings, further screening for diversity among top selections.
- Stratified Grid: Latin Hypercube Sampling stratifies the reward shaping space, maximizing coverage and variance across features.
- Random Selection: Uniform random sampling of shaping weights across the permissible range.
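The two sampling-based strategies can be contrasted with a short sketch. It uses only the Python standard library; the function names, the bounds, and the choice of one uniformly placed point per stratum are assumptions made for illustration and may differ from the authors' implementation.

```python
import random
from typing import List, Tuple

Bounds = List[Tuple[float, float]]  # (low, high) per shaping feature


def random_shapings(n: int, bounds: Bounds, rng: random.Random) -> List[List[float]]:
    """Draw n shaping vectors uniformly at random within the bounds."""
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]


def latin_hypercube_shapings(n: int, bounds: Bounds, rng: random.Random) -> List[List[float]]:
    """Draw n shaping vectors so each feature's range is split into n
    equal strata and every stratum is hit exactly once."""
    columns = []
    for lo, hi in bounds:
        # One point inside each of the n strata of this feature.
        points = [lo + (hi - lo) * (k + rng.random()) / n for k in range(n)]
        # Shuffle so strata are paired randomly across features.
        rng.shuffle(points)
        columns.append(points)
    # Transpose columns (per feature) into rows (per shaping vector).
    return [list(row) for row in zip(*columns)]


rng = random.Random(0)
bounds = [(0.0, 1.0), (-2.0, 2.0), (0.0, 5.0)]  # hypothetical weight ranges
lhs_set = latin_hypercube_shapings(4, bounds, rng)
random_set = random_shapings(4, bounds, rng)
```

The stratified sample guarantees per-feature coverage at any population size, whereas a small uniform sample can leave whole regions of a feature's range unvisited.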
The resulting population of Best Response (BR) agents for each shaping set is ensembled via mode voting over the individual agents' actions during test-time coordination.
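Mode voting can be sketched as follows: every agent in the population proposes an action for the current observation, and the most frequent proposal is executed. The policy interface (a callable from observation to action) and the tie-breaking rule (earliest voter wins) are assumptions; the source does not specify either.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

Policy = Callable[[object], Hashable]  # hypothetical interface: obs -> action


def mode_vote(policies: Sequence[Policy], observation: object) -> Hashable:
    """Return the action proposed by the most policies.

    Ties are broken in favor of the action voted for earliest in the
    policy order (an assumption, not taken from the paper).
    """
    if not policies:
        raise ValueError("at least one policy is required")
    votes = [policy(observation) for policy in policies]
    counts = Counter(votes)
    best = max(counts.values())
    for action in votes:
        if counts[action] == best:
            return action


# Three stand-in agents that ignore the observation.
ensemble = [lambda obs: "up", lambda obs: "left", lambda obs: "up"]
chosen = mode_vote(ensemble, observation=None)
```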
Experimental Framework
The Overcooked environment is used for benchmarking, focusing on three layouts—Random0_Medium, Random3, and Unident_S—spanning the spectrum from loosely to tightly coupled cooperative dynamics. The evaluation regime consists of training ensembles (and baselines) for 100 million steps using Trajectory Diversity (TrajeDi) or baseline MARL algorithms. Zero-shot testing measures average cross-play reward with partners implemented via MAPPO policies trained under random, unknown reward shapings.
Empirical Results
Comprehensive experiments demonstrate that all proposed selection methods—particularly Surrogate Network and Stratified Grid—yield consistent and substantial gains in both sparse and shaped rewards compared to single-shaping and existing ZSC baselines across all Overcooked layouts.
Figure 2: Sparse reward learning curves during training for each selection method, showing diverse performance trajectories across different shaping-driven BR populations.
Quantitatively, the relative improvement in sparse reward over baseline ZSC architectures is often in the 60–100% range. Stratified Grid yields the highest coverage of the shaping space and achieves the highest mean test reward; Surrogate Network, despite generating less diverse shapings (lower average standard deviation), performs nearly as well, indicating that coverage per se is not always the dominant factor when high-performing shaping clusters are discoverable.
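The two quantities used in this comparison can be computed as in the sketch below. It assumes that diversity is the population standard deviation of each shaping feature, averaged over features, and that relative improvement is measured against the baseline's reward; the numbers are made up for illustration and are not results from the paper.

```python
from statistics import pstdev
from typing import Sequence


def average_feature_std(shapings: Sequence[Sequence[float]]) -> float:
    """Mean over features of the per-feature population standard deviation."""
    columns = list(zip(*shapings))  # one tuple of values per feature
    return sum(pstdev(col) for col in columns) / len(columns)


def relative_improvement(method_reward: float, baseline_reward: float) -> float:
    """Fractional gain of a method over a baseline (0.6 means +60%)."""
    return (method_reward - baseline_reward) / baseline_reward


# Illustrative values only.
shaping_set = [[0.0, 0.0], [2.0, 4.0]]
diversity = average_feature_std(shaping_set)
gain = relative_improvement(160.0, 100.0)
```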
The ensembled baseline (multiple BRs trained under the same shaping) also outperforms single-agent baselines, but is always significantly outperformed by ensembles that incorporate shaping diversity, highlighting the necessity of reward-level diversity for ZSC robustness.
Analysis of Shaping Diversity
An explicit ablation of the shaping selection methods reveals that high diversity (as measured by coverage and standard deviation across features) is beneficial but not sufficient; optimal performance results from a balance between representative coverage and selection of reward shapings likely to generate performant policies. LLM-based and Surrogate Network approaches, which explicitly or implicitly leverage training data to bias selections toward high-performing but distinct behavioral modes, are competitive with or superior to pure coverage maximization.
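One way to picture the balance between predicted performance and diversity is a greedy screen: rank candidate shapings by a predicted score, then accept them in order while rejecting any candidate too close to one already accepted. The source does not describe how the Surrogate Network screens for diversity, so the function name, the Euclidean distance criterion, and the threshold here are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[Sequence[float], float]  # (shaping vector, predicted score)


def select_diverse_top(candidates: Sequence[Candidate],
                       k: int,
                       min_distance: float) -> List[Sequence[float]]:
    """Greedily pick up to k high-scoring shapings that are pairwise at
    least min_distance apart in Euclidean distance."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected: List[Sequence[float]] = []
    for vector, _score in ranked:
        if len(selected) >= k:
            break
        if all(math.dist(vector, kept) >= min_distance for kept in selected):
            selected.append(vector)
    return selected


# Hypothetical candidates with surrogate-predicted scores.
pool = [((0.0, 0.0), 0.9), ((0.1, 0.0), 0.8), ((1.0, 1.0), 0.7), ((2.0, 2.0), 0.1)]
picked = select_diverse_top(pool, k=2, min_distance=0.5)
```

In this toy pool the second-ranked candidate is skipped because it nearly duplicates the top one, so the selection trades a little predicted score for a more distinct shaping.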
Qualitative LLM prompt analysis confirms that expert supervision (via prompting for diversity and informed priors) enhances policy variety while preserving task competence.
Implications and Future Directions
This work presents a robust, scalable methodology for training agents capable of zero-shot coordination with partners optimized for distinct reward shapings. It challenges the standard MARL/ZSC assumption of reward congruence and introduces techniques suited for more realistic, heterogeneous-agent multi-agent systems. Practically, this augments the reliability of coordination in domains where reward engineering heterogeneity is inevitable (e.g., human-AI teaming, cross-vendor autonomy).
On the theoretical front, this study invites reconsideration of diversity metrics in policy and reward space as joint objectives for robust generalization, and motivates deeper study of the interplay between reward shaping coverage and population behavioral diversity.
Potential future directions include integrating explicit diversity-promoting regularizers (as suggested for the Surrogate Network), application to richer continuous and high-dimensional environments, and joint optimization over shaping selection and policy learning loops.
Conclusion
Through an ensemble-based methodology leveraging shaping diversity, this paper demonstrates that ZSC in MARL can be made robust to heterogeneity in reward shaping, an essential consideration for real-world deployment. The primary technical finding is that population-based training over diverse shaping spaces yields the strongest zero-shot cross-play performance among the methods compared, with the best methods combining diversity sampling with reward-informed selection. This line of work establishes a foundation for future research in heterogeneous multi-agent generalization and coordination.