Cause of SMALL’s occasional outperformance of oracle-cost safe MARL methods
Determine whether the occasional outperformance of the SMALL-based algorithms (SMALL-MAPPO and SMALL-HAPPO) over safe multi-agent reinforcement learning baselines that use the ground-truth cost function (MAPPO-Lagrange and HAPPO-Lagrange) is caused by SMALL's cost prediction module misclassifying certain high-risk yet reward-improving actions, such as navigating close to hazardous zones, as acceptable non-violations.
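One way to probe this conjecture empirically is to replay evaluation rollouts from a trained SMALL policy and measure how often the learned cost predictor labels transitions that the ground-truth cost function marks as violations (e.g., steps inside or near hazard zones) as safe, then check whether those false negatives coincide with higher episode returns. The sketch below is a minimal diagnostic under that assumption; the field names (`true_costs`, `pred_costs`, `return`), the 0.5 decision threshold, and the logging format are all illustrative and not part of the paper.

```python
import numpy as np

def cost_false_negative_rate(true_costs, pred_costs, threshold=0.5):
    """Fraction of ground-truth violations (true cost > 0) that the
    learned cost model scores below `threshold`, i.e. treats as safe."""
    true_costs = np.asarray(true_costs, dtype=float)
    pred_costs = np.asarray(pred_costs, dtype=float)
    violations = true_costs > 0.0
    if not violations.any():
        return 0.0  # no true violations occurred in this batch
    return float((pred_costs[violations] < threshold).mean())

def fn_rate_vs_return(episodes, threshold=0.5):
    """Correlate per-episode miss rate with episode return.

    `episodes` is a list of dicts with hypothetical logged fields:
    'true_costs' and 'pred_costs' (per-transition arrays) and
    'return' (scalar episode return).
    """
    fn_rates = np.array([
        cost_false_negative_rate(ep["true_costs"], ep["pred_costs"], threshold)
        for ep in episodes
    ])
    returns = np.array([ep["return"] for ep in episodes], dtype=float)
    # A clearly positive Pearson correlation would support the conjecture:
    # episodes where more true violations slip past the cost predictor
    # also achieve higher returns than the oracle-cost baselines permit.
    corr = np.corrcoef(fn_rates, returns)[0, 1]
    return fn_rates, returns, corr
```

A complementary check is to compare realized ground-truth cost rates: if SMALL's return advantage vanishes once episodes containing true violations are filtered out, the outperformance is plausibly bought by missed violations rather than by genuinely better constrained optimization.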
References
It is noteworthy that our algorithms occasionally outperform others. We conjecture that this may stem from the cost prediction module incorrectly considering certain high-risk yet potentially beneficial actions (such as navigating close to hazardous zones) as acceptable.
— Safe Multi-agent Reinforcement Learning with Natural Language Constraints
(Wang et al., arXiv:2405.20018, 30 May 2024), Section "Experiments", Ablation Study, subsection "Comparison with algorithms that use the ground truth cost" (around Figure 4)