
Cause of SMALL’s occasional outperformance over oracle safe MARL methods

Determine whether the occasional outperformance of SMALL-based algorithms (SMALL-MAPPO and SMALL-HAPPO) relative to safe multi-agent reinforcement learning baselines that use ground-truth cost functions (MAPPO-Lagrange and HAPPO-Lagrange) is caused by the SMALL cost learning module misclassifying certain high-risk yet reward-improving actions (e.g., navigating near hazards) as acceptable non-violations.


Background

In comparisons against algorithms that have access to ground-truth cost functions (treated as oracles), the authors observe that SMALL-based methods occasionally achieve higher rewards despite learning costs from natural language rather than exact supervision.

The method’s cost prediction combines the similarity between constraint and observation embeddings with a binary violation flag produced by a decoder LLM; a sketch of this combination follows. The authors conjecture that the occasional outperformance may result from this cost module treating some risky but beneficial behaviors (such as moving near hazards) as non-violations, thereby permitting strategies that increase reward.
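For concreteness, here is a minimal sketch of such a two-part cost predictor. The specific gating rule, similarity measure, and threshold below are illustrative assumptions, not the paper's specification; the only grounded elements are that a constraint–observation embedding similarity and a decoder LLM's binary violation flag both feed the predicted cost.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def predict_cost(constraint_emb: np.ndarray,
                 obs_emb: np.ndarray,
                 llm_violation_flag: bool,
                 sim_threshold: float = 0.5) -> float:
    """Hypothetical two-part cost predictor: the embedding similarity scores
    how relevant the observation is to the language constraint, and the
    decoder LLM's binary flag gates whether any violation cost is emitted.
    The gating rule and threshold here are illustrative assumptions.

    Note the conjectured failure mode: a genuinely risky state yields cost 0
    whenever the flag (or the similarity gate) lets it through as safe.
    """
    sim = cosine_similarity(constraint_emb, obs_emb)
    if llm_violation_flag and sim > sim_threshold:
        return sim  # graded cost proportional to constraint relevance
    return 0.0      # treated as a non-violation
```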

Confirming or refuting this conjecture would clarify whether the observed performance gains reflect genuine robustness or artifacts of cost misclassification in the language-driven cost prediction pipeline.
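One way to probe the conjecture empirically: when logged per-state cost predictions and the environment's ground-truth costs are both available, the false-negative rate of the learned module on ground-truth-violating (e.g., near-hazard) states directly measures the hypothesized misclassification. The helper below is a hypothetical diagnostic and assumes such paired per-state logs exist.

```python
def false_negative_rate(predicted_costs, true_costs, eps: float = 1e-6) -> float:
    """Fraction of ground-truth violations (true cost > 0) that the learned
    module scored as safe (predicted cost ~ 0). Assumes paired per-state
    logs from rollouts; a high rate on near-hazard states would support
    the misclassification conjecture, while a low rate would refute it."""
    preds_at_violations = [p for p, t in zip(predicted_costs, true_costs) if t > eps]
    if not preds_at_violations:
        return 0.0  # no ground-truth violations observed
    missed = sum(1 for p in preds_at_violations if p <= eps)
    return missed / len(preds_at_violations)
```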

References

It is noteworthy that our algorithms occasionally outperform others. We conjecture that this may stem from the cost prediction module incorrectly considering certain high-risk yet potentially beneficial actions (such as navigating close to hazardous zones) as acceptable.

Safe Multi-agent Reinforcement Learning with Natural Language Constraints (arXiv:2405.20018, Wang et al., 30 May 2024), Section Experiments, Ablation Study, "Comparison with algorithms that use the ground truth cost" (around Figure 4)