
Cause of SMALL’s occasional outperformance over oracle safe MARL methods

Determine whether the occasional outperformance of SMALL-based algorithms (SMALL-MAPPO and SMALL-HAPPO) relative to safe multi-agent reinforcement learning baselines that use ground-truth cost functions (MAPPO-Lagrange and HAPPO-Lagrange) is caused by the SMALL cost learning module misclassifying certain high-risk yet reward-improving actions (e.g., navigating near hazards) as acceptable non-violations.


Background

In comparisons against algorithms that have access to ground-truth cost functions (treated as oracles), the authors observe that SMALL-based methods occasionally achieve higher rewards despite learning costs from natural language rather than exact supervision.

The method’s cost prediction combines the similarity between constraint and observation embeddings with a binary violation flag produced by a decoder LLM; a sketch of this combination follows. The authors conjecture that the occasional outperformance may result from this cost module treating some risky but beneficial behaviors (such as moving near hazards) as non-violations, thereby permitting strategies that increase reward.
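For concreteness, here is a minimal sketch of such a two-part cost predictor. The specific gating rule, similarity measure, and threshold below are illustrative assumptions, not the paper's specification; the only grounded elements are that a constraint–observation embedding similarity and a decoder LLM's binary violation flag both feed the predicted cost.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def predict_cost(constraint_emb: np.ndarray,
                 obs_emb: np.ndarray,
                 llm_violation_flag: bool,
                 sim_threshold: float = 0.5) -> float:
    """Hypothetical two-part cost predictor: the embedding similarity scores
    how relevant the observation is to the language constraint, and the
    decoder LLM's binary flag gates whether any violation cost is emitted.
    The gating rule and threshold here are illustrative assumptions.

    Note the conjectured failure mode: a genuinely risky state yields cost 0
    whenever the flag (or the similarity gate) lets it through as safe.
    """
    sim = cosine_similarity(constraint_emb, obs_emb)
    if llm_violation_flag and sim > sim_threshold:
        return sim  # graded cost proportional to constraint relevance
    return 0.0      # treated as a non-violation
```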

Confirming or refuting this conjecture would clarify whether the observed performance gains reflect genuine robustness or artifacts of cost misclassification in the language-driven cost prediction pipeline.
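One way to probe the conjecture empirically: when logged per-state cost predictions and the environment's ground-truth costs are both available, the false-negative rate of the learned module on ground-truth-violating (e.g., near-hazard) states directly measures the hypothesized misclassification. The helper below is a hypothetical diagnostic and assumes such paired per-state logs exist.

```python
def false_negative_rate(predicted_costs, true_costs, eps: float = 1e-6) -> float:
    """Fraction of ground-truth violations (true cost > 0) that the learned
    module scored as safe (predicted cost ~ 0). Assumes paired per-state
    logs from rollouts; a high rate on near-hazard states would support
    the misclassification conjecture, while a low rate would refute it."""
    preds_at_violations = [p for p, t in zip(predicted_costs, true_costs) if t > eps]
    if not preds_at_violations:
        return 0.0  # no ground-truth violations observed
    missed = sum(1 for p in preds_at_violations if p <= eps)
    return missed / len(preds_at_violations)
```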

References

It is noteworthy that our algorithms occasionally outperform others. We conjecture that this may stem from the cost prediction module incorrectly considering certain high-risk yet potentially beneficial actions (such as navigating close to hazardous zones) as acceptable.

Safe Multi-agent Reinforcement Learning with Natural Language Constraints (arXiv:2405.20018, Wang et al., 30 May 2024), Section Experiments, Ablation Study, "Comparison with algorithms that use the ground truth cost" (around Figure 4)