Monotonic safety degradation conjecture with outcome-dependent approval

Establish whether MONA’s safety degrades monotonically as the approval function becomes more dependent on achieved outcomes, that is, as approval moves along the approval-construction spectrum toward the outcome-dependent end.

Background

Appendix B.3 of the MONA paper describes a spectrum of approval constructions ranging from restrictive, prediction-based approval to outcome-dependent approval that approaches ordinary RL. The original work conjectures that safety degrades monotonically as approval moves along this spectrum toward the outcome-dependent end.

The follow-up paper by Heath (2026) restates that conjecture and builds experimental tools, based on learned approval proxies, intended to begin characterizing this spectrum empirically.
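As a rough illustration of what an empirical check of the conjecture could look like, the sketch below tests whether a series of safety measurements is non-increasing as outcome dependence grows. Everything in it is an assumption made for illustration: the function names, the idea of indexing the spectrum by a scalar "outcome-dependence level", the choice of a single scalar safety score, and the tolerance parameter are hypothetical and are not taken from either paper.

```python
from typing import List, Sequence, Tuple


def is_monotonically_nonincreasing(values: Sequence[float], tol: float = 0.0) -> bool:
    """Return True if no value exceeds its predecessor by more than tol."""
    return all(b <= a + tol for a, b in zip(values, values[1:]))


def monotonicity_violations(
    levels: Sequence[float],
    safety: Sequence[float],
    tol: float = 0.0,
) -> List[Tuple[float, float, float]]:
    """Find adjacent level pairs where safety increased by more than tol.

    levels: hypothetical scalar index of outcome dependence
            (larger means more outcome-dependent approval).
    safety: hypothetical safety score measured at each level
            (larger means safer).
    Returns a list of (lower_level, higher_level, increase) tuples.
    """
    # Sort by level so the input order does not matter.
    pairs = sorted(zip(levels, safety))
    violations = []
    for (l0, s0), (l1, s1) in zip(pairs, pairs[1:]):
        if s1 > s0 + tol:
            violations.append((l0, l1, s1 - s0))
    return violations
```

An empty result from `monotonicity_violations` would be consistent with the conjecture on the sampled levels, while any returned tuple marks a counterexample candidate. In practice a nonzero `tol` (or a proper statistical test across seeds) would likely be needed, since measured safety scores are noisy.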

References

The paper conjectures that safety degrades monotonically as approval moves toward the outcome-dependent end, but leaves empirical characterization of this spectrum to future work.

— Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation (2603.29993 - Heath, 31 Mar 2026) in Paragraph “The approval-spectrum conjecture,” Section 4 (Background: MONA and Camera Dropbox)
