Effect of approval-construction on MONA’s safety guarantees
Determine how the construction of the overseer approval function in MONA—particularly the extent to which approval depends on achieved outcomes—affects whether MONA’s safety guarantees against multi-step reward hacking hold.
References
The original paper identifies a critical open question: how the method of constructing approval, particularly the degree to which approval depends on achieved outcomes, affects whether MONA's safety guarantees hold.
— Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation
(2603.29993 - Heath, 31 Mar 2026) in Abstract