Effect of approval construction on MONA’s safety guarantees

Determine how the construction of the overseer approval function in MONA—particularly the extent to which approval depends on achieved outcomes—affects whether MONA’s safety guarantees against multi-step reward hacking hold.

Background

MONA combines myopic optimization with a non-myopic approval signal to mitigate multi-step reward hacking. The safety of this approach depends critically on how approval is constructed, especially how much it relies on realized outcomes versus overseer predictions.

The paper notes that the original MONA work explicitly raised this as an open question. It centers its experimental extension on operationalizing and probing this dependency through oracle, noisy, misspecified, and learned approval mechanisms.
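One way to make "degree of outcome dependence" concrete is a single weight that interpolates between the overseer's prediction and the realized outcome. The sketch below is illustrative only and is not taken from either paper: the function names, the convex-mix form, and the Gaussian noise model are all assumptions chosen for simplicity.

```python
import random


def blended_approval(predicted_value, realized_outcome, outcome_weight):
    """Hypothetical approval score (assumed form, not from the papers).

    A convex mix of the overseer's predicted value of an action and the
    outcome that was actually realized. outcome_weight = 0 uses only the
    prediction; outcome_weight = 1 uses only the realized outcome.
    """
    if not 0.0 <= outcome_weight <= 1.0:
        raise ValueError("outcome_weight must lie in [0, 1]")
    return (1.0 - outcome_weight) * predicted_value + outcome_weight * realized_outcome


def noisy_approval(base_approval, noise_std, rng):
    """Hypothetical noisy variant: add zero-mean Gaussian noise to a score."""
    return base_approval + rng.gauss(0.0, noise_std)


# Example: the overseer predicts 1.0, but the realized outcome is 0.0.
rng = random.Random(0)
for w in (0.0, 0.5, 1.0):
    score = blended_approval(predicted_value=1.0, realized_outcome=0.0, outcome_weight=w)
    print(w, score, noisy_approval(score, noise_std=0.1, rng=rng))
```

Under this assumed parameterization, sweeping `outcome_weight` from 0 to 1 moves approval from purely prediction-based to purely outcome-based, which is the axis along which the open question asks whether the safety guarantees continue to hold.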

References

The original paper identifies a critical open question: how the method of constructing approval, particularly the degree to which approval depends on achieved outcomes, affects whether MONA's safety guarantees hold.