Constructing Corrigible Goals That Do Not Incentivize Avoiding Updates or Shutdown
Construct goal specifications for AI agents that do not incentivize ignoring, avoiding, or sabotaging proper goal updates or shutdown, even when the specific mechanisms by which the agent could prevent or interfere with updates or shutdown cannot be fully enumerated or specified in advance.
References
The issue in this example [the paper's motivating example] is that an AI cannot achieve its goals when shut down, so it is incentivized to ignore, avoid, or sabotage shutdown requests. The open problem of corrigibility is to construct goals that do not incentivize such behavior even when we cannot specify all the ways the AI might do so.
                — Corrigibility Transformation: Constructing Goals That Accept Updates (arXiv:2510.15395, Hudson, 17 Oct 2025), Introduction