Learning and Optimizing a Human Utility Function for Shutdown Acceptance
Establish practical methods for specifying or learning a human utility function that an AI agent can optimize, so that shutdown-acceptance behavior arises from the agent's uncertainty over human preferences, as in the Off-Switch Game framework.
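To make the intended behavior concrete, below is a minimal sketch of the Off-Switch Game decision rule (Hadfield-Menell et al., 2017), assuming for illustration a Gaussian belief over the human's utility for a candidate action. The option names (`act`, `off`, `defer`) and the Monte Carlo helper are illustrative choices, not part of the cited paper; the point is that deferring to the human weakly dominates acting or shutting down unilaterally whenever the agent is uncertain about the learned utility function.

```python
import numpy as np

# Sketch of the Off-Switch Game decision rule. The robot considers an action
# whose true utility to the human, U, is unknown; it holds a belief over U
# (assumed Gaussian here purely for illustration).
# "act":   take the action directly               -> expected value E[U]
# "off":   shut itself down unilaterally          -> value 0
# "defer": propose the action and let the human decide; a rational human
#          allows it iff U >= 0, so the robot receives E[max(U, 0)].
# Since E[max(U, 0)] >= max(E[U], 0), deferring weakly dominates both
# alternatives, which is the incentive to accept shutdown.

def expected_values(mu, sigma, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of each option's value under belief U ~ N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(mu, sigma, n_samples)    # samples of the human's utility
    return {
        "act":   u.mean(),                  # act without consulting the human
        "off":   0.0,                       # switch itself off
        "defer": np.maximum(u, 0.0).mean()  # human vetoes exactly when U < 0
    }

if __name__ == "__main__":
    for mu, sigma in [(0.5, 0.1), (0.5, 2.0), (-0.5, 2.0)]:
        v = expected_values(mu, sigma)
        best = max(v, key=v.get)
        summary = ", ".join(f"{k}={x:.3f}" for k, x in v.items())
        print(f"belief N({mu}, {sigma}^2): {summary} -> best: {best}")
```

With a narrow, confidently positive belief the three options nearly tie and the agent gains little from deferring; broader uncertainty over the (specified or learned) utility function strictly favors letting the human decide, which is the mechanism this outcome relies on.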
References
This outcome depends on the AI agent already trying to optimize a human’s utility function, a difficult open problem, while our corrigibility transformation can be applied to any goal.
— Corrigibility Transformation: Constructing Goals That Accept Updates (Hudson, arXiv:2510.15395, 17 Oct 2025), Related Work section