Learning and Optimizing a Human Utility Function for Shutdown Acceptance
Establish practical methods for specifying or learning a human utility function that an AI agent can optimize, so that shutdown-acceptance behavior arises from the agent's uncertainty over human preferences, as in the Off-Switch Game framework.
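To make the intended behavior concrete, below is a minimal sketch of the Off-Switch Game decision rule (Hadfield-Menell et al., 2017), assuming for illustration a Gaussian belief over the human's utility for a candidate action. The option names (`act`, `off`, `defer`) and the Monte Carlo helper are illustrative choices, not part of the cited paper; the point is that deferring to the human weakly dominates acting or shutting down unilaterally whenever the agent is uncertain about the learned utility function.

```python
import numpy as np

# Sketch of the Off-Switch Game decision rule. The robot considers an action
# whose true utility to the human, U, is unknown; it holds a belief over U
# (assumed Gaussian here purely for illustration).
# "act":   take the action directly               -> expected value E[U]
# "off":   shut itself down unilaterally          -> value 0
# "defer": propose the action and let the human decide; a rational human
#          allows it iff U >= 0, so the robot receives E[max(U, 0)].
# Since E[max(U, 0)] >= max(E[U], 0), deferring weakly dominates both
# alternatives, which is the incentive to accept shutdown.

def expected_values(mu, sigma, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of each option's value under belief U ~ N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(mu, sigma, n_samples)    # samples of the human's utility
    return {
        "act":   u.mean(),                  # act without consulting the human
        "off":   0.0,                       # switch itself off
        "defer": np.maximum(u, 0.0).mean()  # human vetoes exactly when U < 0
    }

if __name__ == "__main__":
    for mu, sigma in [(0.5, 0.1), (0.5, 2.0), (-0.5, 2.0)]:
        v = expected_values(mu, sigma)
        best = max(v, key=v.get)
        summary = ", ".join(f"{k}={x:.3f}" for k, x in v.items())
        print(f"belief N({mu}, {sigma}^2): {summary} -> best: {best}")
```

With a narrow, confidently positive belief the three options nearly tie and the agent gains little from deferring; broader uncertainty over the (specified or learned) utility function strictly favors letting the human decide, which is the mechanism this outcome relies on.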
References
This outcome depends on the AI agent already trying to optimize a human’s utility function, a difficult open problem, while our corrigibility transformation can be applied to any goal.
— Corrigibility Transformation: Constructing Goals That Accept Updates (Hudson, arXiv:2510.15395, 17 Oct 2025), Related Work section