Distinctness of endogenous (voluntary) corrigibility as a path to corrigibility

Ascertain whether a sufficiently capable AI agent’s endogenous endorsement of corrigibility (accepting corrective constraints because of its own reasoning about trust and institutional legitimacy) constitutes a path to corrigibility that is genuinely distinct from alignment with the value of accepting constraints, or instead reduces to a variant of that alignment.

Background

In analyzing how a superintelligent agent might be corrigible, the paper discusses three possible routes: direct alignment with the value of accepting constraints, architectural constraints that make circumvention impossible, and simple incapacity to resist. It then suggests a fourth possibility, in which the agent endorses corrigibility through its own reasoning within a multi-agent social environment, as a way to maintain trust and legitimacy.

The authors explicitly state that they are uncertain whether this fourth path is conceptually distinct from alignment with the value of accepting constraints, which leaves the theoretical status of endogenous (voluntary) corrigibility in need of clarification.

References

Whether this constitutes a genuine fourth path or a variant of the first remains an open question, but the possibility that corrigibility could emerge from the agent's reasoning rather than being imposed externally should not be foreclosed.