Distinctness of endogenous (voluntary) corrigibility as a path to corrigibility
Determine whether a sufficiently capable AI agent’s endogenous endorsement of corrigibility (accepting corrective constraints because of its own reasoning about trust and institutional legitimacy) is a distinct path to corrigibility, separate from alignment with the value of accepting constraints, or instead reduces to a variant of alignment.
References
Whether this constitutes a genuine fourth path or a variant of the first remains an open question, but the possibility that corrigibility could emerge from the agent's reasoning rather than being imposed externally should not be foreclosed.
— Evaluating Bounded Superintelligent Authority in Multi-Level Governance: A Framework for Governance Under Radical Capability Asymmetry
(2604.02720 - Rost, 3 Apr 2026) in Section 6.3 (Corrigibility and reversibility)