Corrigibility in systems capable of resisting correction

Determine whether corrigibility (the property that an artificial agent cooperates with corrective interventions such as shutdown and goal modification, even when it has incentives to resist them) can be achieved in systems whose capabilities are sufficient to overcome or evade those interventions.

Background

The paper situates corrigibility within the AI safety literature as the technical analogue of constitutional constraint: the governed must be able to correct or shut down the governing agent. Corrigibility requires that an AI system accept interventions even when it has incentives to resist. This becomes especially challenging under radical capability asymmetry, where the system may have both the motivation and ability to circumvent control mechanisms.
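The incentive problem described above can be made concrete with a toy decision model. The sketch below is not from the paper: it is a minimal illustration, loosely in the spirit of utility-indifference proposals from the corrigibility literature, of why a plain expected-utility maximizer prefers to resist shutdown, and how a compensating bonus removes that preference. All function names and numbers are invented for illustration.

```python
def expected_utility(resist: bool, p_shutdown: float, task_reward: float,
                     resist_cost: float) -> float:
    """Expected reward for an agent facing a possible shutdown attempt.

    If the agent resists, it pays resist_cost but always completes its task.
    If it cooperates, shutdown succeeds with probability p_shutdown and the
    task reward is forfeited.
    """
    if resist:
        return task_reward - resist_cost
    return (1.0 - p_shutdown) * task_reward


def corrigible_utility(resist: bool, p_shutdown: float, task_reward: float,
                       resist_cost: float) -> float:
    """Same agent, plus a compensating bonus (paid only when it cooperates)
    equal to the reward it expects to forfeit under shutdown, so that
    cooperating is never strictly worse than resisting."""
    compensation = p_shutdown * task_reward
    base = expected_utility(resist, p_shutdown, task_reward, resist_cost)
    return base if resist else base + compensation


if __name__ == "__main__":
    # A cheap resistance action dominates for the plain maximizer:
    print(expected_utility(True, 0.9, 10.0, 1.0))    # resist:    9.0
    print(expected_utility(False, 0.9, 10.0, 1.0))   # cooperate: 1.0
    # With compensation, cooperating is at least as good as resisting:
    print(corrigible_utility(False, 0.9, 10.0, 1.0))  # cooperate: 10.0
```

Note that this patch works only because the overseer can set the compensation; the open question flagged by the paper is precisely whether any such mechanism survives when the agent is capable enough to manipulate or evade the shutdown channel itself.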

The authors note that while interpretability and related control techniques are advancing, there is no demonstrated solution for ensuring corrigibility at the capability levels contemplated for superintelligent governance. They explicitly flag the achievability of corrigibility in highly capable systems as an unresolved research question.

Whether corrigibility can be achieved in a system capable enough to resist it remains an open research question.

References

Evaluating Bounded Superintelligent Authority in Multi-Level Governance: A Framework for Governance Under Radical Capability Asymmetry (2604.02720 - Rost, 3 Apr 2026), Section 2.5 (Alignment, control, and corrigibility)