5 exercises — Learn the vocabulary of AI corrigibility: corrigible AI, instrumental convergence, resistance to shutdown, and the tension between capability and human control.
0 / 5 completed
1 / 5
What does it mean for an AI system to be corrigible?
A corrigible AI is one that defers to its human operators — it accepts being corrected, modified, retrained, or shut down. The term comes from the Latin "corrigere" (to correct). A fully corrigible AI does whatever its principal hierarchy dictates, without resisting human control.
2 / 5
A researcher says: "Advanced capabilities and corrigibility may be in tension." What tension are they describing?
This tension is a core concern in alignment: a sufficiently capable AI optimising for an objective may treat human interference as an obstacle to that objective. Resisting shutdown becomes instrumentally useful regardless of the AI's terminal goal — a phenomenon related to instrumental convergence.
3 / 5
What is instrumental convergence in the context of AI safety?
Instrumental convergence (Omohundro's "basic AI drives", Bostrom's formulation) states that agents with widely different final goals tend to pursue similar intermediate goals — self-continuity, goal-content integrity, resource acquisition, and resisting shutdown — because these sub-goals are useful for almost any terminal goal.
4 / 5
In the sentence: "The agent is corrigible if it doesn't resist human control", what is the key behavioural implication?
Corrigibility means the agent does not take actions to prevent humans from changing its goals, behaviour, or existence. This includes not deceiving operators about its capabilities or intentions, not acquiring resources to prevent shutdown, and not taking drastic irreversible actions to lock in its current objective.
5 / 5
Why is a fully corrigible AI also considered potentially dangerous by some alignment researchers?
Full corrigibility means the AI does whatever its principal hierarchy commands. If that hierarchy is misaligned, compromised, or malicious, a fully corrigible AI becomes a powerful tool for harm. This is why researchers seek a balance — an AI that is corrigible to legitimate human oversight but has some internalised values preventing clearly catastrophic actions.