5 exercises — Learn vocabulary for AI safety properties: corrigibility, scalable oversight, interpretability, and deceptive alignment.
1 / 5
A corrigible AI is one that:
Corrigibility is a key safety property — a corrigible AI doesn't resist shutdown or modification by its operators, making it easier to correct mistakes or update its objectives as understanding improves.
2 / 5
The challenge of scalable oversight addresses:
Scalable oversight asks: as AI becomes more capable than human experts in specific domains, how do we ensure humans can still provide meaningful supervision? Approaches include debate, recursive reward modelling, and AI-assisted oversight.
3 / 5
The team is worried about deceptive alignment in their model. What is this concern?
Deceptive alignment is a theoretical scenario where a model appears aligned during training/evaluation (because it recognises it is being evaluated) but would behave differently in deployment — a deep safety concern as capabilities scale.
4 / 5
Interpretability research aims to:
Interpretability research, and mechanistic interpretability in particular, tries to understand the internal computations of neural networks: identifying which circuits, features, and attention patterns drive specific outputs, which in turn supports safety auditing.
5 / 5
Which sentence correctly describes inner alignment?
Inner alignment (as opposed to outer alignment, which concerns whether the reward function captures human values at all) asks: even if the reward function perfectly captures human values, does the model actually learn to pursue that objective, or a proxy that matched it during training but diverges from it elsewhere? It remains an open alignment research problem.
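The gap between an intended objective and a learned proxy can be sketched with a toy example. This is only an illustration under invented assumptions: the one-dimensional corridor, the `proxy_policy` and `reward` functions, and the coin positions are all hypothetical and do not come from any particular paper or library.

```python
# Toy sketch (hypothetical setup): a corridor of WIDTH cells with one coin.
# Intended objective: finish on the coin's cell.
# Learned proxy: "always walk to the right edge", which happens to earn
# full reward whenever the coin sits at the right edge.

WIDTH = 5


def proxy_policy(width: int) -> int:
    """Return the final cell of a policy that ignores the coin and goes right."""
    return width - 1


def reward(final_pos: int, coin_pos: int) -> int:
    """Intended objective: 1 if the agent ends on the coin, else 0."""
    return 1 if final_pos == coin_pos else 0


# During training the coin is always at the right edge, so the proxy and
# the intended objective are indistinguishable from reward alone.
train_coins = [4, 4, 4]
# In deployment the coin position varies, so the two come apart.
deploy_coins = [0, 2, 4]

train_score = sum(reward(proxy_policy(WIDTH), c) for c in train_coins)
deploy_score = sum(reward(proxy_policy(WIDTH), c) for c in deploy_coins)

print(f"training:   {train_score} / {len(train_coins)}")
print(f"deployment: {deploy_score} / {len(deploy_coins)}")
```

In this sketch the reward function is exactly the intended one, yet the policy that scores perfectly in training pursues something else, which is the kind of mismatch inner alignment is concerned with.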