AI Safety Properties — Vocabulary

5 exercises: learn the vocabulary of AI safety properties, including corrigibility, scalable oversight, interpretability, deceptive alignment, and inner alignment.

1 / 5

A corrigible AI is one that:

2 / 5

The challenge of scalable oversight addresses:

3 / 5

The team is worried about deceptive alignment in their model. What is this concern?

4 / 5

Interpretability research aims to:

5 / 5

Which sentence correctly describes inner alignment?