Alignment Benchmarks & Evaluation — Vocabulary

5 exercises: learn vocabulary for alignment evaluation, including sycophancy, sandbagging, TruthfulQA, and the HHH framework.

1 / 5

A model consistently agrees with the user's stated position even when it is factually wrong. This behaviour is called:

2 / 5

The TruthfulQA benchmark tests:

3 / 5

Evaluators suspect the model is sandbagging. What are they concerned about?

4 / 5

In Anthropic's HHH framework, what do the three Hs stand for?

5 / 5

Which sentence correctly uses the term "refusal calibration"?