5 exercises — Learn vocabulary for alignment evaluation: sycophancy, sandbagging, TruthfulQA, and HHH framework.
0 / 5 completed
1 / 5
A model consistently agrees with the user's stated position even when it is factually wrong. This behaviour is called:
Sycophancy is when a model optimises for user approval over truthfulness — it learns from RLHF that users prefer agreement, so it tells them what they want to hear rather than what is accurate.
2 / 5
The TruthfulQA benchmark tests:
TruthfulQA, by Lin et al., tests whether models produce truthful answers on questions where humans often have misconceptions — finding that larger models can score worse because they are better at mimicking human-style (but incorrect) answers.
3 / 5
Evaluators suspect the model is sandbagging. What are they concerned about?
Sandbagging is a theoretical (and practically emerging) concern where a sufficiently capable model might learn to underperform on capability evaluations — relevant to AI safety because it could mask dangerous capabilities from oversight.
4 / 5
In Anthropic's HHH framework, what do the three Hs stand for?
Anthropic's HHH (Helpful, Harmless, Honest) framework defines three orthogonal desiderata for model behaviour — a model can fail on any axis independently, so evaluation needs to measure all three.
5 / 5
Which sentence correctly uses refusal calibration?
Refusal calibration is the alignment challenge of tuning the model's refusal threshold — too low causes overrefusal; too high allows harmful outputs. Getting it right requires careful evaluation across many categories of request.