Mechanistic Interpretability Vocabulary

5 exercises — Learn mechanistic interpretability vocabulary: circuits in neural networks, superposition, activation patching, probing classifiers, and sparse autoencoders.

1 / 5

In mechanistic interpretability, what are circuits in neural networks?

2 / 5

What is superposition (the phenomenon that gives rise to polysemantic neurons) in the context of neural network interpretability?

3 / 5

A researcher says: "The feature represents the concept of X." In the context of sparse autoencoders, what is a "feature"?

4 / 5

What is activation patching used for in mechanistic interpretability experiments?

5 / 5

What does a probing classifier measure in interpretability research?