5 exercises — Learn mechanistic interpretability vocabulary: circuits in neural networks, superposition, activation patching, probing classifiers, and sparse autoencoders.
1 / 5
In mechanistic interpretability, what are circuits in neural networks?
Circuits (a term popularised by Olah et al. in the Distill "Circuits" thread) are sparse subgraphs within a neural network: specific sets of components (neurons, attention heads) and the weights connecting them that implement a recognisable computation, such as detecting a curve, identifying a proper noun, or performing indirect object identification. Reverse-engineering such circuits is a central goal of mechanistic interpretability.
2 / 5
What is superposition, and how does it relate to polysemantic neurons, in the context of neural network interpretability?
Superposition occurs when a model represents more features than it has neurons (or dimensions). Instead of giving each feature its own neuron, the model assigns each feature a direction in activation space that overlaps with the directions of other features. This works when features are sparse, so that few are active at once and interference between them stays small. One consequence is polysemanticity: a single neuron responds to several unrelated features. This makes mechanistic interpretability harder, because neurons don't map cleanly to single concepts.
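A minimal sketch of the idea, assuming three made-up features packed into a two-dimensional activation space at 120-degree spacing. The names `DIRECTIONS`, `encode`, and `readout` are illustrative and not taken from any library; the point is only that a sparse input can be read back despite overlapping directions, at the cost of some interference.

```python
import math

# Three hypothetical feature directions in a 2-D activation space,
# spaced 120 degrees apart so that no pair is orthogonal.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def encode(features):
    """Sum each feature's direction, scaled by how active the feature is."""
    x = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    y = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (x, y)


def readout(activation):
    """Project the activation back onto every feature direction."""
    return [activation[0] * d[0] + activation[1] * d[1] for d in DIRECTIONS]


# Only feature 0 is active: its readout is 1, the others pick up interference.
single = readout(encode([1.0, 0.0, 0.0]))

# A ReLU removes the negative interference, recovering the sparse input.
cleaned = [max(0.0, r) for r in single]

# "Neuron 0" (the first coordinate) has a nonzero weight for all three
# features, which is what a polysemantic neuron looks like in this toy setup.
neuron_0_weights = [d[0] for d in DIRECTIONS]
```

With dense inputs (several features active together) the interference terms no longer cancel so cleanly, which is why sparsity matters for this scheme.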
3 / 5
A researcher says: "The feature represents the concept of X." In the context of sparse autoencoders, what is a "feature"?
In mechanistic interpretability, a feature is usually taken to be a direction in the model's activation space that reliably encodes a particular concept (e.g. "the word is a colour", "this is a Python function"). Sparse autoencoders (SAEs) are trained to decompose model activations into a large, overcomplete set of sparsely active features, each ideally corresponding to a single human-interpretable concept. In practice, not every learned feature turns out to be cleanly interpretable.
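The forward pass and training objective of a sparse autoencoder can be sketched as follows. This is a toy under stated assumptions: the class name `TinySAE`, the hand-set weights, and the L1 coefficient are all hypothetical, and a real SAE would learn its weights by gradient descent on a large batch of model activations.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


class TinySAE:
    """Toy sparse autoencoder: f = ReLU(W_enc (x - b_dec) + b_enc), x_hat = f W_dec + b_dec."""

    def __init__(self, w_enc, b_enc, w_dec, b_dec):
        self.w_enc = w_enc  # one row per feature (encoder direction)
        self.b_enc = b_enc  # one bias per feature
        self.w_dec = w_dec  # one row per feature (decoder direction)
        self.b_dec = b_dec  # one bias per activation dimension

    def encode(self, x):
        centred = [a - b for a, b in zip(x, self.b_dec)]
        pre = [sum(w * c for w, c in zip(row, centred)) for row in self.w_enc]
        return relu([p + b for p, b in zip(pre, self.b_enc)])

    def decode(self, f):
        # Reconstruction is the activation-weighted sum of decoder rows.
        return [
            self.b_dec[j] + sum(fi * row[j] for fi, row in zip(f, self.w_dec))
            for j in range(len(self.b_dec))
        ]

    def loss(self, x, l1_coeff):
        """Mean squared reconstruction error plus an L1 sparsity penalty."""
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(abs(v) for v in f)


# Hand-set dictionary: 3 features for a 2-D activation space (overcomplete).
dirs = [
    [math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3)]
    for k in range(3)
]
sae = TinySAE(w_enc=dirs, b_enc=[0.0] * 3, w_dec=dirs, b_dec=[0.0, 0.0])

x = dirs[0]               # an activation lying along feature 0's direction
features = sae.encode(x)  # only feature 0 should fire
x_hat = sae.decode(features)
total_loss = sae.loss(x, l1_coeff=0.1)
```

The L1 term is one common way to encourage sparsity; other variants (for example keeping only the top-k activations) exist and are not shown here.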
4 / 5
What is activation patching used for in mechanistic interpretability experiments?
Activation patching is a causal intervention technique. Researchers run the model on two inputs (a "clean" one that produces the target behaviour and a "corrupted" one that doesn't), cache the activations from the clean run, and then "patch" them into specific layers or attention heads of the corrupted run. If the target behaviour is restored, that is evidence the patched components carry information that is causally important for the behaviour.
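The procedure can be illustrated on a toy model. Everything below is an assumption made for the sketch: `run_model` is a made-up two-component function, not a real network, and the component names `head_a` and `head_b` are hypothetical. On a real model the same steps would typically be done with framework hooks.

```python
def run_model(x, patch=None):
    """Toy model with two named components.

    `patch` optionally maps a component name to a value that overrides
    what the component would have computed on this input.
    """
    patch = patch or {}
    acts = {}
    acts["head_a"] = patch.get("head_a", 2.0 * x[0])
    acts["head_b"] = patch.get("head_b", x[1] + 1.0)
    # The output depends strongly on head_a and only weakly on head_b.
    output = 3.0 * acts["head_a"] + 0.1 * acts["head_b"]
    return output, acts


clean_input = (1.0, 1.0)      # produces the target behaviour (high output)
corrupted_input = (0.0, 0.0)  # does not

# 1. Run both inputs, caching the clean activations.
clean_out, clean_acts = run_model(clean_input)
corrupted_out, _ = run_model(corrupted_input)


def restored_fraction(component):
    """2. Patch one clean activation into the corrupted run and measure
    how much of the clean-vs-corrupted output gap is recovered."""
    patched_out, _ = run_model(
        corrupted_input, patch={component: clean_acts[component]}
    )
    return (patched_out - corrupted_out) / (clean_out - corrupted_out)


effect = {name: restored_fraction(name) for name in ("head_a", "head_b")}
```

In this toy case patching `head_a` recovers almost all of the gap and `head_b` almost none, which is the pattern one would read as "head_a matters for this behaviour". Real models are nonlinear, so per-component effects need not sum to one.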
5 / 5
What does a probing classifier measure in interpretability research?
A probing classifier is a lightweight classifier (often logistic regression) trained on a layer's internal representations (activations) to predict whether a concept is present (e.g. is this token a verb? is the sentence's sentiment positive?). High accuracy from a linear probe at a given layer suggests the concept is linearly decodable there, which makes probing a common tool for localising information in neural networks. Probe accuracy is correlational, though: it shows the information is present, not that the model actually uses it.
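As a rough sketch, a linear probe is just logistic regression fitted on cached activations. The example below uses synthetic "activations" in place of a real model's, with the concept written along an assumed direction; `make_activations`, `train_probe`, and all hyperparameters are illustrative choices, and in practice one would more likely reach for an existing library implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_activations(n, rng):
    """Synthetic stand-in for layer activations: the binary concept shifts
    each sample along one fixed direction, plus Gaussian noise."""
    concept_direction = [1.0, -1.0, 0.0, 0.0]  # assumed, for illustration
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.5 if label else -1.5
        acts = [rng.gauss(0.0, 1.0) + shift * d for d in concept_direction]
        data.append((acts, label))
    return data


def train_probe(data, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for acts, label in data:
            err = sigmoid(sum(wi * a for wi, a in zip(w, acts)) + b) - label
            for j in range(dim):
                grad_w[j] += err * acts[j]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def accuracy(probe, data):
    w, b = probe
    hits = sum(
        (sum(wi * a for wi, a in zip(w, acts)) + b > 0) == bool(label)
        for acts, label in data
    )
    return hits / len(data)


rng = random.Random(0)
train_set = make_activations(200, rng)
held_out = make_activations(200, rng)

probe = train_probe(train_set)
probe_accuracy = accuracy(probe, held_out)  # evaluated on unseen samples
```

Evaluating on held-out samples matters here: a probe with enough capacity can memorise its training set, so accuracy is usually compared against a baseline such as a probe trained on shuffled labels.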