Compiler Engineering Vocabulary: From Lexer to Code Generation
A comprehensive English vocabulary guide for compiler engineers covering the full pipeline from lexing and parsing through IR, optimisation passes, and code generation.
English in the Compiler Pipeline
Compiler engineering has a rich and precise vocabulary that has accumulated over decades of research and implementation. For engineers working on compilers — whether writing passes in LLVM, contributing to GCC, or building a language front-end — communicating in this vocabulary with fluency and precision is essential for code reviews, design discussions, and documentation.
This guide walks through the compiler pipeline from front-end to back-end, covering the English terms used at each stage.
Front-End Vocabulary: Lexing and Parsing
The front-end of a compiler transforms source text into a structured representation.
| Term | Definition |
|---|---|
| Lexer (tokeniser) | The component that reads source characters and produces a stream of tokens |
| Token | A categorised unit of source text, such as an identifier, literal, or keyword |
| Lexeme | The actual character sequence that matched a token pattern |
| Parser | The component that reads a token stream and constructs a parse tree or AST |
| Grammar | A formal description of the syntactic rules of a language |
| Production rule | A single rule in a grammar defining how a non-terminal can be expanded |
| Lookahead | The number of tokens the parser examines ahead of the current position to make a decision |
| Ambiguity | A property of a grammar in which at least one input has more than one valid parse tree |
| Abstract Syntax Tree (AST) | A tree representation of the syntactic structure of source code, with purely syntactic detail (punctuation, grouping parentheses) omitted |
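The difference between a token and a lexeme is easiest to see in running code. The following is a minimal sketch of a regex-based lexer for a hypothetical toy language; the token categories, the keyword list, and the names `Token`, `TOKEN_SPEC`, and `tokenise` are assumptions made for illustration, not part of any real compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str    # the token category, e.g. IDENT or NUMBER
    lexeme: str  # the exact source characters that matched


# Hypothetical token patterns for a toy language. Order matters:
# KEYWORD is tried before IDENT so that "let" is not classified as an identifier.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:let|if|else)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenise(source: str) -> Iterator[Token]:
    """Yield one Token per lexeme; whitespace is consumed but not emitted."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":
            yield Token(match.lastgroup, match.group())


tokens = list(tokenise("let x1 = 42 + y"))
for token in tokens:
    print(token.kind, repr(token.lexeme))
```

Here `x1` and `y` are two different lexemes that share the same token kind, `IDENT`, which is the distinction the table draws.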
When reviewing a parser implementation, a common observation is: “This production makes the grammar ambiguous; we need to document the precedence convention or restructure the grammar.”
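One way to make a precedence convention explicit is precedence climbing, sketched below for a hypothetical arithmetic grammar. The `PRECEDENCE` table, the tuple-based AST shape, and all function names are assumptions chosen for illustration. The parser uses a single token of lookahead (`peek`) and builds an AST in which parentheses no longer appear.

```python
import re


def tokenise(text):
    # Numbers and single-character operators or parentheses.
    # Simplification: any other character is silently ignored.
    return re.findall(r"\d+|[-+*/()]", text)


# Assumed precedence convention: * and / bind tighter than + and -,
# and all four operators are left-associative.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One token of lookahead: inspect the next token without consuming it.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_primary(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            node = self.parse_expr(1)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node  # the parentheses themselves are not kept in the AST
        if token.isdigit():
            return ("num", int(token))
        raise SyntaxError(f"unexpected token {token!r}")

    def parse_expr(self, min_prec=1):
        left = self.parse_primary()
        # Keep consuming operators that bind at least as tightly as min_prec.
        while self.peek() in PRECEDENCE and PRECEDENCE[self.peek()] >= min_prec:
            op = self.advance()
            # Requiring a higher precedence on the right gives left-associativity.
            right = self.parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left


def parse(text):
    parser = Parser(tokenise(text))
    tree = parser.parse_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("1 + 2 * 3"))  # ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
print(parse("8 - 3 - 2"))  # ('-', ('-', ('num', 8), ('num', 3)), ('num', 2))
```

Without the precedence table, `1 + 2 * 3` could be read as either `(1 + 2) * 3` or `1 + (2 * 3)`; the table is what selects the second reading.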
Middle-End Vocabulary: IR and Optimisation
The middle-end operates on an Intermediate Representation (IR), which is the compiler’s internal language for analysis and optimisation.
| Term | Definition |
|---|---|
| Intermediate Representation (IR) | A representation of program semantics that is largely independent of both the source language and the target architecture |
| Static Single Assignment (SSA) form | An IR property where every variable has exactly one definition in the program text, which makes use-def chains explicit |
| Phi node (φ-node) | An SSA construct that selects a value based on which predecessor basic block was executed |
| Basic block | A straight-line sequence of instructions with a single entry and single exit |
| Control flow graph (CFG) | A graph where nodes are basic blocks and edges represent possible transfers of control |
| Data flow analysis | A technique for computing properties of values at each program point |
| Dominator | A node D dominates node N if every path from the entry node to N passes through D |
| Alias analysis | Analysis that determines whether two pointers could refer to the same memory location |
| Inlining | Replacing a function call with a copy of the called function’s body |
| Loop invariant code motion (LICM) | Moving computations out of a loop when their result does not change across iterations |
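Several of these terms meet in the computation of dominators, which is itself a small data flow analysis over the CFG. The sketch below uses the simple iterative set-based formulation, not the faster algorithms production compilers tend to use. The CFG encoding (a dict from block name to successor list) and the block names are assumptions for illustration; every block must appear as a key and be reachable from the entry.

```python
def compute_dominators(cfg, entry):
    """Return {block: set of blocks that dominate it}.

    cfg maps each basic block to a list of its successors. Assumes every
    block appears as a key and is reachable from `entry`.
    """
    blocks = set(cfg)
    preds = {b: set() for b in blocks}
    for block, successors in cfg.items():
        for succ in successors:
            preds[succ].add(block)

    # Start from the most optimistic answer and shrink until nothing changes.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            # A block is dominated by itself plus whatever dominates
            # every one of its predecessors.
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


# A diamond-shaped CFG: an if/else that rejoins at "merge".
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["merge"],
    "else": ["merge"],
    "merge": ["exit"],
    "exit": [],
}
dom = compute_dominators(cfg, "entry")
print(sorted(dom["merge"]))  # ['cond', 'entry', 'merge']
```

Neither `then` nor `else` dominates `merge`, because control can reach `merge` through the other branch. A block like `merge` is where SSA form would place a phi node for any variable assigned differently on the two branches.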
LLVM-Specific Vocabulary
LLVM has its own set of terms and conventions used in discussions, documentation, and code review.
| Term | Definition |
|---|---|
| Pass | A transformation or analysis that operates on the IR; LLVM optimises a program by running a sequence of passes over it |
| Analysis pass | A pass that computes information about the IR without modifying it |
| Transform pass | A pass that modifies the IR to improve it in some way |
| Pass manager | The infrastructure that schedules and runs passes efficiently |
| Canonicalise | To transform an IR construct into a standard, normalised form that is easier to reason about |
| Emit | To produce output — either a new IR form or target machine code |
| Legalisation (spelled “legalization” in LLVM sources) | The process of transforming IR operations into forms the target machine supports |
| Instruction selection | The mapping of IR instructions to target machine instructions |
| Register allocation | The assignment of IR virtual registers to physical machine registers |
| Spill | To move a value from a register to memory because there are insufficient registers available |
In LLVM code reviews, “canonicalise” is a frequently used instruction: “Before optimising this pattern, canonicalise it to the form the middle-end expects.”
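To see why canonical forms matter, consider the toy sketch below. It is not LLVM's API (LLVM's pass infrastructure is written in C++); the `Instr` class, the opcode names, and `run_pipeline` are hypothetical stand-ins. It shows an analysis pass that only reads the IR, two transform passes that modify it, and a pattern match that works only because an earlier pass canonicalised the operand order.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str
    op: str
    lhs: object  # a register name (str) or an integer constant
    rhs: object


COMMUTATIVE = {"add", "mul"}


def count_uses(function):
    """Analysis pass: count how often each register is read; the IR is unchanged."""
    uses = {}
    for instr in function:
        for operand in (instr.lhs, instr.rhs):
            if isinstance(operand, str):
                uses[operand] = uses.get(operand, 0) + 1
    return uses


def canonicalise_commutative(function):
    """Transform pass: put the constant on the right of commutative operations."""
    changed = False
    for instr in function:
        if (instr.op in COMMUTATIVE and isinstance(instr.lhs, int)
                and isinstance(instr.rhs, str)):
            instr.lhs, instr.rhs = instr.rhs, instr.lhs
            changed = True
    return changed


def simplify_add_zero(function):
    """Transform pass: rewrite `add x, 0` as `copy x`.

    It inspects only the right operand, so it relies on the canonical
    form produced by canonicalise_commutative.
    """
    changed = False
    for instr in function:
        if instr.op == "add" and instr.rhs == 0:
            instr.op, instr.rhs = "copy", None
            changed = True
    return changed


def run_pipeline(function, passes):
    """Minimal stand-in for a pass manager: run the transform passes in
    order and report which of them changed the IR."""
    return [p.__name__ for p in passes if p(function)]


function = [Instr("t0", "add", 0, "a"), Instr("t1", "mul", "t0", 4)]
print(run_pipeline(function, [canonicalise_commutative, simplify_add_zero]))
print(function[0])
print(count_uses(function))
```

If `simplify_add_zero` ran on its own, it would miss `add 0, a`. Canonicalising first means each later pass needs to recognise one spelling of a pattern instead of two.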
Back-End Vocabulary: Register Allocation and Code Generation
| Term | Definition |
|---|---|
| Virtual register | One of an unlimited supply of abstract registers used in IR before register allocation |
| Calling convention | The protocol governing how arguments, return values, and registers are managed across function calls |
| Prologue / Epilogue | The function entry/exit code that sets up and tears down the stack frame |
| Peephole optimisation | A local optimisation that examines a small window of instructions and replaces them with more efficient equivalents |
| Instruction scheduling | Reordering instructions to improve performance, typically to avoid pipeline stalls |
| Relocation | A record in object code marking a reference whose final address must be filled in at link time or load time |
| Object file | The output of the assembler (or of a compiler that emits machine code directly), containing machine code and metadata for the linker |
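A peephole optimiser can be sketched in a few lines. The version below works on a hypothetical tuple-based instruction list with x86-like mnemonics; the three rewrite rules and the toy machine model (no condition flags, and memory below the stack pointer is never observed) are assumptions for illustration only.

```python
def peephole(instructions):
    """Repeatedly rewrite short instruction windows until nothing changes."""
    code = list(instructions)
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(code):
            cur = code[i]
            nxt = code[i + 1] if i + 1 < len(code) else None
            if cur[0] == "mov" and cur[1] == cur[2]:
                # mov r, r has no effect.
                i += 1
                changed = True
            elif cur[0] == "add" and cur[2] == 0:
                # add r, 0 leaves r unchanged (flags are ignored in this toy model).
                i += 1
                changed = True
            elif nxt is not None and cur[0] == "push" and nxt == ("pop", cur[1]):
                # push r immediately followed by pop r cancels out.
                i += 2
                changed = True
            else:
                out.append(cur)
                i += 1
        code = out
    return code


before = [
    ("push", "rax"),
    ("mov", "rbx", "rbx"),
    ("pop", "rax"),
    ("add", "rcx", 0),
    ("mov", "rdx", "rcx"),
    ("ret",),
]
print(peephole(before))  # [('mov', 'rdx', 'rcx'), ('ret',)]
```

The outer loop matters: removing the redundant `mov` in the first sweep makes the `push`/`pop` pair adjacent, so the second sweep can remove it as well.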
Example Sentences
- “The current lexer does not handle Unicode identifiers correctly — we need to update the token classification rules before the next release.”
- “After SSA construction, any phi node whose incoming values are all identical is trivial and can be eliminated; the mem2reg pass should handle this automatically.”
- “This pass canonicalises all GEP instructions to a normalised form, which makes subsequent alias analysis more precise.”
- “The register allocator is spilling aggressively in this hot loop; we should examine the register pressure and consider splitting the loop into two phases.”
- “The LLVM backend emits a function prologue that is unnecessarily large for leaf functions — we can apply the leaf function optimisation to eliminate the frame pointer setup.”
Communication Tips for Compiler Engineers
When discussing a compiler bug, distinguish between a miscompilation (the compiler silently generates incorrect code) and a compiler crash (the compiler terminates abnormally, for example on a failed assertion, without producing output). The two call for different debugging approaches and usually differ in severity.
When writing pass documentation or code comments, prefer the active voice, and use the imperative mood for instructions: “This pass transforms…”, “The analysis computes…”, “Call this pass after…”, not “This pass is used to transform…”
The verb “lower” has a specific meaning in compiler vocabulary: to replace a high-level IR construct with a more concrete one closer to the target machine. “We lower intrinsics to platform-specific builtins in the legalisation phase.” Avoid using it in its everyday English sense when writing compiler documentation.
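As a concrete illustration of lowering, the sketch below replaces a high-level `max` operation in a toy IR with a compare followed by a select. The tuple encoding, the opcode names `icmp_sgt` and `select` (borrowed loosely from LLVM IR spelling), and the `%cmpN` temporaries are assumptions for illustration; this is not how any particular compiler implements the step.

```python
from itertools import count


def lower_max(instructions):
    """Lower each high-level `max` into a signed compare plus a select.

    Instructions are tuples of the form (dest, opcode, operand, ...).
    Simplification: the generated %cmpN names are assumed not to clash
    with names already present in the input.
    """
    fresh = count()
    lowered = []
    for dest, op, *operands in instructions:
        if op == "max":
            a, b = operands
            flag = f"%cmp{next(fresh)}"  # hypothetical temporary name
            lowered.append((flag, "icmp_sgt", a, b))      # flag = a > b
            lowered.append((dest, "select", flag, a, b))  # dest = a if flag else b
        else:
            lowered.append((dest, op, *operands))
    return lowered


high_level = [("%m", "max", "%x", "%y"), ("%r", "add", "%m", "%x")]
for instr in lower_max(high_level):
    print(instr)
```

After lowering, the IR contains only operations that a typical target can select instructions for directly, which is the sense of “closer to the target machine” above.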