Compiler Engineering Vocabulary: From Lexer to Code Generation

A comprehensive English vocabulary guide for compiler engineers covering the full pipeline from lexing and parsing through IR, optimisation passes, and code generation.

English in the Compiler Pipeline

Compiler engineering has a rich and precise vocabulary that has accumulated over decades of research and implementation. For engineers working on compilers — whether writing passes in LLVM, contributing to GCC, or building a language front-end — communicating in this vocabulary with fluency and precision is essential for code reviews, design discussions, and documentation.

This guide walks through the compiler pipeline from front-end to back-end, covering the English terms used at each stage.


Front-End Vocabulary: Lexing and Parsing

The front-end of a compiler transforms source text into a structured representation.

| Term | Definition |
| --- | --- |
| Lexer (tokeniser) | The component that reads source characters and produces a stream of tokens |
| Token | A categorised unit of source text, such as an identifier, literal, or keyword |
| Lexeme | The actual character sequence that matched a token pattern |
| Parser | The component that reads a token stream and constructs a parse tree or AST |
| Grammar | A formal description of the syntactic rules of a language |
| Production rule | A single rule in a grammar defining how a non-terminal can be expanded |
| Lookahead | The number of tokens the parser examines ahead of the current position to make a decision |
| Ambiguity | A property of a grammar where a single input has more than one valid parse tree |
| Abstract Syntax Tree (AST) | A tree representation of the syntactic structure of source code, with concrete-syntax detail such as punctuation and grouping parentheses omitted |

When reviewing a parser implementation, a common observation is: “This production rule is ambiguous — we need to document the precedence convention or restructure the grammar.”
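
To make the token and lexeme distinction from the table above concrete, here is a minimal lexer sketch in Python. The token categories, the patterns in `TOKEN_SPEC`, and the toy input are all assumptions made for illustration; they do not describe any particular compiler.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str    # the token category, e.g. "IDENT"
    lexeme: str  # the exact characters that matched the pattern


# Hypothetical token patterns for a toy language. Order matters: the first
# alternative that matches at a position wins, so KEYWORD precedes IDENT.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:let|if|else)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens from left to right, dropping whitespace."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":
            yield Token(match.lastgroup, match.group())


tokens = list(tokenize("let x = 42"))
```

Here `tokens` holds four entries: a KEYWORD with lexeme `let`, an IDENT with lexeme `x`, an OP with lexeme `=`, and a NUMBER with lexeme `42`. The kind is what the parser's grammar talks about; the lexeme is what appeared in the source.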


Middle-End Vocabulary: IR and Optimisation

The middle-end operates on an Intermediate Representation (IR), which is the compiler’s internal language for analysis and optimisation.

| Term | Definition |
| --- | --- |
| Intermediate Representation (IR) | A representation of program semantics that sits between the source language and machine code and is largely independent of both |
| Static Single Assignment (SSA) form | An IR property where every variable has exactly one definition in the program text, which makes use-def chains explicit |
| Phi node (φ-node) | An SSA construct that selects a value based on which predecessor basic block control arrived from |
| Basic block | A straight-line sequence of instructions with a single entry point and a single exit point |
| Control flow graph (CFG) | A graph where nodes are basic blocks and edges represent possible transfers of control |
| Data flow analysis | A technique for computing properties of values at each program point |
| Dominator | A node D dominates node N if every path from the entry node to N passes through D |
| Alias analysis | Analysis that determines whether two pointers could refer to the same memory location |
| Inlining | Replacing a function call with a copy of the called function’s body |
| Loop-invariant code motion (LICM) | Moving computations out of a loop when their result does not change across iterations |
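
Several of these terms fit together in one small computation. The sketch below runs a simple iterative data flow analysis over a hypothetical four-block CFG to find each block's dominators, where D dominates N if every path from the entry block to N passes through D. The block names and the graph are invented for illustration, and the code assumes every block is reachable from the entry block.

```python
from typing import Dict, List, Set

# Hypothetical CFG: each basic block maps to its successor blocks.
#
#        entry
#          |
#        header <---+
#        /    \     |
#     body    exit  |
#       |           |
#       +-----------+
CFG: Dict[str, List[str]] = {
    "entry": ["header"],
    "header": ["body", "exit"],
    "body": ["header"],
    "exit": [],
}


def compute_dominators(cfg: Dict[str, List[str]], entry: str) -> Dict[str, Set[str]]:
    """Solve dom(n) = {n} union (intersection of dom(p) over predecessors p)."""
    preds: Dict[str, List[str]] = {block: [] for block in cfg}
    for block, successors in cfg.items():
        for succ in successors:
            preds[succ].append(block)

    # Start from the most optimistic answer and shrink it until nothing changes.
    all_blocks = set(cfg)
    dom = {block: set(all_blocks) for block in cfg}
    dom[entry] = {entry}

    changed = True
    while changed:
        changed = False
        for block in cfg:
            if block == entry:
                continue
            incoming = [dom[p] for p in preds[block]]
            new_dom = {block} | (set.intersection(*incoming) if incoming else set())
            if new_dom != dom[block]:
                dom[block] = new_dom
                changed = True
    return dom


dominators = compute_dominators(CFG, "entry")
```

In this graph `header` dominates both `body` and `exit`, while `body` does not dominate `exit`, because control can reach `exit` from `header` without ever entering the loop body. Production compilers generally use faster algorithms and store the result as a dominator tree; the set-based version is only meant to show the definition at work.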

LLVM-Specific Vocabulary

LLVM has its own set of terms and conventions used in discussions, documentation, and code review.

| Term | Definition |
| --- | --- |
| Pass | A transformation or analysis that operates on the IR; LLVM optimises a program by running a sequence of passes over it |
| Analysis pass | A pass that computes information about the IR without modifying it |
| Transform pass | A pass that modifies the IR to improve it in some way |
| Pass manager | The infrastructure that schedules and runs passes efficiently |
| Canonicalise | To transform an IR construct into a standard, normalised form that is easier to reason about |
| Emit | To produce output, either a new IR form or target machine code |
| Legalization | The process of transforming IR operations into forms the target machine supports |
| Instruction selection | The mapping of IR instructions to target machine instructions |
| Register allocation | The assignment of IR virtual registers to physical machine registers |
| Spill | To move a value from a register to memory because too few registers are available |

In LLVM code reviews, “canonicalise” often appears as a request: “Before optimising this pattern, canonicalise it to the form the middle-end expects.”
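
The following Python sketch shows why a reviewer might ask for canonicalisation first. It is a toy model, not LLVM's API: the `Instr` type, the pass names, and `run_pipeline` (a stand-in for a pass manager) are all hypothetical. One analysis pass only inspects the code, and two transform passes rewrite it; the second transform matches only the canonical operand order.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Operand = Union[int, str]  # an integer constant or the name of a value


@dataclass(frozen=True)
class Instr:
    dest: str
    op: str
    lhs: Operand
    rhs: Operand


COMMUTATIVE = {"add", "mul"}


def count_constants(code: List[Instr]) -> int:
    """Analysis pass: computes a fact about the IR and leaves it unchanged."""
    return sum(isinstance(x, int) for i in code for x in (i.lhs, i.rhs))


def canonicalise_commutative(code: List[Instr]) -> List[Instr]:
    """Transform pass: move a lone constant operand to the right-hand side."""
    result = []
    for i in code:
        if i.op in COMMUTATIVE and isinstance(i.lhs, int) and not isinstance(i.rhs, int):
            i = Instr(i.dest, i.op, i.rhs, i.lhs)
        result.append(i)
    return result


def mul_by_two_to_shift(code: List[Instr]) -> List[Instr]:
    """Transform pass: rewrite `mul x, 2` as `shl x, 1` (canonical form only)."""
    return [
        Instr(i.dest, "shl", i.lhs, 1)
        if i.op == "mul" and i.rhs == 2 and not isinstance(i.lhs, int)
        else i
        for i in code
    ]


Pass = Callable[[List[Instr]], List[Instr]]


def run_pipeline(code: List[Instr], passes: List[Pass]) -> List[Instr]:
    """Stand-in for a pass manager: run each transform pass in order."""
    for transform in passes:
        code = transform(code)
    return code


program = [Instr("t0", "mul", 2, "x"), Instr("t1", "add", "t0", 3)]
optimised = run_pipeline(program, [canonicalise_commutative, mul_by_two_to_shift])
```

Run on its own, `mul_by_two_to_shift` leaves `program` untouched because the constant is on the left. After canonicalisation the same pass fires, so each later pass needs one pattern instead of two.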


Back-End Vocabulary: Register Allocation and Code Generation

| Term | Definition |
| --- | --- |
| Virtual register | One of an unlimited supply of abstract registers used in the IR before register allocation |
| Calling convention | The protocol governing how arguments, return values, and registers are managed across function calls |
| Prologue / Epilogue | The function entry/exit code that sets up and tears down the stack frame |
| Peephole optimisation | A local optimisation that examines a small window of instructions and replaces them with more efficient equivalents |
| Instruction scheduling | Reordering instructions to improve performance, typically to avoid pipeline stalls |
| Relocation | A record in object code marking a reference that must be patched with a final address at link time or load time |
| Object file | The output of the assembler, containing machine code and metadata for the linker |
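
As an illustration of the peephole entry above, here is a minimal sketch in Python. The tuple-based instruction format and the two patterns are assumptions for a hypothetical machine; the sketch also assumes the instruction list is a single basic block with no labels or branch targets between the instructions.

```python
from typing import List, Tuple

Instruction = Tuple[str, ...]  # e.g. ("mov", "r1", "r2") on a hypothetical machine


def peephole(code: List[Instruction]) -> List[Instruction]:
    """Slide a two-instruction window over the code and drop redundant patterns."""
    out: List[Instruction] = []
    for instr in code:
        # Pattern 1: moving a register onto itself does nothing.
        if instr[0] == "mov" and instr[1] == instr[2]:
            continue
        # Pattern 2: `push r` immediately followed by `pop r` leaves r and the
        # stack pointer as they were, so the pair can be removed.
        if out and instr[0] == "pop" and out[-1] == ("push", instr[1]):
            out.pop()
            continue
        out.append(instr)
    return out


before = [
    ("mov", "r1", "r1"),
    ("push", "r2"),
    ("pop", "r2"),
    ("add", "r1", "r2"),
]
after = peephole(before)
```

Only the `add` instruction survives in `after`. Because the window compares against the last instruction kept rather than the last one read, removing one pair can expose another, which is why peephole passes are often run until no pattern matches.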

Example Sentences

  1. “The current lexer does not handle Unicode identifiers correctly — we need to update the token classification rules before the next release.”
  2. “After mem2reg promotes these allocas to SSA values, any phi node whose incoming values are all identical is trivial and can be eliminated; instruction simplification should handle this automatically.”
  3. “This pass canonicalises all GEP instructions to a normalised form, which makes subsequent alias analysis more precise.”
  4. “The register allocator is spilling aggressively in this hot loop; we should examine the register pressure and consider splitting the loop into two phases.”
  5. “The LLVM backend emits a function prologue that is unnecessarily large for leaf functions — we can apply the leaf function optimisation to eliminate the frame pointer setup.”

Communication Tips for Compiler Engineers

When discussing a compiler bug, distinguish between a miscompilation (the compiler generates incorrect code) and a compiler crash (the compiler fails to produce output). These have different debugging approaches and different levels of severity.

When writing pass documentation or code comments, prefer the active voice, and use the imperative mood for instructions: write “This pass transforms…”, “The analysis computes…”, and “Call this pass after…” rather than “This pass is used to transform…”.

The verb “lower” has a specific meaning in compiler vocabulary: to replace a high-level IR construct with a more concrete one closer to the target machine. “We lower intrinsics to platform-specific builtins in the legalization phase.” Avoid using it in its everyday English sense when writing compiler documentation.
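
A small sketch may help fix the compiler sense of “lower”. The Python below rewrites a high-level `abs` operation into three simpler operations in a hypothetical three-address IR. The opcode names (`neg`, `cmp_lt`, `select`), the tuple layout, and the `%t` temporary naming are assumptions for illustration, and the sketch assumes the generated temporary names do not clash with names already in the program.

```python
from itertools import count
from typing import Iterator, List, Tuple


def lower_abs(code: List[Tuple], fresh: Iterator[int]) -> List[Tuple]:
    """Lower each high-level `abs` into negate + compare + select."""
    lowered: List[Tuple] = []
    for instr in code:
        if instr[0] != "abs":
            lowered.append(instr)  # already low-level enough; keep as is
            continue
        _, dest, src = instr
        negated = f"%t{next(fresh)}"
        is_negative = f"%t{next(fresh)}"
        lowered.append(("neg", negated, src))                      # negated = -src
        lowered.append(("cmp_lt", is_negative, src, 0))            # is_negative = src < 0
        lowered.append(("select", dest, is_negative, negated, src))  # pick one of the two
    return lowered


high_level = [("abs", "%r", "%x"), ("ret", "%r")]
low_level = lower_abs(high_level, count())
```

The program means the same thing before and after; lowering only changes the level of abstraction. A sentence such as “we lower `abs` to a compare and select” describes exactly this kind of rewrite.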