5 exercises — practise professional English answers for NLP Engineer interviews.
Structure for NLP Engineer answers
Tip 1: Explain the transformer architecture: multi-head self-attention, positional encoding, encoder vs decoder vs encoder-decoder
Tip 2: Distinguish tokenisation strategies: BPE, WordPiece, SentencePiece, and their trade-offs for OOV handling
Tip 3: Fine-tuning vocabulary: full fine-tuning vs LoRA/PEFT, task-specific head, overfitting risk on small datasets
Tip 4: Embeddings: static (Word2Vec, GloVe) vs contextual (BERT) — why contextual wins for polysemy
1 / 5
The interviewer asks: "Explain the self-attention mechanism in transformers and why it is more powerful than RNNs for NLP." Which answer best demonstrates architectural understanding?
Option B is strongest because it gives the mathematical formulation, explains multi-head attention, and compares the trade-offs of transformers and RNNs. Key structure: Q/K/V vectors → Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V → multi-head (h heads, each capturing different relationships) → parallel computation across positions → O(1) path length between any two tokens vs O(n) for an RNN → weakness: O(n²) memory in sequence length. Option A correctly notes parallelism but lacks the mechanism. Option C gives a vague and inaccurate description. Option D mischaracterises the motivation for attention.
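To make the formula concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention. It assumes Q, K and V are already given as lists of vectors; in a real transformer they come from learned linear projections of the token embeddings, which are omitted here. The function names are illustrative and do not come from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Q, K: lists of n vectors of length d_k; V: list of n vectors of length d_v.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One score per key: dot product scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output vector is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


# Toy input: three tokens with 2-dimensional vectors (made-up numbers).
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

The `weights` matrix has one row and one column per token, which is where the O(n²) memory cost comes from. Each row is computed independently of the others, which is why the computation parallelises across positions in a way an RNN's step-by-step recurrence does not.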
2 / 5
The interviewer asks: "What is Byte-Pair Encoding (BPE) and why is it used for tokenisation?" Which answer best demonstrates tokenisation knowledge?
Option B is strongest because it explains the algorithm (iterative merge), gives an example, lists the advantages (OOV, multilingual, efficiency), and distinguishes BPE from WordPiece. Key structure: character start → merge most frequent pairs → target vocab size → OOV decomposition → multilingual → GPT BPE vs BERT WordPiece. Option A correctly describes the algorithm but lacks examples and advantages. Option C correctly names BPE's origin but mischaracterises the NLP use. Option D describes a different (incorrect) tokenisation approach.
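The merge loop described above can be sketched in a few lines. This is a simplified illustration, not the tokeniser used by any particular model: it works on whole characters rather than bytes, omits end-of-word markers, and breaks ties by first occurrence. The word frequencies are invented for the example, and `learn_bpe`, `encode` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from a {word: frequency} dictionary."""
    # Start from individual characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = tuple(merge_pair(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenise a word, seen or unseen, by replaying the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Invented toy corpus; "lowest" never appears in it.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=3)
tokens = encode("lowest", merges)
```

Because encoding falls back to smaller units, the unseen word "lowest" is split into known subwords instead of being mapped to an unknown token. In practice `num_merges` is chosen to reach a target vocabulary size.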
3 / 5
The interviewer asks: "When would you use parameter-efficient fine-tuning (PEFT/LoRA) instead of full fine-tuning?" Which answer best demonstrates LLM fine-tuning expertise?
Option B is strongest because it explains LoRA's mechanism (low-rank matrices), quantifies the parameter savings, and gives concrete decision criteria. Key structure: freeze the base weights and train a low-rank update ΔW = BA → roughly 0.1-1% of parameters trainable → a 7B model can be tuned on a single GPU → swappable adapters → less catastrophic forgetting → suits small datasets; choose full fine-tuning when domain shift is extreme and data is abundant. Option A only identifies one use case. Option C confuses fine-tuning with inference optimisation. Option D presents a false and arbitrary rule.
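The parameter saving is easy to check with arithmetic. In LoRA, a frozen weight matrix W of shape d_out × d_in gets a trainable update ΔW = BA, with B of shape d_out × r and A of shape r × d_in. The sketch below counts parameters for one such matrix. The hidden size of 4096 and the rank of 8 are illustrative assumptions, not values from the text, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (lora_params, full_params) for a single adapted weight matrix."""
    full_params = d_out * d_in            # parameters in the frozen matrix W
    lora_params = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return lora_params, full_params


# Assumed example: a square 4096 x 4096 projection adapted at rank 8.
lora_params, full_params = lora_param_counts(4096, 4096, 8)
fraction = lora_params / full_params
print(f"{lora_params:,} trainable vs {full_params:,} frozen ({fraction:.2%})")
```

For this single matrix the trainable share is about 0.39%. The share for a whole model depends on which matrices are adapted and at what rank, so it can be lower or higher than this figure.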
4 / 5
The interviewer asks: "What is Named Entity Recognition (NER) and how would you evaluate a NER model?" Which answer best demonstrates NLP evaluation maturity?
Option B is strongest because it defines NER, explains entity-level (not token-level) evaluation, distinguishes exact from partial match, and separates micro from macro F1. Key structure: NER as span classification → entity-level F1 (not accuracy) → exact vs partial match → micro-F1 (dominated by frequent types) vs macro-F1 (weights rare types equally) → boundary errors vs type confusion → domain distribution. Option A uses accuracy, which is inappropriate for NER because of class imbalance: most tokens carry the O label. Option C describes a generic train/test split without appropriate metrics. Option D incorrectly applies BLEU, a machine-translation metric.
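A small sketch shows how entity-level, exact-match scoring behaves and why micro and macro F1 can diverge. Entities are assumed to be represented as (start, end, type) tuples for a single document. This representation, the helper names and the toy spans are all made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def entity_f1(gold, pred):
    """Exact-match entity-level F1.

    gold, pred: sets of (start, end, type) tuples.
    Returns (micro_f1, macro_f1, per_type_f1).
    """
    types = sorted({e[-1] for e in gold | pred})
    per_type = {}
    tp_all = fp_all = fn_all = 0
    for t in types:
        g = {e for e in gold if e[-1] == t}
        p = {e for e in pred if e[-1] == t}
        # A prediction counts only if span boundaries AND type both match.
        tp, fp, fn = len(g & p), len(p - g), len(g - p)
        per_type[t] = f1(tp, fp, fn)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = f1(tp_all, fp_all, fn_all)
    macro = sum(per_type.values()) / len(per_type) if per_type else 0.0
    return micro, macro, per_type


# Toy data: three PER entities predicted correctly, one ORG with a boundary error.
gold = {(0, 2, "PER"), (8, 9, "PER"), (11, 12, "PER"), (5, 6, "ORG")}
pred = {(0, 2, "PER"), (8, 9, "PER"), (11, 12, "PER"), (5, 7, "ORG")}
micro, macro, per_type = entity_f1(gold, pred)
```

Here the frequent PER type is perfect while the rare ORG type scores zero, so micro-F1 is 0.75 and macro-F1 is 0.5. Under exact match, the single boundary error counts as both a false positive and a false negative; a partial-match scheme would give it some credit.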
5 / 5
The interviewer asks: "How do you handle multilingual NLP — building a model that works across multiple languages?" Which answer best demonstrates multilingual NLP engineering?
Option B is strongest because it names specific multilingual models, explains transfer strategies, addresses the tokeniser and capacity challenges, and mentions benchmarks. Key structure: XLM-RoBERTa as the base model → zero-shot cross-lingual transfer vs translate-train → SentencePiece tokeniser for low-resource languages → curse of multilinguality (fixed model capacity shared across more languages) → language-specific LoRA adapters → XTREME/XGLUE evaluation. Option A (separate models) is expensive and misses cross-lingual transfer benefits. Option C (translate to English) loses nuance and is slow for production. Option D describes a real technique (language token) but is incomplete as a strategy.