The interviewer asks: "Explain NDCG@k and why it is preferred over precision@k for search relevance evaluation." Which answer is most precise?
Option B is strongest. It states the DCG formula with variable definitions rather than only describing it, explains the normalisation step and why it matters for averaging across queries, and gives three named reasons NDCG beats precision@k. Crucially, it gives a concrete example showing why two "somewhat relevant" results score lower than one "perfect" result at position 1, which is the key intuition most candidates cannot articulate. It also closes with the practical limitation (graded labels are expensive) and a realistic mitigation.

Search evaluation vocabulary:
DCG: Discounted Cumulative Gain, the sum of graded relevances, each discounted by its rank position.
IDCG: Ideal DCG, the DCG achieved by the perfect ranking.
Graded relevance: relevance labels on a multi-point scale (for example 0-3) rather than binary.
Position discount: the log2(i+1) denominator that reduces the contribution of lower-ranked results.
SERP annotation: human raters labelling search engine results pages for relevance.

Options C and D are accurate but present the formula without the intuitive example and lack the cross-query comparability explanation.
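The "one perfect result versus two somewhat relevant results" intuition can be checked numerically. The sketch below is a minimal illustration, not the formula from Option B itself: it assumes linear gain (rel / log2(i+1)), a 0-3 label scale, and a made-up pool of judged labels for one query. Many systems use the exponential gain 2^rel - 1 instead, which widens the gap further.

```python
import math

def dcg_at_k(rels, k):
    """DCG@k with linear gain: sum of rel_i / log2(i + 1) for ranks i = 1..k."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(ranked_rels, judged_rels, k):
    """DCG@k of the ranking divided by DCG@k of the ideal ordering of all judged labels."""
    idcg = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

def precision_at_k(ranked_rels, k, min_rel=1):
    """Binary precision@k: a result counts as relevant if its label is >= min_rel."""
    return sum(1 for rel in ranked_rels[:k] if rel >= min_rel) / k

# Hypothetical judged pool for one query (labels on a 0-3 scale).
judged = [3, 1, 1, 0, 0]
one_perfect = [3, 0, 0]    # a single "perfect" result at position 1
two_somewhat = [1, 1, 0]   # two "somewhat relevant" results at positions 1 and 2

print(precision_at_k(one_perfect, 3), precision_at_k(two_somewhat, 3))
print(ndcg_at_k(one_perfect, judged, 3), ndcg_at_k(two_somewhat, judged, 3))
```

Under these assumptions precision@3 prefers the second ranking (2/3 versus 1/3), while NDCG@3 prefers the first (roughly 0.73 versus 0.39), because it rewards both the grade and the position of the perfect result.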
The interviewer asks: "When would you use BM25 over a neural dense retrieval model, and when would you combine them?" Which answer is most nuanced?
Option B is strongest. It frames the decision as complementary failure modes, the key mental model for search engineering interviews, and names specific scenarios where each approach wins (product codes for BM25, synonyms for dense), which shows operational experience rather than textbook knowledge. It explains why BM25 is faster (inverted index lookups versus ANN vector search with its memory requirements), introduces the RRF formula with the specific k=60 constant, and explains why RRF avoids score normalisation. It also adds the important negative condition: when not to combine the two. The answer closes correctly with an ablation evaluation protocol and specific metrics.

Hybrid retrieval vocabulary:
BM25: Best Match 25, a probabilistic term-frequency ranking function with document length normalisation.
Dense retrieval: embedding-based similarity search using a bi-encoder model.
ANN (Approximate Nearest Neighbour) search: the family of algorithms used to find the nearest embedding vectors efficiently, trading a little accuracy for speed.
Reciprocal Rank Fusion (RRF): a rank-based merging strategy that avoids score normalisation.
Cold-start: the situation where too little training data is available to train a dense retrieval model.

Options C and D name the right concepts but lack the formula-level detail and the failure mode framing.
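Reciprocal Rank Fusion is short enough to sketch directly. The example below is a minimal illustration using the standard RRF score, the sum over rankers of 1 / (k + rank), with k=60 as mentioned above. The document IDs and the two input rankings are invented for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 lists: BM25 finds the exact product code, dense finds a paraphrase.
bm25_top = ["sku-123", "doc-a", "doc-b"]
dense_top = ["doc-a", "doc-b", "doc-c"]

fused = reciprocal_rank_fusion([bm25_top, dense_top])
print(fused)  # ['doc-a', 'doc-b', 'sku-123', 'doc-c']
```

Only ranks enter the score, so the BM25 scores and the embedding similarities never need to be put on a common scale. Documents found by both retrievers rise to the top, while a document found by only one (here the exact-match `sku-123`) still survives in the fused list.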
The interviewer asks: "Walk me through how you would design an A/B test to measure the impact of a new ranking algorithm on search quality." Which answer is most rigorous?
Option B is strongest. The opening "five design decisions that most engineers get wrong" is a deliberate framing that signals deep experience: it implies the candidate has seen experiments fail because of these exact mistakes. The randomisation unit explanation includes the specific contamination mechanism (treatment behaviour bleeds into control behaviour), which most candidates do not articulate. The critique of CTR as a primary metric is paired with two better alternatives, ERR (Expected Reciprocal Rank) and Session Success Rate, each with its definition, which shows the candidate can implement the metric and not just name it. The guardrail metrics section is often missing from candidate answers and is critical for production search teams. The pre-registration plus AA test protocol demonstrates statistical rigour beyond just "check p-value."

Search A/B testing vocabulary:
Randomisation unit: the entity (user, session, query) that is assigned to control or treatment.
Session Success Rate: the proportion of sessions where the user clicked a result and did not return within a dwell time threshold.
Reformulation rate: the rate at which users retype or significantly change their query after seeing results.
AA test: a pre-experiment test where both groups receive the same experience, used to verify that randomisation is unbiased.
Novelty effect: a temporary change in user behaviour caused by the newness of an experience.

Options C and D are accurate but lack the contamination mechanism explanation and the metric definition nuances.
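Two of the pieces named above, user-level randomisation and Session Success Rate, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the salt `"ranker-v2"`, the `Session` record, and the 30-second dwell threshold are all hypothetical, and a real system would read these from experiment configuration and click logs.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

def assign_arm(user_id: str, experiment_salt: str = "ranker-v2") -> str:
    """Deterministic user-level assignment: the same user always lands in the same arm."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

@dataclass
class Session:
    clicked: bool
    # Seconds until the user came back to the results page; None means no return.
    seconds_until_return: Optional[float]

def session_success_rate(sessions, dwell_threshold_s=30.0):
    """Share of sessions with a click and no return within the dwell threshold."""
    if not sessions:
        return 0.0
    successes = sum(
        1 for s in sessions
        if s.clicked
        and (s.seconds_until_return is None or s.seconds_until_return >= dwell_threshold_s)
    )
    return successes / len(sessions)

sessions = [
    Session(clicked=True, seconds_until_return=None),   # success: clicked, never returned
    Session(clicked=True, seconds_until_return=5.0),    # quick bounce back: not a success
    Session(clicked=False, seconds_until_return=None),  # no click: not a success
    Session(clicked=True, seconds_until_return=45.0),   # returned after threshold: success
]
print(session_success_rate(sessions))  # 0.5
```

Hashing the user ID (rather than the session or query) keeps each user in one arm for the whole experiment, which is what prevents the contamination described above. Running the same assignment with both arms served by the current ranker gives the AA test.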
The interviewer asks: "Explain Learning to Rank (LTR). What are the three LTR approaches and when would you use each?" Which answer is most complete?
Option B is strongest. It explains each approach at the loss function level (what is being optimised, not just what the approach is called), which is the depth required for senior search engineering roles. The pointwise limitation is stated with a concrete example: a model that predicts a constant relevance of 2 for every document can score well on pointwise loss (zero when every label is 2) yet cannot order any documents, so it is useless as a ranker. The quadratic growth in the number of document pairs is correctly identified as the practical bottleneck of the pairwise approach. The listwise section correctly identifies LambdaMART as the dominant production algorithm and explains how lambda gradients approximate NDCG optimisation. The feature engineering taxonomy (query, document, and query-document features) shows operational knowledge.

LTR vocabulary:
Learning to Rank (LTR): a supervised ML approach for training a ranking function.
Pairwise loss: a loss computed over pairs of documents, optimising their relative order.
LambdaMART: gradient boosted trees with a listwise lambda objective approximating NDCG, the dominant industrial LTR algorithm.
Lambda gradient: a gradient trick that lets LambdaMART approximately optimise NDCG, even though NDCG itself is non-differentiable.

Options C and D are accurate but lack the pointwise limitation example and the production system recommendation rationale.
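The pointwise limitation is easy to see with numbers. The sketch below compares a pointwise loss (mean squared error) with a RankNet-style pairwise logistic loss on one query. The labels and the two toy scorers are invented for illustration, and neither function is taken from a specific library.

```python
import math
from itertools import combinations

def pointwise_mse(labels, scores):
    """Pointwise loss: mean squared error between each score and its label."""
    return sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels)

def pairwise_logistic_loss(labels, scores):
    """Pairwise loss: mean of log(1 + exp(-(s_hi - s_lo))) over pairs with different labels."""
    losses = []
    for i, j in combinations(range(len(labels)), 2):  # up to n(n-1)/2 pairs per query
        if labels[i] == labels[j]:
            continue  # equal labels carry no ordering preference
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        losses.append(math.log1p(math.exp(-(scores[hi] - scores[lo]))))
    return sum(losses) / len(losses) if losses else 0.0

labels = [3, 2, 1]               # graded relevance for three documents of one query
constant = [2.0, 2.0, 2.0]       # predicts relevance 2 for everything
ordered = [30.0, 20.0, 10.0]     # badly calibrated scores, but a perfect ordering

print(pointwise_mse(labels, constant), pointwise_mse(labels, ordered))
print(pairwise_logistic_loss(labels, constant), pairwise_logistic_loss(labels, ordered))
```

Mean squared error strongly prefers the constant predictor even though it cannot rank anything, while the pairwise loss leaves it at log 2 per pair (the value for a coin flip) and drives the correctly ordered scorer close to zero. The `combinations` loop also shows where the quadratic cost of pairwise training comes from.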
The interviewer asks: "How do you approach query understanding in a search system? What NLP components would you build?" Which answer is most architectural?
Option B is strongest. It frames query understanding as a pipeline of five named components, each with a specific purpose and a concrete example, which is the architectural framing expected at companies like Google, Elastic, or Algolia. The intent classification section names all four intent types along with their routing implications (not just "classify intent" but "transactional intent activates product index retrieval"), which shows operational knowledge. The query expansion section is particularly strong: it names three expansion strategies with different data sources, then adds the critical over-expansion risk and its mitigation (clarity score gating). Including query difficulty estimation as a fifth component is a nuance that distinguishes senior candidates.

Query understanding vocabulary:
Intent classification: categorising queries into informational, navigational, transactional, or local intents.
Entity linking: mapping extracted entities to canonical IDs in a structured knowledge graph.
Query expansion: adding related terms to improve recall.
Topic drift: when query expansion introduces unrelated terms that hurt precision.
Query clarity score: an estimate of how specific and well-defined the query is.

Options C and D list the components correctly but lack the routing implications, the over-expansion risk explanation, and the difficulty estimation component.
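Clarity score gating can be pictured as a guard in front of the expansion step. The sketch below is one possible reading, not the design from Option B: it assumes the clarity score is computed elsewhere and scaled to 0..1, that expansion is skipped when clarity falls below a threshold (ambiguous queries being the ones most exposed to topic drift), and that a small cap on added terms is applied. The function name, threshold, cap, and synonym table are all hypothetical.

```python
def expand_query(terms, synonyms, clarity, min_clarity=0.5, max_added=3):
    """Append up to max_added synonym terms, but only when the query is clear enough."""
    if clarity < min_clarity:
        # Assumed gate direction: an unclear query is left untouched to avoid topic drift.
        return list(terms)
    added = []
    for term in terms:
        for candidate in synonyms.get(term, []):
            if candidate not in terms and candidate not in added:
                added.append(candidate)
    return list(terms) + added[:max_added]

# Hypothetical synonym table, e.g. mined from a thesaurus or query logs.
synonyms = {"laptop": ["notebook"], "bag": ["case", "sleeve"]}

print(expand_query(["laptop", "bag"], synonyms, clarity=0.8))
# ['laptop', 'bag', 'notebook', 'case', 'sleeve']
print(expand_query(["laptop", "bag"], synonyms, clarity=0.2))
# ['laptop', 'bag']
```

The cap on added terms is a second, independent guard: even for a clear query, each extra term trades some precision for recall.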