5 exercises — practise professional English answers for Computer Vision Engineer interviews.
Structure for Computer Vision Engineer answers
Tip 1: Name the detection pipeline: backbone (feature extraction) → neck (e.g. FPN, multi-scale feature fusion) → head (classification + bounding-box regression)
Tip 2: Explain evaluation metrics: IoU threshold, precision-recall curve, mAP@0.5 vs mAP@0.5:0.95
Tip 3: Address dataset quality: annotation consistency, class imbalance, augmentation strategies
Tip 4: Mention deployment trade-offs: model quantisation, TensorRT optimisation, latency vs accuracy
1 / 5
The interviewer asks: "Explain how a modern object detection model like YOLO or DETR works." Which answer best demonstrates architectural understanding?
Option B is strongest because it describes the full pipeline for both architectures and highlights their key architectural differences. Key structure: YOLO: backbone → neck (FPN) → anchor-grid heads, with NMS as post-processing; YOLOv8: anchor-free; DETR: CNN features → transformer encoder-decoder → object queries → bipartite matching during training → set prediction with no NMS. Option A describes only YOLO's grid mechanism, without architectural depth. Option C incorrectly describes DETR as sequential CNN+transformer. Option D describes only pre-training, not detection.
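The NMS step that DETR avoids is easy to sketch, and being able to describe it makes the contrast concrete: grid-based detectors usually emit several overlapping boxes per object and prune them afterwards. The following is a minimal pure-Python sketch of greedy non-maximum suppression, assuming boxes are (x1, y1, x2, y2) tuples and using an illustrative threshold of 0.5. The function names are hypothetical and do not belong to any library.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object plus one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes survive. In an interview answer, this is the post-processing that DETR's one-to-one matching is designed to make unnecessary.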
2 / 5
The interviewer asks: "What is Intersection over Union (IoU) and how is it used in object detection evaluation?" Which answer best demonstrates metric understanding?
Option B is strongest because it gives the formula, the TP/FP rule, connects IoU to mAP, and distinguishes COCO mAP variants. Key structure: IoU = area of intersection / area of union → a detection counts as a TP if its IoU with a ground-truth box meets the threshold, otherwise FP → precision-recall curve → AP per class → mAP → COCO mAP@0.5:0.95, which averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05 and is therefore stricter than mAP@0.5. Option A is correct but vague: no formula, no TP/FP rule, no connection to mAP. Option C confuses IoU with pixel accuracy (a segmentation metric, not an object detection metric). Option D presents an oversimplified and incorrect rule.
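To make the formula and the TP/FP rule concrete, here is a minimal pure-Python sketch. It assumes boxes are (x1, y1, x2, y2) tuples and uses a simplified greedy matching rule (each ground-truth box can be matched once, detections are visited in descending score order). Real benchmark tooling differs in details, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, ground_truths, iou_threshold=0.5):
    """Label each (score, box) detection as TP (True) or FP (False)."""
    matched = set()  # indices of ground-truth boxes already claimed
    flags = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_idx is not None and best_iou >= iou_threshold
        if is_tp:
            matched.add(best_idx)
        flags.append(is_tp)
    return flags


def precision_recall(flags, num_ground_truths):
    """Precision and recall over all detections at one IoU threshold."""
    tp = sum(flags)
    precision = tp / len(flags) if flags else 0.0
    recall = tp / num_ground_truths if num_ground_truths else 0.0
    return precision, recall


ground_truths = [(0, 0, 10, 10)]
detections = [
    (0.9, (1, 1, 11, 11)),    # slightly shifted box, IoU about 0.68
    (0.6, (0, 0, 10, 10)),    # perfect box, but lower confidence
    (0.3, (20, 20, 30, 30)),  # no overlap with any ground truth
]
flags_50 = match_detections(detections, ground_truths, iou_threshold=0.5)
flags_75 = match_detections(detections, ground_truths, iou_threshold=0.75)
```

At a threshold of 0.5 the shifted box is accepted as the true positive and the perfect box becomes a duplicate false positive. At 0.75 the shifted box is rejected and the perfect box is the true positive instead. This is why the same detector can score differently under mAP@0.5 and under the stricter COCO average.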
3 / 5
The interviewer asks: "How do you handle class imbalance in a computer vision dataset?" Which answer best demonstrates dataset management depth?
Option B is strongest because it addresses imbalance at multiple levels: augmentation, loss function, sampling, synthetic data, and evaluation. Key structure: mosaic/mixup/copy-paste augmentation → focal loss to down-weight easy examples → weighted sampler → synthetic data → per-class AP tracking → annotation quality audit. Option A (collect more data) is valid but often impractical and ignores other strategies. Option C mentions only basic augmentation, which is a weak response. Option D is factually incorrect: a detector head seeing a background-dominated image still suffers from foreground class imbalance.
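The focal loss step can be illustrated with a few lines of arithmetic. The sketch below is a pure-Python version of the binary focal loss for a single prediction, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults alpha = 0.25 and gamma = 2 are the commonly cited values and are assumptions here, not settings taken from the exercise. A real training loop would use a tensor library instead.

```python
import math


def binary_cross_entropy(p, target, eps=1e-12):
    """Standard cross-entropy for one predicted probability p."""
    p_t = p if target == 1 else 1.0 - p
    return -math.log(max(p_t, eps))


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one prediction: cross-entropy scaled by
    alpha_t * (1 - p_t) ** gamma, so confident correct predictions
    contribute very little."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


# An easy background example: the model is already 95% sure it is background.
easy_ce = binary_cross_entropy(0.05, target=0)
easy_fl = binary_focal_loss(0.05, target=0)

# A hard foreground example: the model gives the object only 10% probability.
hard_ce = binary_cross_entropy(0.10, target=1)
hard_fl = binary_focal_loss(0.10, target=1)

# Fraction of the cross-entropy that survives the focal weighting.
easy_ratio = easy_fl / easy_ce
hard_ratio = hard_fl / hard_ce
```

The easy example keeps a far smaller fraction of its cross-entropy than the hard one, so the many easy background predictions no longer dominate the gradient. That is the point to make when mentioning focal loss in an answer.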
4 / 5
The interviewer asks: "What steps would you take to optimise a computer vision model for edge deployment?" Which answer best demonstrates production ML engineering?
Option B is strongest because it covers the full model optimisation pipeline from architecture choice through deployment benchmarking. Key structure: efficient architecture → INT8 quantisation (TensorRT/ONNX Runtime) → structured pruning → knowledge distillation → TensorRT/TFLite export → target hardware benchmarking → latency-accuracy Pareto frontier. Option A names only one step (architecture choice). Option C confuses file compression with model optimisation. Option D (lower resolution) is a valid trick but not a complete answer.
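If the interviewer probes what INT8 quantisation actually does, the underlying arithmetic is short. The sketch below shows symmetric per-tensor quantisation in pure Python: one scale factor maps floats onto the integer range [-127, 127]. It only illustrates the idea. It is not how TensorRT or any other toolkit calibrates a model, and the function names and weight values are made up for the example.

```python
def quantise_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantised]


# Hypothetical weights from one layer.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Rounding means each restored value is off by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four, at the cost of a rounding error bounded by half the scale. That bounded error is the accuracy side of the latency vs accuracy trade-off, which is why benchmarking on the target hardware after quantisation matters.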
5 / 5
The interviewer asks: "How do you ensure annotation quality in a computer vision labelling project?" Which answer best demonstrates data quality engineering?
Option B is strongest because it addresses annotation quality systematically across process, measurement, tooling, and continuous monitoring. Key structure: guidelines with visual edge cases → inter-annotator agreement (IAA, e.g. Cohen's kappa > 0.85) → consensus labelling → automated outlier detection → active learning → golden set benchmarking. Option A addresses only inputs (people and guidelines). Option C (10% random review) is necessary but not sufficient. Option D (auto-label + review) is a valid strategy, but the response omits the QA pipeline details.
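Cohen's kappa is worth being able to compute by hand, because it explains why raw agreement alone is a weak quality gate. The sketch below is a minimal pure-Python version for two annotators assigning one class label per item, using kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The labels are invented for illustration. Agreement on box geometry would need a different measure, such as IoU between annotators.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1.0 - p_e)


# Ten items; the annotators disagree on exactly one of them.
annotator_a = ["car"] * 5 + ["person"] * 5
annotator_b = ["car"] * 4 + ["person"] * 6
kappa = cohens_kappa(annotator_a, annotator_b)
```

The two annotators agree on 9 of 10 items, yet chance agreement is 0.5, so kappa comes out at 0.8. Under the 0.85 threshold quoted above, this pair would still be flagged for guideline review despite 90% raw agreement.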