5 exercises covering ETL, LLM, RAG, OLAP, NLP, and ML: the data pipeline and machine learning vocabulary every engineer encounters.
Acronyms covered in this set
ETL / ELT — data pipeline patterns (the order matters)
LLM / RAG — Large Language Model / Retrieval-Augmented Generation
OLAP / OLTP — analytical vs. transactional query workloads
NLP — Natural Language Processing
ML / GPU — Machine Learning / Graphics Processing Unit
1 / 5
A data engineer explains a pipeline: "We use an ETL process to extract data from the source, transform it, and load it into the warehouse — but recently we've been moving to ELT instead." What is the key difference between ETL and ELT?
ETL = Extract, Transform, Load. The traditional data pipeline pattern: (1) Extract data from source systems, (2) Transform it in a separate processing layer (cleaning, joining, aggregating), (3) Load the clean, structured data into the destination (data warehouse). Best when: transformation is complex, source data is sensitive (PII removed before loading), or destination storage is expensive.

ELT = Extract, Load, Transform. The modern variant: (1) Extract data from sources, (2) Load raw data directly into the cloud warehouse (BigQuery, Snowflake, Redshift), (3) Transform in-place using SQL inside the warehouse. Best when: the cloud warehouse is cheap and powerful enough to handle massive transformations, you want to preserve raw data for reprocessing, or dbt-style SQL transformations are preferred.

The shift to ELT is driven by cloud data warehouses making in-warehouse computation economical. Say: "E-T-L", "E-L-T" (letter by letter).
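A minimal sketch of the two patterns in Python. The `source` and `warehouse` objects and their methods (`fetch_orders`, `load`, `execute`) are hypothetical stand-ins for whatever connector and warehouse client you actually use:

```python
def etl(source, warehouse):
    rows = source.fetch_orders()                  # 1. Extract from the source system
    clean = [r for r in rows if r.get("amount", 0) > 0]
    for r in clean:
        r.pop("customer_ssn", None)               # 2. Transform: strip PII before loading
    warehouse.load("orders_clean", clean)         # 3. Load only clean, structured rows

def elt(source, warehouse):
    rows = source.fetch_orders()                  # 1. Extract from the source system
    warehouse.load("orders_raw", rows)            # 2. Load the raw data as-is
    warehouse.execute("""
        CREATE OR REPLACE TABLE orders_clean AS   -- 3. Transform in-place with SQL
        SELECT order_id, amount, region
        FROM orders_raw
        WHERE amount > 0
    """)
```

Note that the raw orders_raw table survives in the ELT version, so the transformation can be rerun or revised later without re-extracting.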
2 / 5
An ML engineer describes a model performance issue: "Our LLM is producing hallucinations — the RAG architecture should help anchor it to real documents." What are LLM and RAG?
LLM = Large Language Model. A type of AI model trained on massive text datasets to generate, summarise, translate, and reason about text. Examples: GPT-4, Claude, Gemini, Llama. LLMs predict the next token in a sequence and can generate fluent, contextually appropriate text. Key limitation: they can "hallucinate", generating confident-sounding but factually incorrect information, because they recall patterns from training data rather than retrieving ground truth.

RAG = Retrieval-Augmented Generation. An architectural pattern that addresses hallucination: when a user asks a question, the system first retrieves relevant documents (from a vector database, search index, or knowledge base), then provides those documents as context to the LLM alongside the question. The LLM generates its answer grounded in the retrieved documents, which dramatically reduces hallucination.

In practice: "We built a RAG pipeline over our internal docs — now the chatbot answers based on our actual runbooks." Say: "L-L-M" (letter by letter), "RAG" as a word (/ræɡ/).
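A minimal RAG loop, as a sketch only: `vector_db.search` and `llm.generate` are hypothetical placeholders for your vector store and model client, not any real library's API:

```python
def answer_with_rag(question, vector_db, llm, k=3):
    # 1. Retrieve: find the k documents most similar to the question
    docs = vector_db.search(question, top_k=k)
    context = "\n\n".join(doc.text for doc in docs)

    # 2. Augment: put the retrieved documents into the prompt
    prompt = (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: the LLM answers grounded in the retrieved text
    return llm.generate(prompt)
```

The grounding comes from the prompt itself: the model is told to answer from the retrieved context rather than from whatever it memorised during training.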
3 / 5
A data analyst explains a query performance issue: "We run OLAP queries on the warehouse and OLTP queries on the production database — mixing them is what causes the slowdowns." What is the difference between OLAP and OLTP?
OLAP = Online Analytical Processing. Designed for complex analytical queries over large historical datasets. Optimised for read-heavy, aggregation-heavy workloads: "What were our total sales by region and product category last quarter?" OLAP systems: Google BigQuery, Snowflake, Amazon Redshift, ClickHouse, Apache Druid. Key characteristics: columnar storage (reads only the needed columns), massive parallelism, denormalised schemas (star schema, fact tables).

OLTP = Online Transaction Processing. Designed for high-frequency, low-latency transactional operations. Optimised for write-heavy workloads: processing orders, updating balances, inserting records. OLTP systems: PostgreSQL, MySQL, MongoDB, Oracle DB. Key characteristics: row storage, ACID transactions, normalised schemas, indexes for fast lookups.

Why mixing them fails: an OLAP query scanning billions of rows on a production OLTP database locks tables, slows writes, and causes latency spikes. The solution: separate the workloads and use CDC (Change Data Capture) to sync data to the warehouse. Say: "O-L-A-P", "O-L-T-P" (letter by letter).
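To make the contrast concrete, here are the two query shapes side by side (the table and column names are invented for illustration):

```python
# OLTP: a point operation that touches one row via an index,
# executed thousands of times per second
OLTP_QUERY = """
    UPDATE accounts
    SET balance = balance - 25.00
    WHERE account_id = 84213;
"""

# OLAP: an aggregation over historical data that scans millions of rows,
# executed a handful of times per day
OLAP_QUERY = """
    SELECT region, product_category, SUM(amount) AS total_sales
    FROM fact_sales                      -- star-schema fact table
    JOIN dim_product USING (product_id)  -- dimension table
    WHERE sale_date >= DATE '2024-01-01'
    GROUP BY region, product_category;
"""
```

Run the second query on the production OLTP database and it competes with every order-processing transaction for I/O and locks, which is exactly the slowdown the analyst describes.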
4 / 5
A data scientist presents results: "The NLP model achieves 94% accuracy on the classification task, but we're seeing class imbalance — the F1 score is a better metric here." What is NLP?
NLP = Natural Language Processing. The field of AI and machine learning focused on enabling computers to understand, interpret, and generate human language. Common NLP tasks: text classification (spam detection, sentiment analysis), Named Entity Recognition (NER: extracting names, dates, locations from text), machine translation (English → French), summarisation (condensing long documents), question answering, and text generation (LLMs). Classic NLP tools: spaCy, NLTK, Stanford CoreNLP. Modern NLP: transformer-based models (BERT, the GPT series, T5).

The F1 score mentioned is the harmonic mean of precision and recall, essential when classes are imbalanced (e.g. 99% legitimate emails, 1% spam: a model that predicts all legitimate gets 99% accuracy but 0% recall on spam).

Say: "N-L-P" (letter by letter). Used in many IT roles: search engines, chatbots, code completion (GitHub Copilot), documentation tools.
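The accuracy-vs-F1 point is easy to verify with a few lines of plain Python on an invented imbalanced dataset (1000 emails, 10 of them spam):

```python
y_true = [1] * 10 + [0] * 990   # 1 = spam, 0 = legitimate
y_pred = [0] * 1000             # a model that predicts "legitimate" for everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"accuracy={accuracy:.3f}  recall={recall:.3f}  F1={f1:.3f}")
# accuracy=0.990  recall=0.000  F1=0.000
```

99% accuracy, yet the model never catches a single spam email; F1 exposes the failure that accuracy hides.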
5 / 5
A data engineer says: "We store our model embeddings in a vector DB, and the raw logs go into an OLAP warehouse — our ML pipeline runs on GPU clusters." In this context, what do ML and GPU stand for?
ML = Machine Learning. A subset of artificial intelligence where systems learn patterns from data rather than following explicitly programmed rules. ML training involves: (1) feeding labelled data (or unlabelled data for unsupervised learning), (2) the model adjusting its parameters (weights) to minimise error, (3) evaluating on held-out test data. Key ML paradigms: supervised learning (labelled data: classification, regression), unsupervised learning (no labels: clustering, dimensionality reduction), reinforcement learning (reward signals: games, robotics), self-supervised learning (LLMs pretrain on predicting next tokens).

GPU = Graphics Processing Unit. Originally built for graphics rendering, GPUs excel at parallel matrix computations — the same math used in neural network training. A data-centre GPU such as an NVIDIA A100 can train neural networks orders of magnitude faster than a general-purpose CPU. In ML infrastructure: "We need 8 A100s to fine-tune this model in a reasonable time."

Say: "M-L" (letter by letter), "G-P-U" (letter by letter). In conversation: "ML pipeline", "ML ops" (MLOps), "ML engineer vs. data scientist" (a blurry line).
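To ground step (2) of the training loop, here is a toy supervised-learning example in plain Python: fitting y = w*x + b by gradient descent on a handful of invented labelled points. In a real ML pipeline, the same parameter-update math runs as large batched matrix operations, which is precisely what GPUs parallelise well.

```python
data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)]  # labelled (x, y) pairs
w, b, lr = 0.0, 0.0, 0.01                                # parameters + learning rate

for epoch in range(2000):
    # Gradients of the mean squared error, averaged over the dataset
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Step the parameters against the gradient to reduce the error
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")  # approaches the slope/intercept of the data
```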