LLM Application Development Language

6 exercises covering the English developers use when building LLM-powered applications — RAG pipelines, function calling, evaluation metrics, prompt engineering, LLMOps, and agentic systems.

Frequently Asked Questions

What is RAG and what vocabulary describes it in English?

RAG stands for Retrieval-Augmented Generation. The vocabulary includes "chunking," "embedding," "vector store," "cosine similarity," "retrieval pipeline," "re-ranking," and "hybrid search." Engineers describe RAG systems by saying "the retriever fetches the top-k most semantically similar chunks, which are then injected into the prompt as context before the LLM generates its response."

What does "fine-tuning" mean and when is it recommended over prompt engineering?

Fine-tuning involves training a pre-trained model on a domain-specific dataset to adapt its behavior, vocabulary, or tone. Engineers recommend fine-tuning when the desired behavior cannot be reliably achieved through prompting alone — for example, when the model must produce outputs in a very specific format or use proprietary terminology consistently. It is more expensive than prompting and requires labeled training data.

What is an embedding and how is the concept explained in English?

An embedding is a numerical vector representation of text (or other data) that captures semantic meaning. When two pieces of text have similar embeddings, they are semantically similar — close together in vector space. Engineers explain this by saying "we embed both the query and the knowledge base documents, then retrieve documents whose vectors are nearest to the query vector."

What does "prompt engineering" mean and which vocabulary terms are essential?

Prompt engineering is the practice of crafting and iterating on inputs to an LLM to reliably produce desired outputs. Key vocabulary includes "system prompt," "few-shot examples," "chain-of-thought," "temperature," "top-p," "context window," "prompt injection," and "jailbreak." Advanced practitioners also discuss "self-consistency," "ReAct prompting," and "structured output constraints."

What is function calling (tool use) in LLM applications?

Function calling allows an LLM to request the execution of predefined functions by outputting structured JSON describing the function name and arguments. The application executes the function and returns the result to the model. Engineers describe this as "the model decides which tool to invoke based on the user's intent, then integrates the tool's response into its final answer."

How is LLM application evaluation discussed in English?

Evaluation vocabulary includes "faithfulness," "answer relevance," "context recall," "hallucination rate," "LLM-as-judge," and "eval suite." Teams set up automated evaluation pipelines that score model responses against a golden dataset. The phrase "our system achieves 92% faithfulness on the RAG eval harness" is a typical way to present results to stakeholders.

What is LLMOps and what does it involve?

LLMOps (Large Language Model Operations) is the set of practices for deploying, monitoring, and iterating on LLM-powered applications in production. Key activities include prompt versioning, model registry management, latency and cost observability, A/B testing between model versions, and setting up alerting on quality degradation. The field borrows heavily from MLOps but adds LLM-specific concerns like prompt drift and context window management.

What is an AI agent in the context of LLM applications?

An AI agent is an LLM-powered system that autonomously takes actions — calling tools, querying databases, browsing the web — to complete multi-step tasks. Vocabulary includes "ReAct pattern," "memory (short-term and long-term)," "planning step," "reflection loop," and "multi-agent orchestration." Engineers describe agents as systems where "the model decides the next action based on the current state and available tools."

What does "context window" mean and why does it matter for application architecture?

The context window is the maximum number of tokens an LLM can process in a single inference call, including the prompt and generated output. Application architecture must stay within this limit, which influences chunking strategies, conversation memory management, and the amount of retrieved context injected into each prompt. Engineers say "we truncate older conversation turns to keep the context within the model's token budget."

What vocabulary is used to describe vector databases in English?

Vector database vocabulary includes "index," "upsert," "nearest neighbor search," "approximate nearest neighbor (ANN)," "namespace," "metadata filtering," and "embedding dimensionality." Popular systems referenced in conversations include Pinecone, Weaviate, Qdrant, and pgvector. Engineers say things like "we filter the vector search by document type metadata before ranking by cosine similarity."