English for Weaviate Developers
Master the English vocabulary developers use for vector search, schema classes, and hybrid queries when discussing Weaviate with a team.
Vector databases like Weaviate bring a vocabulary that mixes information retrieval terms (relevance, recall) with vector-math terms (distance metrics, dimensionality). A team building search or RAG features needs to speak both precisely, because “the search results are bad” can mean very different things depending on which layer the problem is in. This guide covers the English used when discussing Weaviate with a team.
Key Vocabulary
Vector embedding — a numeric representation of text, images, or other data produced by a model, positioned in a high-dimensional space so semantically similar items are close together. “The search is missing obvious synonyms because the embedding model was trained on a different domain — swapping to a domain-tuned model should fix the recall.”
Approximate nearest neighbor (ANN) search — the class of algorithms Weaviate uses to find the closest vectors to a query vector quickly, trading a small amount of accuracy for large speed gains over exact search. “At this collection size, exact nearest neighbor search is too slow for production — we need to tune the ANN index parameters instead of disabling approximation entirely.”
Distance metric — the function (cosine, dot product, Euclidean/L2) used to measure similarity between two vectors. It must match the metric the embedding model was trained with, or the rankings stop being meaningful. “We’re getting poor rankings because the collection is configured for L2 distance, but the embedding model was trained for cosine similarity and its vectors aren’t normalized.”
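To make the metric mismatch concrete, here is a minimal sketch in plain Python, not Weaviate code. The two-dimensional vectors and document names are made up for illustration. It shows that cosine and L2 can rank the same documents differently when vectors have different lengths, and that the two agree once every vector is normalized to unit length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_distance(a, b):
    # 0 means same direction; vector length is ignored.
    return 1.0 - dot(a, b) / (norm(a) * norm(b))

def l2_distance(a, b):
    # Straight-line distance; vector length matters.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    length = norm(a)
    return [x / length for x in a]

def rank(query, docs, distance):
    # Closest document first.
    return sorted(docs, key=lambda name: distance(query, docs[name]))

# Hypothetical 2-D "embeddings", chosen only to make the effect visible.
query = [1.0, 0.0]
docs = {
    "same_direction_long": [10.0, 0.5],
    "off_direction_short": [1.0, 1.0],
}

cosine_ranking = rank(query, docs, cosine_distance)
l2_ranking = rank(query, docs, l2_distance)

# After normalizing every vector, L2 orders results the same way cosine does.
unit_docs = {name: normalize(vec) for name, vec in docs.items()}
l2_ranking_normalized = rank(normalize(query), unit_docs, l2_distance)

print(cosine_ranking)
print(l2_ranking)
print(l2_ranking_normalized)
```

On these toy vectors, cosine puts same_direction_long first and L2 puts off_direction_short first. That kind of reordering is what a teammate means by “the metric doesn’t match the model.”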
Hybrid search — combining vector similarity search with traditional keyword (BM25) search in a single query, usually blended with a weighting parameter, to catch both semantic matches and exact term matches. “Pure vector search missed the exact product SKU in the query — switching to hybrid search with a higher keyword weight should surface exact matches like that reliably.”
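The “weighting parameter” can be pictured with a small sketch. This is a simplified illustration in plain Python, not Weaviate’s actual fusion algorithm. It assumes a weight called alpha where 1.0 means vector-only and 0.0 means keyword-only, and it uses min-max normalization so the two score scales are comparable. The item names and scores are invented.

```python
def min_max(scores):
    # Rescale scores to the 0..1 range so vector and keyword scores are comparable.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_scores(vector_scores, keyword_scores, alpha):
    # alpha = 1.0 -> vector only; alpha = 0.0 -> keyword only (assumed convention).
    v = min_max(vector_scores)
    k = min_max(keyword_scores)
    ids = set(v) | set(k)
    # An item missing from one result list contributes 0 from that side.
    return {i: alpha * v.get(i, 0.0) + (1 - alpha) * k.get(i, 0.0) for i in ids}

def top_result(scores):
    return max(scores, key=scores.get)

# Hypothetical scores: similarity from vector search, BM25-style from keyword search.
vector_scores = {"blue widget": 0.91, "widget SKU-4471": 0.62, "azure gadget": 0.88}
keyword_scores = {"widget SKU-4471": 7.4, "blue widget": 1.1}

vector_heavy_top = top_result(hybrid_scores(vector_scores, keyword_scores, alpha=0.75))
keyword_heavy_top = top_result(hybrid_scores(vector_scores, keyword_scores, alpha=0.25))

print(vector_heavy_top)
print(keyword_heavy_top)
```

With the vector-heavy weight the semantically closest item wins. Shifting the weight toward keywords lets the exact SKU match surface, which is the behavior described in the example sentence above.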
Schema class — Weaviate’s term for a collection type with a defined set of properties and vectorization configuration, analogous to a table in a relational database; recent Weaviate documentation and clients call the same thing a “collection.” “Before we add this new property to the schema class, remember that changing vectorized properties on an existing class usually requires re-indexing the whole collection.”
Certainty / score threshold — a cutoff applied to search results based on similarity score, used to exclude results that are technically the “nearest” but not actually relevant. “Returning the top ten results regardless of score is misleading the user when there are only two genuinely relevant matches — let’s apply a certainty threshold and show fewer, better results.”
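As a rough sketch of what “applying a threshold” means in code, the snippet below filters a result list after the search has run. It is plain Python over made-up results, not a Weaviate client call. The conversion certainty = 1 - distance / 2 is an assumption based on how certainty is commonly described for cosine distance, so check it against the version your team runs.

```python
def cosine_distance_to_certainty(distance):
    # Assumed mapping for cosine distance (range 0..2) onto certainty (range 1..0).
    return 1.0 - distance / 2.0

def apply_threshold(results, min_certainty, limit):
    # Keep only results that clear the cutoff, closest first, capped at `limit`.
    kept = [
        r for r in results
        if cosine_distance_to_certainty(r["distance"]) >= min_certainty
    ]
    kept.sort(key=lambda r: r["distance"])
    return kept[:limit]

# Hypothetical search results; titles and distances are illustrative only.
raw_results = [
    {"title": "Reset your password", "distance": 0.10},
    {"title": "Change account email", "distance": 0.70},
    {"title": "Password rules", "distance": 0.16},
    {"title": "Shipping times", "distance": 0.90},
]

filtered = apply_threshold(raw_results, min_certainty=0.9, limit=10)
filtered_titles = [r["title"] for r in filtered]
print(filtered_titles)
```

Here the limit allows ten results, but only two clear the cutoff, so the user sees two good matches instead of a padded list.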
Common Phrases
- “Does the distance metric here match how the embedding model was trained?”
- “Is this a recall problem with the embeddings, or a ranking problem with the query?”
- “Should this be a hybrid search instead of pure vector similarity?”
- “Does adding this property to the schema class require re-indexing?”
- “Are we applying a score threshold, or just returning the raw top-N regardless of relevance?”
Example Sentences
Reviewing a pull request: “This query returns the top twenty nearest neighbors unconditionally — let’s add a certainty threshold so we’re not surfacing barely-related results just to fill the count.”
Explaining a design decision: “We went with hybrid search because our users search by both natural language descriptions and exact part numbers, and pure vector search was missing the latter.”
Describing a bug: “Search quality dropped after the embedding model upgrade because we kept the old distance metric configuration — the new model expects cosine similarity, not the L2 we had configured.”
Professional Tips
- Say “recall” when the problem is relevant results not appearing at all, and “ranking” when relevant results appear but in the wrong order — conflating them sends debugging in the wrong direction.
- When reviewing search behavior, ask “does the distance metric match the embedding model’s training?” — mismatches here are a very common, easy-to-miss cause of poor results.
- Use “hybrid search” precisely to mean the combination of vector and keyword search — it’s a specific, named technique, not a vague description of “using two kinds of search.”
- Distinguish “ANN” (approximate nearest neighbor, the everyday production mode) from “exact nearest neighbor” (slower, used mainly for small collections or ground-truth comparison).
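The recall versus ranking distinction from the first tip can be checked with two small measurements. The sketch below is plain Python with invented document IDs: recall at k asks whether the relevant items showed up at all, and reciprocal rank asks how high the first relevant item landed.

```python
def recall_at_k(returned, relevant, k):
    # Fraction of the relevant items that appear anywhere in the top k.
    hits = sum(1 for doc_id in returned[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(returned, relevant):
    # 1 / position of the first relevant item; 0.0 if none was returned.
    for position, doc_id in enumerate(returned, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

# Hypothetical ground truth and two hypothetical result lists.
relevant = {"a", "b"}
recall_problem = ["x", "y", "z", "a", "w"]   # "b" never appears
ranking_problem = ["x", "y", "z", "a", "b"]  # both appear, but at the bottom

print(recall_at_k(recall_problem, relevant, k=5))
print(recall_at_k(ranking_problem, relevant, k=5))
print(reciprocal_rank(ranking_problem, relevant))
```

The first list has a recall problem because one relevant item is missing entirely. The second has full recall but a low reciprocal rank, which is a ranking problem.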
Practice Exercise
- Explain in two sentences why a mismatched distance metric can degrade search quality even with a good embedding model.
- Write a one-sentence recommendation for when to use hybrid search instead of pure vector search.
- Describe, in your own words, the difference between a recall problem and a ranking problem.