
What Are Embeddings in AI? A Technical Explanation

Embeddings turn text into numbers that capture meaning. Here's how they work, why they matter for search and RAG, and how to choose the right model for your use case.

Computers don’t understand words. They work with numbers. Embeddings are the bridge: they convert text (words, sentences, paragraphs) into dense numerical vectors that capture semantic meaning. Two sentences with similar meanings produce vectors that are close together in space, even if they share no words.

“The cat sat on the mat” and “A feline rested on the rug” have completely different words but nearly identical embeddings. “Bank” in “I went to the bank to deposit money” and “bank” in “The river bank was muddy” produce very different embeddings because the meaning is different, despite the identical word.

This is the foundation of semantic search, RAG, recommendation systems, clustering, and most of modern AI’s ability to work with meaning rather than just text matching.

How Embeddings Work

An embedding model takes text as input and outputs a vector: a list of numbers, typically 384 to 3,072 dimensions depending on the model. No individual dimension maps to a human-interpretable concept. You can’t say “dimension 47 represents positivity.” The meaning is distributed across all dimensions collectively, which is why these are called “distributed representations.”

The model learns these representations during training. It sees billions of text examples and adjusts its weights so that semantically similar texts produce similar vectors. The training objective varies. Some models learn to predict masked words (BERT-style). Others learn that a question and its correct answer should have similar embeddings while a question and a random passage should have dissimilar ones (contrastive learning).

The result is a vector space where distance corresponds to meaning. You can do arithmetic with meaning: the vector for “king” minus “man” plus “woman” produces a vector close to “queen.” The vector for “Paris” minus “France” plus “Japan” lands near “Tokyo.” These relationships emerge from the statistical patterns in training data, not from explicit programming.

Why Dimensions Matter

More dimensions capture more nuance but cost more to store and search. OpenAI’s text-embedding-3-small produces 1,536-dimensional vectors. Each dimension is a 32-bit float (4 bytes), so one embedding is about 6KB. A million embeddings take 6GB. At 3,072 dimensions (the large model), that doubles.
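The storage arithmetic is worth making concrete. A quick back-of-the-envelope sketch, using the float32 sizes described above:

```python
# Back-of-the-envelope storage cost for float32 embeddings.
dims = 1536               # text-embedding-3-small
bytes_per_dim = 4         # one 32-bit float
bytes_per_vector = dims * bytes_per_dim
print(bytes_per_vector)   # 6144 bytes, about 6 KB per embedding

n_vectors = 1_000_000
total_gb = n_vectors * bytes_per_vector / 1e9
print(round(total_gb, 2))  # 6.14 GB for a million embeddings
```

Doubling the dimensions to 3,072 doubles every number here, which is why dimension count is a cost decision, not just a quality one.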

For most applications, 768-1,536 dimensions are plenty. Going higher gives diminishing returns: the additional dimensions capture increasingly subtle distinctions that rarely matter for practical retrieval. Some models support “Matryoshka” embeddings where you can truncate to fewer dimensions (e.g., use only the first 512 of 1,536) with graceful quality degradation, letting you trade accuracy for storage and speed.
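Truncating a Matryoshka embedding is mechanically simple: keep the leading dimensions and re-normalize. A minimal sketch with numpy, assuming the model was actually trained with a Matryoshka objective (truncating an ordinary embedding this way degrades quality much less gracefully):

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

rng = np.random.default_rng(0)
full = rng.standard_normal(1536)       # stand-in for a real embedding
short = truncate_matryoshka(full, 512)
print(short.shape)                     # (512,)
```

The truncated vector uses a third of the storage and compares three times faster, at a modest cost in retrieval quality.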

Cosine Similarity: How Vectors Get Compared

The standard way to measure how similar two embeddings are is cosine similarity: the cosine of the angle between two vectors. It ranges from -1 (opposite meaning) to 1 (identical meaning), with 0 meaning orthogonal (unrelated).

Cosine similarity ignores magnitude and only considers direction. This matters because embedding models sometimes produce vectors of slightly different lengths for no semantically meaningful reason. By comparing direction only, cosine similarity focuses on meaning.

In practice, you compute the dot product of the two vectors divided by the product of their magnitudes. Most vector databases handle this automatically. You just specify “cosine” as your distance metric and the database returns results ranked by similarity.
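That formula is a few lines of numpy. The toy 2-d vectors below just illustrate the boundary cases; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vectors' magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine_similarity(a, np.array([1.0, 0.0])))   # 1.0  (same direction)
print(cosine_similarity(a, np.array([0.0, 1.0])))   # 0.0  (orthogonal)
print(cosine_similarity(a, np.array([-1.0, 0.0])))  # -1.0 (opposite direction)
print(cosine_similarity(a, np.array([3.0, 0.0])))   # 1.0  (magnitude ignored)
```

The last case shows why cosine similarity is magnitude-blind: scaling a vector changes its length but not its direction, so the score is unchanged.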

Scores above 0.8 typically indicate strong semantic similarity. Scores between 0.5 and 0.8 suggest related content. Below 0.5 is usually not meaningfully related. But these thresholds vary by model, and the absolute numbers matter less than the relative ranking.

Semantic Search vs. Keyword Matching

Traditional search (Elasticsearch, SQL LIKE queries) matches text literally. Search for “vacation policy” and you find documents containing those words. Documents about “time off guidelines” or “PTO requests” won’t match, even though they’re about the same thing.

Semantic search with embeddings matches on meaning. Embed the query, find the nearest document embeddings, return the closest matches. “Vacation policy,” “time off guidelines,” and “PTO requests” all land in the same region of vector space. The user doesn’t need to guess the exact words the document uses.
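At small scale, “find the nearest document embeddings” is just a matrix multiply. A brute-force sketch with toy 3-d vectors standing in for real embeddings (the document labels in the comments are illustrative, not output of any real model):

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int = 2) -> list[int]:
    """Return indices of the k documents most cosine-similar to the query."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    scores = docs_n @ q_n                      # cosine scores, one per doc
    return np.argsort(scores)[::-1][:k].tolist()

docs = np.array([
    [0.9, 0.1, 0.0],   # 0: "vacation policy"
    [0.8, 0.2, 0.1],   # 1: "PTO requests"
    [0.0, 0.1, 0.9],   # 2: "server maintenance"
])
query = np.array([0.85, 0.15, 0.05])           # "time off guidelines"
print(top_k(query, docs))                      # [0, 1] -- the time-off docs
```

The maintenance document points in a different direction and ranks last, even though no keyword comparison ever happened.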

This is transformative for knowledge bases, support systems, and any application where users describe what they want in their own words rather than using the exact terminology of the source material.

The limitation: semantic search is weaker on exact matches. A search for error code “ERR-4012” might return results about errors in general instead of the specific error code, because the embedding model treats the code as a minor detail within the broader concept of “errors.” This is why production systems combine semantic search with keyword search (hybrid search) to get the best of both approaches.
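One common way to fuse the two result lists is reciprocal rank fusion (RRF), which scores each document by its rank in every list rather than by raw scores. A minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_b", "doc_a", "doc_c"]   # ranked by embedding similarity
keyword = ["doc_d", "doc_b", "doc_a"]    # ranked by keyword match (e.g. BM25)
print(reciprocal_rank_fusion([semantic, keyword]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear high in both lists win, while a document found by only one retriever (like an exact error-code match) still surfaces instead of vanishing.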

Choosing an Embedding Model

The embedding model you choose determines the quality of your entire downstream pipeline. A weak embedding model means weak retrieval, which means weak RAG, which means bad responses. No amount of prompt engineering or model selection downstream can fix poor embeddings upstream.

Key factors:

Domain match. General-purpose embedding models (OpenAI, Cohere, Voyage) work well for general text. Specialized domains (legal, medical, scientific) may benefit from domain-specific models fine-tuned on relevant text. A general model might not distinguish between two similar but legally distinct clauses because it never learned that distinction.

Multilingual support. If your data is in multiple languages, you need a multilingual embedding model. These models map semantically equivalent text from different languages to nearby vectors, so a query in English retrieves relevant documents in French or Japanese.

Sequence length. Models have maximum input lengths (typically 512-8,192 tokens). Text longer than this gets truncated, and you lose the truncated content. For long documents, chunk first, then embed each chunk. The chunk size should fit comfortably within the model’s maximum sequence length.
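A crude chunking sketch, splitting on word count with a small overlap so sentences near a boundary appear in both neighboring chunks. Real pipelines count model tokens rather than words, but the structure is the same:

```python
def chunk_by_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word-count chunks (a stand-in for
    token-based chunking; production code counts model tokens, not words)."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = ("word " * 450).strip()              # a 450-word stand-in document
chunks = chunk_by_words(doc, max_words=200, overlap=20)
print(len(chunks))                         # 3 chunks, each within the limit
```

Each chunk then gets embedded individually, so every vector represents a focused slice of the document instead of a truncated whole.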

Hosted vs. local. OpenAI and Cohere embeddings are fast, easy, and require no infrastructure. You send text to their API, get vectors back. Local models (via sentence-transformers or similar) are free per-request, keep data private, and run anywhere, but require GPU hardware for reasonable speed at scale. A single GPU can embed thousands of documents per minute. A CPU-only setup is 10-50x slower.

Benchmarks. The MTEB (Massive Text Embedding Benchmark) leaderboard ranks embedding models across dozens of tasks: retrieval, classification, clustering, re-ranking. Check it before choosing. The best model for retrieval isn’t necessarily the best for classification.

Vector Databases

Once you have embeddings, you need somewhere to store and search them efficiently. A million vectors with 1,536 dimensions can’t be brute-force compared on every query. Vector databases use approximate nearest neighbor (ANN) algorithms (HNSW, IVF, ScaNN) that trade a tiny amount of accuracy for massive speed gains.

Pinecone, Weaviate, Qdrant, Chroma, and Milvus are purpose-built vector databases. They handle indexing, searching, filtering, and scaling, and they support metadata filtering (search only documents from a specific department or date range), which is essential for production use.

PostgreSQL with pgvector adds vector search to an existing Postgres database. This is compelling if you already use Postgres. You keep your existing data model and add vector columns alongside regular columns. The tradeoff: it’s not as fast as purpose-built options at scale, but it’s one less system to deploy and manage.

The right choice depends on scale. Under 100K documents, almost anything works, including a flat array with brute-force search. At 1M+ documents, you need proper ANN indexing. At 100M+, you need a distributed solution.

What Makes Embeddings Good or Bad

An embedding model is only as good as what it learned during training. If the training data didn’t include legal text, the model’s embeddings won’t capture legal nuance. If it wasn’t trained on code, it won’t distinguish between similar functions that do very different things.

You can evaluate embedding quality for your specific use case. Take 50-100 query-document pairs where you know the correct match. Embed them, run similarity search, and measure how often the correct document appears in the top 3 results (Recall@3). If it’s above 90%, your embeddings are working. If it’s below 70%, you need a better model, better chunking, or both.
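Recall@k is simple to compute once you have retrieval results. A sketch over hypothetical document IDs, where `retrieved` holds each query’s top-3 results and `gold` holds the known-correct document for each:

```python
def recall_at_k(results: list[list[str]], expected: list[str], k: int = 3) -> float:
    """Fraction of queries whose correct document appears in the top-k results."""
    hits = sum(1 for retrieved, gold in zip(results, expected)
               if gold in retrieved[:k])
    return hits / len(expected)

# Top-3 retrieved doc IDs per query, plus the known-correct doc for each.
retrieved = [["d1", "d7", "d3"], ["d2", "d9", "d4"], ["d8", "d5", "d6"]]
gold = ["d1", "d4", "d2"]
print(recall_at_k(retrieved, gold))  # 2 of 3 queries hit, so about 0.67
```

In practice you would run this over your 50-100 labeled pairs and compare the score across candidate models or chunking strategies.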

Chunk size matters more than people realize. Embed a 2,000-token passage and the embedding represents the average meaning of the whole passage. If the passage covers three different topics, the embedding is a muddy blend of all three and matches none of them precisely. Embed a 200-token paragraph and the embedding is focused, specific, and matches queries about that specific topic.

The Practical Takeaway

Embeddings are the connective tissue of modern AI systems. They power search, RAG, recommendations, anomaly detection, and classification. The concept is simple: text goes in, numbers come out, similar meaning produces similar numbers. The complexity is in the details: choosing the right model, chunking documents well, picking the right vector database, and evaluating whether your embeddings actually capture the distinctions that matter for your use case.

Chapter 2 of Get Insanely Good at AI covers embeddings from the ground up, including how transformer models learn representations, practical strategies for choosing models and tuning retrieval, and the architecture behind vector search systems.