Embedding Models: Which One, and Why It Matters Less Than You Think

Embedding model choice is a 5% problem for most RAG systems. Your chunking strategy is the 50% problem. Here's how to pick anyway.

You're building RAG. You've spent two days reading benchmarks (MTEB, BEIR, etc.) trying to pick the right embedding model. You're agonizing between OpenAI's text-embedding-3-large, Voyage-3, Cohere embed-v3, and BGE-M3.

Stop. None of this matters as much as you think it does.

For most RAG systems, the embedding model is a 5% problem. Your chunking strategy is the 50% problem. Your retrieval evaluation is the 30% problem. The model is what you optimize last.

What embedding model choice actually changes

A meaningfully better embedding model on the right task improves retrieval recall by 5-15%. That sounds like a lot. In practice it means:

  • Top-5 recall goes from 78% to 85%
  • Top-20 recall goes from 92% to 96%

If your downstream LLM consumes top-20, this is barely visible. If it consumes top-3, you'll feel it.

Compare with: chunking strategy, where switching from naive 512-token chunks to semantic chunks (or paragraph-aware) can improve recall by 30%. That's the bigger lever.
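To make the difference concrete, here is a minimal sketch of both strategies. It approximates tokens with whitespace-separated words, which is an assumption for illustration only; a real pipeline would count with the embedding model's tokenizer. The function names are hypothetical.

```python
def fixed_size_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Naive chunking: cut every max_tokens words, ignoring structure."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def paragraph_aware_chunks(text: str, max_tokens: int = 512) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words.

    A single paragraph longer than max_tokens becomes its own oversized
    chunk here; a fuller implementation would split it further.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The naive version will happily cut a sentence in half at the chunk boundary. The paragraph-aware version never does, at the price of uneven chunk sizes.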

The pragmatic shortlist

Three families that cover 95% of cases:

OpenAI text-embedding-3-small ($0.02/MTok)

  • Cheap, fast, supports dimension reduction (512, 1024 instead of 1536)
  • Good general-purpose performance
  • API-only — you can't self-host

Voyage-3 / Voyage-3-large

  • Strong on technical content (code, scientific docs)
  • Higher cost per token but excellent recall
  • API-only

BGE-M3 / BGE-large

  • Open-weight, run locally
  • Multilingual support
  • Bring-your-own-infra cost (one A10 GPU runs it for free if you're already paying for the box)
  • Slightly behind frontier models on English benchmarks but close

For most teams: start with OpenAI text-embedding-3-small. It's cheap, fast, and the integration is one line. Optimize later if recall is a measurable problem.

When to upgrade beyond the default

Three scenarios that justify deeper investment:

1. Your domain is specialized. Legal text, medical records, code, scientific papers. General models underperform here. Test domain-specific models (Voyage code, BioBERT, etc.) or fine-tune.

2. You need on-prem. Compliance reasons, latency, cost at very high volume. Open-weight models (BGE, GTE, Stella) are required.

3. You've measured a recall problem. Your eval set shows the right docs aren't retrieved. The fix might be the embedding model. More often it's chunking or re-ranking.

If none of these apply, the default model is fine.

What you actually need to set up first

Before agonizing over model choice:

1. An eval set. 50-200 query/document pairs you've manually labeled. "Given this question, which docs in our corpus should appear in top 5?" Without this, you're vibes-only on improvements.

2. A baseline. Pick any embedding model. Measure recall@5, recall@20, and mean reciprocal rank. Note the numbers.

3. The right chunking. Try 256, 512, 1024 token chunks. Try semantic (split on paragraph or section breaks). Measure each. The right answer depends on your content.

4. A re-ranker. A re-ranker (Cohere rerank-3, Voyage rerank-1, or open-weight bge-reranker) takes the top-50 candidates from retrieval and re-scores each one against the query. This typically adds 10-20 points on relevance metrics.

Steps 1-4 will improve your RAG more than 3 weeks of embedding model A/B testing.
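The eval set and baseline from steps 1 and 2 need very little code. A minimal sketch, assuming each eval item is a (query, set of relevant doc IDs) pair and that `search` is a hypothetical callable wrapping whatever retriever you're testing:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, search, ks=(5, 20)):
    """Average recall@k and MRR over (query, relevant_ids) pairs.

    `search` is a hypothetical callable: query -> ranked list of doc IDs.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    totals = {f"recall@{k}": 0.0 for k in ks}
    totals["mrr"] = 0.0
    for query, relevant in eval_set:
        retrieved = search(query)
        for k in ks:
            totals[f"recall@{k}"] += recall_at_k(retrieved, relevant, k)
        totals["mrr"] += reciprocal_rank(retrieved, relevant)
    return {name: total / len(eval_set) for name, total in totals.items()}
```

Run `evaluate` once per configuration (chunk size, model, with and without re-ranking) and compare the numbers side by side. That comparison is the whole point of having the eval set.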

Dimensions: smaller is fine

A common mistake: assuming higher-dimensional embeddings are better.

Higher dims = more storage, more memory, slower search, marginally better recall.

For most tasks, 512-1024 dims is plenty. OpenAI's text-embedding-3 models support dimension reduction through the API's dimensions parameter (request 512 or 1024 instead of the default 1536 for -small or 3072 for -large) with minimal recall loss. Use it.

The exception: very large corpora (>10M docs) where you're already pushing search latency. There, dimension reduction is a real trade of recall for speed and memory rather than a free win. Measure both before committing.
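The storage side of that trade-off is plain arithmetic. A rough sketch, assuming float32 vectors (4 bytes per value), one vector per doc, and no index overhead; real systems store one vector per chunk and add index structures on top, so treat these as lower bounds:

```python
def index_size_gb(n_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return n_docs * dims * bytes_per_value / 1e9


# Compare raw storage for a 10M-vector corpus at three dimension settings.
for dims in (512, 1024, 1536):
    print(f"{dims:>5} dims: {index_size_gb(10_000_000, dims):.1f} GB")
```

For 10M vectors that works out to roughly 20 GB at 512 dims versus roughly 61 GB at 1536, which is often the difference between fitting in RAM on one box and not.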

Hybrid search is the better lever

Pure vector search (dense embeddings) underperforms on:

  • Exact-match queries ("error code 5023")
  • Rare technical terms
  • Acronyms

Pure keyword search (BM25) underperforms on:

  • Conceptual queries ("how do I make this faster")
  • Paraphrased terms

Hybrid search combines both. Reciprocal Rank Fusion (RRF) is a simple, effective merge. Most vector DBs support it natively (Weaviate, Qdrant, Elastic).

Going hybrid usually adds 10-20 points of recall. That's worth more than swapping embedding models.
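If your store doesn't do the fusion for you, RRF is a few lines. A minimal sketch: each ranked list contributes 1 / (k + rank) to a doc's score, and k = 60 is a commonly used default. The doc IDs below are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs; each list adds 1 / (k + rank) per doc."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the same query from the two retrievers.
bm25_hits = ["doc_err_5023", "doc_faq", "doc_perf"]
dense_hits = ["doc_perf", "doc_err_5023", "doc_tuning"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF only looks at ranks, never at raw scores, so you don't have to normalize BM25 scores against cosine similarities. Docs that both retrievers rank highly float to the top.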

The cost angle

For high-volume embedding ingestion (millions of docs):

  • text-embedding-3-small: ~$20 per million docs (assuming 1,000 tokens avg)
  • text-embedding-3-large: ~$130 per million docs
  • Voyage-3-large: ~$180 per million docs
  • BGE-M3 self-hosted: ~$0 if you already have GPUs

For a 10M-doc corpus, the OpenAI bill is $200-1300 once. Then it's just the query-time cost (small). This usually isn't a deciding factor.
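To redo this math for your own corpus, multiply total tokens by the per-MTok price. A small sketch using the $0.02/MTok price quoted above for text-embedding-3-small; the function name is hypothetical, and prices change, so check the provider's current pricing before budgeting:

```python
def embedding_cost_usd(n_docs: int, avg_tokens_per_doc: int, usd_per_mtok: float) -> float:
    """One-time ingestion cost: total tokens (in millions) times the per-MTok price."""
    total_mtok = n_docs * avg_tokens_per_doc / 1_000_000
    return total_mtok * usd_per_mtok


SMALL_USD_PER_MTOK = 0.02  # text-embedding-3-small price quoted in this post

# Cost per million docs at two average document lengths.
for avg_tokens in (500, 1000):
    cost = embedding_cost_usd(1_000_000, avg_tokens, SMALL_USD_PER_MTOK)
    print(f"{avg_tokens} tokens/doc: ${cost:.2f} per million docs")
```

At $0.02/MTok, about $20 per million docs corresponds to roughly 1,000 tokens per doc; at 500 tokens per doc the bill halves to about $10. Either way, it's a rounding error next to engineering time.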

What I actually recommend

For 80% of teams building RAG today:

  1. text-embedding-3-small (1024 dim) for embeddings
  2. Cohere rerank-3 (or Voyage rerank-1) for re-ranking top 50 → top 10
  3. Hybrid search (BM25 + dense) using your vector DB's built-in fusion
  4. Eval set of ~100 hand-labeled queries to measure changes

Total setup time: a day. Total cost at small scale: ~$30/month.

If you have specific reasons to deviate (privacy, domain, cost at scale), deviate. Otherwise: stop reading benchmarks and ship something.

The takeaway

Embedding model choice is a real but small lever. Spending more than a few hours picking is a sign you're avoiding the bigger work — chunking, eval, hybrid search, re-ranking. Pick a default, measure, improve where the metrics tell you to.
