Lesson 2 · 11 min

Embedding models — what to actually use

Closed-source vs open, dense vs sparse vs hybrid, multilingual coverage, dimensions. The 2026 picks and the trade-offs.

The 2026 leaderboard, simplified

For English + general domains:

  • OpenAI text-embedding-3-large (3072 dims). Strong on most benchmarks. Paid.
  • Cohere embed-english-v3 (1024 dims). Strong on retrieval. Paid.
  • Voyage voyage-3 (1024 dims). Currently top of MTEB retrieval. Paid.
  • BGE-M3 (1024 dims, open source). Multilingual + dense + sparse + multi-vector in one model. Often best when self-hosting.
  • E5-mistral-7b-instruct (4096 dims, open source). Highest quality on niche tasks; expensive to host.
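Dimensions are a cost knob as much as a quality knob: a 3072-dim vector costs 3x the storage and similarity compute of a 1024-dim one. Models like text-embedding-3-large expose a dimensions parameter that shortens the vector; the client-side equivalent is truncate-then-renormalize, which works because these embeddings are trained so that leading components carry most of the signal. A minimal sketch (a random vector stands in for a real API embedding):

```python
import numpy as np

def shorten(vec: np.ndarray, dims: int) -> np.ndarray:
    """Truncate an embedding to its first `dims` components and
    L2-renormalize so cosine similarity stays meaningful."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for a 3072-dim embedding; a real one comes from the model.
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
short = shorten(full, 1024)   # unit-length 1024-dim vector
```

In practice, benchmark the shortened vectors on your own retrieval set before committing: the quality loss from 3072 to 1024 dims is usually small but task-dependent.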

For multilingual: BGE-M3 is the default. It covers 100+ languages competitively.

For code: voyage-code-3, or BGE-M3 adapted to a code corpus. General-purpose embedding models typically lose 20-30 points on code-retrieval benchmarks compared with code-specialized ones.
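BGE-M3's dense and sparse outputs are typically combined at query time with a weighted fusion of the two scores. A sketch of that fusion, with toy scores standing in for real model outputs (the mixing weight alpha is a tunable assumption, not an official default):

```python
def hybrid_score(dense_score: float,
                 query_sparse: dict[str, float],
                 doc_sparse: dict[str, float],
                 alpha: float = 0.7) -> float:
    """Mix a dense cosine score with a sparse (lexical) dot-product
    score. query_sparse / doc_sparse map tokens to learned weights,
    as in BGE-M3's sparse output. alpha balances the two signals."""
    sparse_score = sum(w * doc_sparse.get(tok, 0.0)
                       for tok, w in query_sparse.items())
    return alpha * dense_score + (1 - alpha) * sparse_score

# Toy example: exact lexical overlap on "tokenizer" boosts the match.
score = hybrid_score(
    dense_score=0.82,
    query_sparse={"tokenizer": 0.9, "rust": 0.4},
    doc_sparse={"tokenizer": 0.8, "parser": 0.3},
)
```

Sparse scores reward exact term matches (identifiers, error codes, rare names) that dense vectors can blur together, which is one reason hybrid retrieval tends to help on code and technical corpora.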