FAQ: How Do I Run Embedding Models at Scale for RAG Pipelines?

Embedding models are the backbone of every RAG (Retrieval-Augmented Generation) system, semantic search engine, and recommendation pipeline. They're also shockingly cheap to run on GPUs: a single RTX 4090 on io.net can generate roughly 8,000 to 11,000 embeddings per second for models like BGE-large, E5-large, or Nomic-embed-text, making it one of the highest-ROI GPU workloads you can deploy.

The models are small (most are 330M-560M parameters and need under 2GB of VRAM), so you're paying for throughput, not memory. And because embedding is embarrassingly parallel (every document is independent), throughput scales close to linearly as you add GPUs.

Choosing an Embedding Model

The landscape moves fast, but these are the workhorses in production right now:

Model	Dimensions	VRAM	Throughput (4090)	Quality (MTEB avg)
BGE-large-en-v1.5	1024	1.3GB	8,500 docs/sec	63.5
E5-large-v2	1024	1.3GB	8,200 docs/sec	62.4
GTE-large-en-v1.5	1024	1.3GB	7,800 docs/sec	65.4
Nomic-embed-text-v1.5	768	0.8GB	11,000 docs/sec	62.3
Cohere embed-v3 (API)	1024	N/A	N/A	64.5

Throughput measured with batch size 256, sequence length 512 tokens, FP16.
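
Throughput depends heavily on batch size, sequence length, and precision, so it is worth re-measuring on your own data. A minimal timing sketch follows; `measure_docs_per_sec` is a hypothetical helper, and the `encode` argument is assumed to be any callable that embeds a list of texts (for example a lambda wrapping `model.encode(texts, batch_size=256)`).

```python
import time
from typing import Callable, Sequence


def measure_docs_per_sec(
    encode: Callable[[Sequence[str]], object],
    texts: Sequence[str],
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Return sustained documents per second for `encode` over `texts`."""
    # Warm-up passes absorb one-time costs (model load, kernel selection).
    for _ in range(warmup):
        encode(texts)

    start = time.perf_counter()
    for _ in range(repeats):
        encode(texts)
    elapsed = time.perf_counter() - start

    return len(texts) * repeats / elapsed


# Stand-in encoder so the sketch runs without a GPU (assumption for the demo).
def fake_encode(texts: Sequence[str]) -> list:
    time.sleep(0.01)
    return [[0.0] for _ in texts]


rate = measure_docs_per_sec(fake_encode, ["hello world"] * 256)
print(f"{rate:,.0f} docs/sec")
```

Use texts whose token lengths match your real chunks; short test strings will overstate throughput compared with full 512-token inputs.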

For most RAG applications, BGE-large or GTE-large hits the sweet spot of quality and speed. If you need maximum throughput and can accept slightly lower retrieval quality, Nomic's smaller model flies.

Why GPUs Crush CPUs for Embeddings

People sometimes try to run embedding models on CPUs to save money. Bad idea at any meaningful scale:

CPU (8-core): 50-100 docs/sec
RTX 4090: 8,500 docs/sec (85-170x faster)
A100: 15,000 docs/sec (150-300x faster)

At $0.18/hr for a 4090 sustaining 8,500 docs/sec (about 30.6M documents per hour), you're paying roughly $0.000000006 per document embedded. That's about $0.006 per million documents. A CPU instance doing the same job costs more per document because it takes roughly 100x longer, even if the hourly rate is lower.
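
The per-document cost follows directly from the hourly rate and the sustained throughput. A small sketch of that arithmetic, using the $0.18/hr and docs/sec figures quoted in this section; the function names are illustrative, and no CPU hourly price is assumed because the text does not give one.

```python
def cost_per_million_docs(hourly_rate_usd: float, docs_per_sec: float) -> float:
    """USD to embed one million documents at a sustained throughput."""
    docs_per_hour = docs_per_sec * 3600
    return hourly_rate_usd / docs_per_hour * 1_000_000


def breakeven_hourly_rate(
    gpu_hourly_usd: float, gpu_docs_per_sec: float, cpu_docs_per_sec: float
) -> float:
    """Hourly price below which a slower machine matches the GPU per document."""
    return gpu_hourly_usd * cpu_docs_per_sec / gpu_docs_per_sec


gpu_cost = cost_per_million_docs(0.18, 8_500)
cpu_limit = breakeven_hourly_rate(0.18, 8_500, 100)

print(f"RTX 4090: ${gpu_cost:.4f} per million documents")
print(f"A 100 docs/sec CPU must cost under ${cpu_limit:.4f}/hr to match")
```

The second function makes the CPU comparison concrete: a CPU instance only wins if its hourly price is lower than the GPU's by more than the throughput ratio.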

Scaling Architecture for Production

Small scale (< 1M embeddings/day):

One RTX 4090 handles this easily with room to spare. Run the model with a simple FastAPI wrapper:

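from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

# Load the model once at startup; half() casts the weights to FP16,
# matching the precision used for the throughput figures above.
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
model.half()
app = FastAPI()

# Request body is a JSON array of strings, e.g. ["first text", "second text"].
@app.post("/embed")
def embed(texts: list[str]):
    # Plain `def` rather than `async def`: FastAPI runs it in a worker thread,
    # so the blocking encode() call does not stall the event loop.
    embeddings = model.encode(texts, batch_size=256, normalize_embeddings=True)
    return {"embeddings": embeddings.tolist()}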

Monthly cost: $129.60 (one 4090 running 24/7).

Medium scale (1M-100M embeddings/day):

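Multiple GPUs behind a load balancer. Each GPU runs an independent model replica with no cross-GPU communication, so throughput scales close to linearly with the number of replicas.

10M docs/day: 2x RTX 4090 ($259.20/month)
50M docs/day: 7x RTX 4090 ($907.20/month)
100M docs/day: 14x RTX 4090 ($1,814.40/month)

These counts are conservative. At the benchmarked 8,500 docs/sec, one 4090 running flat out embeds about 734M 512-token inputs per day, so the sizing above leaves wide headroom (for example for traffic peaks, documents that split into many chunks, or failover). Size your own deployment from throughput measured at your real sequence lengths and peak load.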
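
The replica count for a given daily volume follows from the sustained per-GPU rate you measure and the headroom you want. A minimal sizing sketch, where `gpus_needed` and its `target_utilization` knob are hypothetical names introduced for illustration:

```python
import math

SECONDS_PER_DAY = 86_400


def gpus_needed(
    docs_per_day: float,
    docs_per_sec_per_gpu: float,
    target_utilization: float = 0.5,
) -> int:
    """Replicas required to embed `docs_per_day` at the given utilization."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    required_rate = docs_per_day / SECONDS_PER_DAY  # average docs/sec needed
    usable_rate = docs_per_sec_per_gpu * target_utilization
    return max(1, math.ceil(required_rate / usable_rate))


for volume in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{volume:>13,} docs/day -> {gpus_needed(volume, 8_500)} GPU(s)")
```

This averages load evenly over 24 hours. If traffic is bursty, size against your peak docs/sec instead of the daily average, and pass the per-GPU rate you measured at your own chunk lengths.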

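Compare to OpenAI's embedding API at $0.13 per million tokens: 100M documents averaging 200 tokens each = 20B tokens = $2,600. At 100M documents per day, that comes to about $78,000 over a 30-day month, versus $1,814.40 for the 14 self-hosted 4090s. Self-hosting on io.net cuts the bill by roughly 98% and removes provider rate limits.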
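
The comparison is easy to rerun with your own volumes. A sketch of the arithmetic, assuming a 30-day month (which is what the $129.60 single-GPU figure implies); the function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month


def gpu_monthly_cost(num_gpus: int, hourly_rate_usd: float = 0.18) -> float:
    """USD per month for GPUs running around the clock."""
    return num_gpus * hourly_rate_usd * HOURS_PER_MONTH


def api_cost(
    num_docs: float, avg_tokens_per_doc: float, usd_per_million_tokens: float
) -> float:
    """USD to embed `num_docs` through a per-token priced API."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


one_day = api_cost(100_000_000, 200, 0.13)        # one day at 100M docs/day
one_month = api_cost(30 * 100_000_000, 200, 0.13)  # 30 days at that rate
self_hosted = gpu_monthly_cost(14)

print(f"API, 100M docs:          ${one_day:,.2f}")
print(f"API, 30 days at 100M/day: ${one_month:,.2f}")
print(f"14x RTX 4090, one month:  ${self_hosted:,.2f}")
```

Keeping the per-day and per-month volumes separate avoids comparing one day of API usage against a full month of GPU rental.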

Large scale (100M+ embeddings/day):

At this volume, consider:
- Quantized models (INT8) for 1.5x throughput with minimal quality loss
- A100 GPUs for higher per-card throughput (15K docs/sec)
- Batched processing during off-peak hours with results cached in a vector database

Embedding Pipeline Best Practices

Batch aggressively. Embedding one document per call can waste around 90% of GPU capacity. Accumulate requests and process them in batches of 128-512. Libraries such as sentence-transformers batch within a single encode() call, but they do not merge separate API requests, so make sure your API layer isn't forwarding one request at a time.
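
One way to accumulate requests is a small micro-batching queue in front of the model. The sketch below is an illustration, not a production server: `MicroBatcher`, `max_wait_s`, and the stand-in `fake_encode` are hypothetical names, and in practice `encode_batch` would wrap something like `model.encode`. It needs Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
from typing import Callable, List, Sequence

Vector = List[float]


class MicroBatcher:
    """Collects single-text requests and encodes them together."""

    def __init__(
        self,
        encode_batch: Callable[[Sequence[str]], Sequence[Vector]],
        max_batch_size: int = 256,
        max_wait_s: float = 0.01,
    ) -> None:
        self._encode_batch = encode_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def embed(self, text: str) -> Vector:
        """Queue one text and wait for its vector."""
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def run(self) -> None:
        """Worker loop: drain the queue into batches and encode each batch."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            # Keep collecting until the batch is full or the wait budget is spent.
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            texts = [text for text, _ in batch]
            try:
                # Run the blocking encode call off the event loop.
                vectors = await asyncio.to_thread(self._encode_batch, texts)
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), vector in zip(batch, vectors):
                if not future.done():
                    future.set_result(vector)


# Stand-in encoder (assumption): records batch sizes, returns 1-d vectors.
def fake_encode(texts: Sequence[str]) -> List[Vector]:
    fake_encode.batch_sizes.append(len(texts))
    return [[float(len(t))] for t in texts]


fake_encode.batch_sizes = []


async def demo() -> List[Vector]:
    batcher = MicroBatcher(fake_encode, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.embed("x" * n) for n in range(1, 9)))
    worker.cancel()
    return list(results)


results = asyncio.run(demo())
print(results, fake_encode.batch_sizes)
```

The `max_wait_s` budget trades a few milliseconds of latency for fuller batches. Eight concurrent callers here are served by far fewer encoder calls than eight.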

Pre-chunk your documents. Embedding models have a max sequence length (typically 512 tokens). Longer documents need chunking. Do this before the GPU step — chunking is CPU work that shouldn't block the GPU pipeline. Chunk at 256-512 tokens with 50-token overlap for best retrieval quality.
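
A sliding window over token IDs is one simple way to do this on the CPU. The sketch below assumes the document has already been tokenized with the embedding model's own tokenizer; `chunk_tokens` is an illustrative helper, not part of any library.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int = 512, overlap: int = 50
) -> List[List[int]]:
    """Split a token sequence into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


doc = list(range(1200))  # stand-in for real token IDs
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])
```

For BERT-style models the tokenizer adds special tokens at both ends, so a `chunk_size` slightly below the model limit (for example 510 for a 512-token limit) leaves room for them.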

Cache embeddings. If the same document gets embedded twice, that's wasted compute. Store embeddings in your vector database (Pinecone, Qdrant, Weaviate, pgvector) and only re-embed when content changes.
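
Keying the cache on a hash of the content makes "only re-embed when content changes" automatic. A minimal sketch, with an in-memory dict standing in for the vector database and `encode_batch` standing in for the model; both, along with the helper names, are assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def content_key(text: str, model_name: str) -> str:
    """Hash of model name plus text, so a model change invalidates old entries."""
    return hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()


def embed_with_cache(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], Sequence[Vector]],
    cache: Dict[str, Vector],
    model_name: str,
) -> List[Vector]:
    """Return one vector per text, encoding only texts not seen before."""
    keys = [content_key(text, model_name) for text in texts]

    # Collect unseen texts once each, preserving first-seen order.
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text

    if missing:
        vectors = encode_batch(list(missing.values()))
        for key, vector in zip(missing.keys(), vectors):
            cache[key] = vector

    return [cache[key] for key in keys]


# Stand-in encoder (assumption) that records what it was asked to embed.
encoded_log: List[List[str]] = []


def fake_encode(batch: Sequence[str]) -> List[Vector]:
    encoded_log.append(list(batch))
    return [[float(len(t))] for t in batch]


store: Dict[str, Vector] = {}
first = embed_with_cache(["alpha", "be", "alpha"], fake_encode, store, "demo-model")
second = embed_with_cache(["alpha", "gamma"], fake_encode, store, "demo-model")
print(encoded_log)
```

Including the model name in the key matters because vectors from different models are not comparable, so switching models should miss the cache.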

Use FP16 for inference. Embedding models lose essentially no retrieval quality at half precision, and on most modern GPUs it roughly doubles throughput while halving the memory the weights need.

Run embedding pipelines on io.net: roughly 8,000 to 11,000 docs/sec on a $0.18/hr RTX 4090.