On io.net, you can serve LLM inference for as little as $0.38 per million tokens using an RTX 4090 with vLLM, roughly 10x cheaper than OpenAI's API and 3-5x cheaper than running the same workload on AWS. The exact cost depends on three things: which GPU you pick, how large the model is, and how well you've optimized your serving stack. This guide breaks down the real numbers so you can build an accurate cost model for your inference budget.
Real-World Cost-Per-Token by Model Size
We benchmarked these numbers using vLLM 0.4+ with continuous batching enabled, which is the standard production setup. All prices reflect io.net's current on-demand rates.
Small models (7-8B parameters):
| Setup | GPU | $/hr | Tokens/hr | Cost per 1M tokens |
|---|---|---|---|---|
| Llama 3 8B, FP16 | RTX 4090 | $0.18 | 342K | $0.53 |
| Llama 3 8B, FP16 | A100 80GB | $1.49 | 720K | $2.07 |
| Llama 3 8B, FP8 | H100 SXM | $2.20 | 1,368K | $1.61 |
| Mistral 7B, AWQ 4-bit | RTX 4090 | $0.18 | 480K | $0.38 |
Medium models (13-34B parameters):
| Setup | GPU | $/hr | Tokens/hr | Cost per 1M tokens |
|---|---|---|---|---|
| Llama 2 13B, GPTQ 4-bit | RTX 4090 | $0.18 | 195K | $0.92 |
| CodeLlama 34B, AWQ 4-bit | A100 80GB | $1.49 | 310K | $4.81 |
| Mixtral 8x7B, FP16 | A100 80GB | $1.49 | 280K | $5.32 |
Large models (70B+ parameters):
| Setup | GPU | $/hr | Tokens/hr | Cost per 1M tokens |
|---|---|---|---|---|
| Llama 3 70B, AWQ 4-bit | RTX 4090 | $0.18 | 48K | $3.75 |
| Llama 3 70B, FP16 | A100 80GB (2x) | $2.98 | 180K | $16.56 |
| Llama 3 70B, FP8 | H100 SXM | $2.20 | 420K | $5.24 |
The takeaway: quantized models on consumer GPUs dominate the cost curve for everything up to 70B parameters.
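The last column in each table is the hourly price divided by the hourly throughput. A minimal sketch of that arithmetic, using rows from the tables above (the function name `cost_per_million_tokens` is illustrative and not part of any io.net or vLLM API; it assumes the GPU runs at the listed throughput for the whole billed hour):

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_hour: float) -> float:
    """Dollars spent to generate one million tokens at the given throughput."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    return hourly_price_usd / tokens_per_hour * 1_000_000


# (setup, $/hr, tokens/hr) copied from the tables above
rows = [
    ("Llama 3 8B FP16 on RTX 4090", 0.18, 342_000),
    ("Llama 3 8B FP16 on A100 80GB", 1.49, 720_000),
    ("Llama 3 70B FP8 on H100 SXM", 2.20, 420_000),
]

for setup, price, throughput in rows:
    print(f"{setup}: ${cost_per_million_tokens(price, throughput):.2f} per 1M tokens")
```

Plugging in your own measured throughput instead of the table values is the quickest way to adapt these numbers to a different model or prompt mix.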
How This Compares to API Providers
If you're considering self-hosted inference vs. API calls, here's the breakeven math:
| Provider and model | Type | Price per 1M output tokens |
|---|---|---|
| OpenAI GPT-4o | Proprietary | $15.00 |
| Anthropic Claude 3.5 | Proprietary | $15.00 |
| OpenAI GPT-4o-mini | Proprietary | $0.60 |
| Together.ai (Llama 3 70B) | Open-source hosted | $0.90 |
| io.net self-hosted (7-8B models, 4090) | Self-hosted | $0.38-0.53 |
| io.net self-hosted (Llama 3 70B, 4090) | Self-hosted | $3.75 |
Self-hosting on io.net beats API providers once you exceed roughly 50,000 tokens per hour of consistent traffic. Below that volume, the operational overhead of managing your own inference endpoint outweighs the savings.
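One way to sanity-check a breakeven point: a rented GPU bills by the hour whether or not it is busy, so the effective cost per token rises as traffic falls. The sketch below assumes a single always-on GPU and ignores engineering and operational overhead; both helper names are hypothetical. The threshold it yields depends heavily on which API price you compare against.

```python
def effective_cost_per_million(hourly_price_usd: float, served_tokens_per_hour: float) -> float:
    """Effective $/1M tokens when the GPU is billed hourly but only partly used."""
    return hourly_price_usd / served_tokens_per_hour * 1_000_000


def breakeven_tokens_per_hour(hourly_price_usd: float, api_price_per_million: float) -> float:
    """Traffic level at which an always-on GPU costs the same as paying the API."""
    return hourly_price_usd / api_price_per_million * 1_000_000


# RTX 4090 at $0.18/hr against the output prices listed in the table above
for name, api_price in [("GPT-4o", 15.00), ("Together.ai Llama 3 70B", 0.90), ("GPT-4o-mini", 0.60)]:
    tokens = breakeven_tokens_per_hour(0.18, api_price)
    print(f"vs {name}: break even at about {tokens:,.0f} tokens/hr")
```

A breakeven is only reachable if it sits below what the GPU can actually sustain, and real deployments should add the cost of monitoring, on-call time, and idle headroom on top of the raw GPU price.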
Optimization Tricks That Cut Your Cost in Half
These aren't theoretical — they're what production teams actually do:
1. Quantize aggressively
AWQ 4-bit quantization reduces weight memory by roughly 75% relative to FP16 and increases throughput by 40-60%, with less than 1% quality degradation on most benchmarks. For a 7-8B model on an RTX 4090, this drops your cost from $0.53 (Llama 3 8B, FP16) to $0.38 (Mistral 7B, AWQ 4-bit) per million tokens.
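The 75% figure follows from the bit widths: 4 bits per weight is a quarter of 16. A rough sketch of the weight-memory arithmetic, assuming weights only (no KV cache, activations, or the small per-group scale metadata that real 4-bit formats carry); `weight_memory_gb` is an illustrative helper, not a library function:

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


fp16 = weight_memory_gb(7, 16)   # 7B parameters at 16 bits each
int4 = weight_memory_gb(7, 4)    # same model at 4 bits each
reduction = 1 - int4 / fp16

print(f"FP16: {fp16:.1f} GB, 4-bit: {int4:.1f} GB, reduction: {reduction:.0%}")
```

The memory freed by smaller weights is what lets the server hold more concurrent requests in KV cache, which is where much of the throughput gain comes from.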
2. Use continuous batching
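vLLM's continuous batching admits new requests into the running batch as soon as earlier ones finish, while PagedAttention manages KV-cache memory in pages so more concurrent requests fit on the GPU. Together they keep the GPU at 85-95% utilization instead of the 30-40% typical of naive serving. This alone can double your throughput.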
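To see why refilling slots matters, here is a toy simulation, not vLLM's actual scheduler. It counts decode steps for a fixed number of batch slots under two policies, using a made-up workload of alternating short and long completions:

```python
import heapq


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    total = 0
    for i in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[i:i + batch_size])
    return total


def continuous_batching_steps(output_lengths, batch_size):
    """A freed slot is refilled immediately by the next queued request."""
    slots = [0] * batch_size  # decode step at which each slot becomes free
    heapq.heapify(slots)
    for length in output_lengths:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + length)
    return max(slots)


lengths = [10, 100] * 4  # tokens generated per request (hypothetical)
total_tokens = sum(lengths)
for name, fn in [("static", static_batching_steps), ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size=2)
    print(f"{name}: {steps} steps, slot utilization {total_tokens / (steps * 2):.0%}")
```

With this workload the static policy needs 400 decode steps and the continuous policy needs 230, because short requests no longer wait for the longest request in their batch.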
3. Enable speculative decoding
For autoregressive models, speculative decoding with a small draft model can improve throughput by 2-3x on the same hardware when most of the draft's guesses are accepted. Support is not universal yet, but vLLM and TensorRT-LLM both offer it.
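A back-of-the-envelope way to reason about the gain is to estimate how many tokens each pass of the large model yields. The sketch below uses the standard simplification that every drafted token is accepted independently with the same probability; the function name and the sample acceptance rates are assumptions for illustration, and the result ignores the cost of running the draft model, so treat it as an upper bound.

```python
def expected_tokens_per_target_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per forward pass of the large (target) model.

    Assumes each drafted token is accepted independently with the same
    probability, which is a simplification of real behavior.
    """
    if not 0 <= acceptance_rate < 1:
        raise ValueError("acceptance_rate must be in [0, 1)")
    return (1 - acceptance_rate ** (draft_tokens + 1)) / (1 - acceptance_rate)


for rate in (0.6, 0.7, 0.8):
    tokens = expected_tokens_per_target_step(rate, draft_tokens=4)
    print(f"acceptance {rate:.0%}, 4 draft tokens: {tokens:.2f} tokens per target pass")
```

Without speculation the large model emits exactly one token per pass, so this ratio is the best-case speedup before draft overhead is subtracted.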
4. Right-size your GPU
Running a 7B model on an H100 is burning money. Match the GPU to the model — the 4090 handles 7-13B models at a fraction of the cost. Save the H100 for 70B+ or latency-critical workloads.
5. Batch your requests
If your application can tolerate 100-200ms of batching delay, grouping requests before processing improves throughput by 3-5x. Works great for async use cases like document processing, email generation, and background summarization.
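A minimal sketch of the grouping idea, assuming requests arrive with millisecond timestamps and that a batch closes a fixed window after its first request. The function name, the 150 ms default, and the sample arrival times are all hypothetical:

```python
def group_into_batches(arrival_times_ms, window_ms=150):
    """Group sorted arrival timestamps into batches.

    A batch opens with the first waiting request and closes window_ms later,
    so no request waits longer than window_ms before processing starts.
    """
    batches = []
    current = []
    window_start = None
    for t in arrival_times_ms:
        if window_start is None or t - window_start > window_ms:
            if current:
                batches.append(current)
            current = [t]
            window_start = t
        else:
            current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0, 40, 120, 200, 210, 500]  # hypothetical arrival times in ms
print(group_into_batches(arrivals))
```

In a real service the same logic would typically live in an async queue consumer that flushes either when the window expires or when a maximum batch size is reached.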
Building a Cost Model for Your Use Case
Here's a framework for estimating monthly inference costs:
Monthly cost = ((daily_requests × avg_tokens_per_request × 30) / tokens_per_hour) × hourly_gpu_cost
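The formula translates directly into a small helper. This is a sketch under the same assumptions as the formula: steady traffic, a 30-day month, and billing only for the GPU hours actually needed. The function name and the sample workload are illustrative.

```python
def monthly_inference_cost(daily_requests, avg_tokens_per_request, tokens_per_hour,
                           hourly_gpu_cost, days_per_month=30):
    """Return (monthly_tokens, gpu_hours, monthly_cost_usd) for a steady workload."""
    monthly_tokens = daily_requests * avg_tokens_per_request * days_per_month
    gpu_hours = monthly_tokens / tokens_per_hour
    return monthly_tokens, gpu_hours, gpu_hours * hourly_gpu_cost


# Hypothetical workload: 2,000 requests/day at 500 tokens each on a $0.18/hr GPU
tokens, hours, cost = monthly_inference_cost(2_000, 500, 342_000, 0.18)
print(f"{tokens:,} tokens, {hours:.1f} GPU-hours, ${cost:.2f}/month")
```

If the GPU stays on around the clock regardless of load, the floor is 720 hours times the hourly rate, so small workloads cost more than this helper suggests.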
Example: Customer support chatbot
- 10,000 conversations/day
- 800 tokens per conversation (avg)
- Model: Llama 3 8B on RTX 4090
- Throughput: 342,000 tokens/hr
Monthly tokens: 10,000 × 800 × 30 = 240M tokens
GPU hours needed: 240M / 342K = 702 hours
Monthly cost: 702 × $0.18 = $126.36/month
Compare that to OpenAI GPT-4o-mini at $0.60/M: 240M × $0.60/M = $144/month — and you own your data, face no rate limits, and can customize the model.
Example: Document processing pipeline
- 50,000 documents/day
- 4,000 tokens per document
- Model: Llama 3 70B on H100 (quality matters)
- Throughput: 420,000 tokens/hr
Monthly tokens: 50,000 × 4,000 × 30 = 6B tokens
GPU hours needed: 6B / 420K = 14,286 hours
GPUs needed (24/7): 14,286 / 720 = ~20 H100s
Monthly cost: 20 × $2.20 × 720 = $31,680/month
At this scale, API pricing would be catastrophic: 6B × $15/M = $90,000/month for GPT-4o. Self-hosting saves $58,320/month.
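For multi-GPU workloads, the GPU count has to be rounded up to a whole number before pricing. A short sketch that reproduces the document-processing numbers above, assuming always-on GPUs and a 720-hour month (the helper names are illustrative):

```python
import math


def gpus_needed(monthly_tokens, tokens_per_hour, hours_per_month=720):
    """Whole number of always-on GPUs required to serve the monthly volume."""
    return math.ceil(monthly_tokens / (tokens_per_hour * hours_per_month))


def fleet_monthly_cost(num_gpus, hourly_gpu_cost, hours_per_month=720):
    """Cost of keeping num_gpus rented for the entire month."""
    return num_gpus * hourly_gpu_cost * hours_per_month


monthly_tokens = 50_000 * 4_000 * 30            # 6B tokens, as in the example
n = gpus_needed(monthly_tokens, 420_000)        # H100 throughput from the example
self_hosted = fleet_monthly_cost(n, 2.20)
api_cost = monthly_tokens / 1_000_000 * 15.00   # GPT-4o output price used above

print(f"{n} GPUs, ${self_hosted:,.0f}/month self-hosted vs ${api_cost:,.0f}/month via API")
```

Rounding up means the fleet has some spare capacity, which is useful headroom for traffic spikes but should be counted as part of the bill.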
Run inference on io.net, from $0.38 per million tokens on RTX 4090.
