OpenAI's Whisper is the gold standard for open-source speech-to-text, and running it at scale on a GPU cloud is surprisingly affordable. A single RTX 4090 on io.net ($0.18/hr) transcribes audio at 30-50x real-time speed with Whisper large-v3, so at the 40x midpoint one hour of audio takes about 90 seconds to process. At that rate, a single $0.18/hr GPU can transcribe 960 hours of audio per day.
The trick to scaling Whisper isn't just throwing more GPUs at it. It's choosing the right model size, using faster-whisper (CTranslate2 backend) instead of the vanilla OpenAI implementation, and batching your audio pipeline intelligently.
Choosing the Right Whisper Model
Not every audio file needs the biggest model:
| Model | Parameters | VRAM | Speed (RTX 4090) | WER (English) | Best For |
|---|---|---|---|---|---|
| tiny | 39M | <1GB | 180x real-time | 7.6% | Quick previews, low-quality audio |
| base | 74M | <1GB | 140x real-time | 5.4% | Casual transcription |
| small | 244M | ~2GB | 90x real-time | 3.4% | Good quality, high throughput |
| medium | 769M | ~5GB | 50x real-time | 2.9% | Professional use |
| large-v3 | 1.55B | ~10GB | 30x real-time | 2.0% | Maximum accuracy |
For most production use cases, small or medium deliver the best speed-to-accuracy tradeoff. large-v3 is worth it only when transcription quality is paramount (medical, legal, media production).
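The tradeoff can be expressed as a small selection helper. This is a minimal sketch that copies the figures from the table above (with "<1GB" treated as 1 GB); the function name `pick_model` and the two constraints are illustrative assumptions, not part of any Whisper API.

```python
# Figures copied from the table above; treat them as rough guides, not benchmarks.
MODELS = {
    "tiny":     {"vram_gb": 1,  "speed_x": 180, "wer_pct": 7.6},
    "base":     {"vram_gb": 1,  "speed_x": 140, "wer_pct": 5.4},
    "small":    {"vram_gb": 2,  "speed_x": 90,  "wer_pct": 3.4},
    "medium":   {"vram_gb": 5,  "speed_x": 50,  "wer_pct": 2.9},
    "large-v3": {"vram_gb": 10, "speed_x": 30,  "wer_pct": 2.0},
}


def pick_model(max_wer_pct: float, vram_budget_gb: float) -> str:
    """Return the fastest model that meets the WER ceiling within the VRAM budget."""
    candidates = [
        name for name, spec in MODELS.items()
        if spec["wer_pct"] <= max_wer_pct and spec["vram_gb"] <= vram_budget_gb
    ]
    if not candidates:
        raise ValueError("no model satisfies both constraints")
    return max(candidates, key=lambda name: MODELS[name]["speed_x"])


print(pick_model(max_wer_pct=3.5, vram_budget_gb=24))  # small
print(pick_model(max_wer_pct=2.5, vram_budget_gb=24))  # large-v3
```

Picking by an explicit error ceiling keeps the decision tied to the workload's quality requirement rather than defaulting to the largest model.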
The faster-whisper Advantage
A near drop-in replacement that runs roughly 2-3x faster in the comparison below. The faster-whisper library reimplements Whisper on CTranslate2, which supports INT8 quantization and optimized kernels:
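from faster_whisper import WhisperModel

# Load large-v3 on the GPU with INT8 weights to cut VRAM use
model = WhisperModel("large-v3", device="cuda", compute_type="int8")
# transcribe() returns a lazy generator; decoding runs as segments are iterated
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")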
Performance comparison on RTX 4090:
| Implementation | Model | Speed | VRAM |
|---|---|---|---|
| OpenAI whisper | large-v3 | 30x real-time | 10GB |
| faster-whisper (FP16) | large-v3 | 70x real-time | 5GB |
| faster-whisper (INT8) | large-v3 | 95x real-time | 3GB |
With faster-whisper INT8, a single RTX 4090 transcribes ~2,280 hours of audio per day. That's the entire podcast output of a major media company, for $4.32/day.
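The daily figures follow directly from the real-time factor. Here is a short sketch of that arithmetic; the `utilization` parameter is an added assumption to account for idle time between jobs and is not a number from the benchmarks above.

```python
def audio_hours_per_day(speed_x: float, utilization: float = 1.0) -> float:
    """Audio hours one GPU can transcribe in 24 wall-clock hours."""
    return speed_x * 24 * utilization


def daily_gpu_cost(hourly_rate_usd: float) -> float:
    """Cost of renting one GPU for 24 hours."""
    return hourly_rate_usd * 24


print(round(audio_hours_per_day(95)))     # 2280
print(round(daily_gpu_cost(0.18), 2))     # 4.32
# With the GPU busy only 80% of the time (hypothetical), throughput drops accordingly.
print(round(audio_hours_per_day(95, utilization=0.8)))  # 1824
```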
Scaling Architecture
For batch processing (podcasts, meeting recordings, archives):
Audio Files → S3/GCS → Job Queue (Redis) → GPU Workers → Transcripts → Storage
Each worker pulls audio from the queue, transcribes it, and pushes the results to storage. Scale the number of workers based on queue depth.
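A worker loop along these lines might look like the sketch below. To stay self-contained it uses Python's in-process `queue.Queue` as a stand-in for Redis and a plain dict as a stand-in for storage; `run_worker`, the `transcribe` callable, and the `failed` list are hypothetical names chosen for illustration.

```python
import queue
from typing import Callable, Dict, List


def run_worker(jobs: queue.Queue,
               transcribe: Callable[[str], str],
               results: Dict[str, str],
               failed: List[str]) -> int:
    """Drain the queue, transcribing each audio key; return the number that succeeded."""
    processed = 0
    while True:
        try:
            audio_key = jobs.get_nowait()
        except queue.Empty:
            # A long-running worker would block and wait here instead of exiting.
            return processed
        try:
            results[audio_key] = transcribe(audio_key)
            processed += 1
        except Exception:
            # Record the failure so one bad file does not stop the worker.
            failed.append(audio_key)
        finally:
            jobs.task_done()


def fake_transcribe(key: str) -> str:
    """Placeholder for a real model call; fails on keys starting with 'bad'."""
    if key.startswith("bad"):
        raise RuntimeError("cannot decode")
    return f"text:{key}"


jobs: queue.Queue = queue.Queue()
for key in ["a.mp3", "bad.mp3", "b.mp3"]:
    jobs.put(key)

results: Dict[str, str] = {}
failed: List[str] = []
count = run_worker(jobs, fake_transcribe, results, failed)
```

Passing the transcriber in as a callable keeps the loop testable without a GPU; in a real deployment it would wrap a loaded model so the weights are loaded once per worker, not once per file.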
Sizing:
- 10K hours/day: 5x RTX 4090 ($21.60/day)
- 100K hours/day: 45x RTX 4090 ($194.40/day)
- 1M hours/day: 450x RTX 4090 ($1,944/day) — or use cheaper GPUs
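The sizing list can be reproduced with a ceiling division. The sketch below computes the minimum GPU count at 95x real-time and the $0.18/hr rate quoted earlier; the larger tiers in the list sit slightly above these minimums, which leaves a little headroom.

```python
import math


def min_gpus(audio_hours_per_day: float, speed_x: float = 95.0) -> int:
    """Minimum GPUs needed if every GPU runs flat out for 24 hours."""
    per_gpu_hours = speed_x * 24  # 2,280 audio hours per GPU per day at 95x
    return math.ceil(audio_hours_per_day / per_gpu_hours)


def fleet_cost_per_day(gpus: int, hourly_rate_usd: float = 0.18) -> float:
    """Daily rental cost for the whole fleet, rounded to cents."""
    return round(gpus * hourly_rate_usd * 24, 2)


print(min_gpus(10_000))      # 5
print(min_gpus(100_000))     # 44
print(min_gpus(1_000_000))   # 439
print(fleet_cost_per_day(5))  # 21.6
```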
For real-time streaming (live captions, call centers):
Audio Stream → Chunker (30s segments) → GPU Pool → WebSocket → Client
Real-time workloads need each chunk processed in well under a second so captions keep pace with the speaker. Use small or medium models for speed, dedicate one GPU per 20-40 concurrent streams, and chunk audio into 30-second segments with a 2-second overlap for context continuity.
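The chunking scheme is easy to get off by one, so here is a minimal sketch of the window arithmetic: 30-second windows that each start 2 seconds before the previous one ends. The function name `chunk_windows` is illustrative, and merging the duplicated text in the overlap region is left out.

```python
from typing import Iterator, Tuple


def chunk_windows(total_s: float,
                  chunk_s: float = 30.0,
                  overlap_s: float = 2.0) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) windows in seconds; consecutive windows share overlap_s."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    step = chunk_s - overlap_s
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        yield (start, end)
        if end >= total_s:
            break  # the last window reached the end of the audio
        start += step


print(list(chunk_windows(70.0)))
# [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```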
Sizing:
- 100 concurrent streams: 3-5x RTX 4090 ($12.96-$21.60/day)
- 1,000 concurrent streams: 25-50x RTX 4090 ($108-$216/day)
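These ranges come from dividing the stream count by the 20-40 streams-per-GPU figure and rounding up. A quick sketch, with `gpus_for_streams` as a hypothetical helper name:

```python
import math
from typing import Tuple


def gpus_for_streams(streams: int,
                     low_density: int = 20,
                     high_density: int = 40) -> Tuple[int, int]:
    """Return (fewest, most) GPUs needed given a streams-per-GPU range."""
    fewest = math.ceil(streams / high_density)  # every GPU carries 40 streams
    most = math.ceil(streams / low_density)     # every GPU carries only 20
    return fewest, most


print(gpus_for_streams(100))    # (3, 5)
print(gpus_for_streams(1000))   # (25, 50)
```

Where a deployment lands inside that range depends on the model size and how bursty the streams are, so measuring on real traffic before committing to a fleet size is worthwhile.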
Cost Comparison: Self-Hosted vs API
| Provider | Price per hour of audio | 10K hours/month |
|---|---|---|
| OpenAI Whisper API | $0.36/hr | $3,600 |
| Google Speech-to-Text | $0.96-$1.44/hr | $9,600-$14,400 |
| AWS Transcribe | $0.72-$1.44/hr | $7,200-$14,400 |
| io.net (faster-whisper, RTX 4090) | $0.0019/hr | $19 |
Self-hosting on io.net is about 190x cheaper than OpenAI's API and about 500x cheaper than Google's lowest rate, assuming the GPU stays fully utilized. Even accounting for engineering time to set up the pipeline, the ROI is obvious above ~100 hours of audio per month.
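The per-hour figure in the table is just the GPU rental rate divided by the real-time factor. A short sketch of that calculation, using only the prices quoted above and assuming the GPU is never idle:

```python
def cost_per_audio_hour(gpu_hourly_usd: float, speed_x: float) -> float:
    """Rental cost attributable to one hour of audio on a fully utilized GPU."""
    return gpu_hourly_usd / speed_x


self_hosted = cost_per_audio_hour(0.18, 95)

print(round(self_hosted, 4))            # 0.0019
print(round(0.36 / self_hosted))        # 190  (vs the OpenAI API price in the table)
print(round(self_hosted * 10_000, 2))   # 18.95 (10K audio hours per month)
```

Any idle time raises the effective cost per audio hour proportionally, so the ratio holds best for batch workloads that keep the queue full.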
Production Tips
Pre-process audio before sending it to the GPU. Convert to 16kHz mono WAV, trim silence, and normalize volume. This can reduce processing time by 10-20% and improve accuracy.
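One way to do this is with ffmpeg, driven from Python. The sketch below assumes ffmpeg is installed and on the PATH; the filter settings (a -50dB threshold for leading silence, default `loudnorm` targets) are illustrative starting points, not tuned values.

```python
import subprocess
from typing import List


def ffmpeg_preprocess_cmd(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that outputs 16 kHz mono audio, trimmed and normalized."""
    filters = ",".join([
        # Trim leading silence below -50 dB (threshold is an assumed starting point).
        "silenceremove=start_periods=1:start_threshold=-50dB",
        # EBU R128 loudness normalization with ffmpeg's default targets.
        "loudnorm",
    ])
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", src,
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate
        "-af", filters,
        dst,
    ]


def preprocess(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_preprocess_cmd(src, dst), check=True)


cmd = ffmpeg_preprocess_cmd("meeting.mp3", "meeting.wav")
```

Trimming leading silence shifts every timestamp by the trimmed amount, so pipelines that need timestamps aligned to the original file should record that offset or skip the trim.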
Use VAD (Voice Activity Detection) to skip silence. faster-whisper includes Silero VAD integration. Skip silent segments entirely: for typical meeting recordings with 30-40% silence, this raises effective throughput by roughly 1.4-1.7x.
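In faster-whisper, VAD is switched on through the `vad_filter` argument to `transcribe()`. The sketch below pairs that call with a simple upper bound on the gain, which assumes decode time scales linearly with audio duration; the 500 ms silence threshold and the helper names are illustrative choices.

```python
from typing import List, Tuple


def vad_speedup_bound(silence_fraction: float) -> float:
    """Upper bound on throughput gain when a given fraction of audio is skipped."""
    if not 0 <= silence_fraction < 1:
        raise ValueError("silence_fraction must be in [0, 1)")
    return 1.0 / (1.0 - silence_fraction)


def transcribe_with_vad(path: str, model_size: str = "small") -> List[Tuple[float, float, str]]:
    """Transcribe a file with Silero VAD filtering enabled."""
    # Imported lazily so the arithmetic helper above works without the library installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="int8")
    segments, _info = model.transcribe(
        path,
        vad_filter=True,
        # Treat gaps of at least 500 ms as silence (an assumed starting point).
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    return [(seg.start, seg.end, seg.text) for seg in segments]


print(round(vad_speedup_bound(0.30), 2))  # 1.43
print(round(vad_speedup_bound(0.40), 2))  # 1.67
```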
Batch short files together. If you're transcribing thousands of 30-second voicemails, batch them into longer pseudo-files to amortize model inference overhead. The GPU setup cost per chunk is fixed; longer chunks are more efficient.
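Grouping clips for concatenation can be done with a greedy pass over the file list. This is a minimal sketch that only plans the batches; the 600-second target and the name `group_clips` are assumptions, and the actual audio concatenation is left to the pipeline.

```python
from typing import Iterable, List, Tuple


def group_clips(clips: Iterable[Tuple[str, float]],
                target_s: float = 600.0) -> List[List[str]]:
    """Greedily pack (name, duration_s) clips into batches of at most target_s seconds.

    A single clip longer than target_s still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_s = 0.0
    for name, duration_s in clips:
        # Close the current batch if adding this clip would exceed the target.
        if current and current_s + duration_s > target_s:
            batches.append(current)
            current, current_s = [], 0.0
        current.append(name)
        current_s += duration_s
    if current:
        batches.append(current)
    return batches


voicemails = [("vm1.wav", 30.0), ("vm2.wav", 30.0), ("vm3.wav", 30.0)]
print(group_clips(voicemails, target_s=60.0))
# [['vm1.wav', 'vm2.wav'], ['vm3.wav']]
```

When the clips are concatenated, keeping each clip's start offset within the pseudo-file makes it possible to map transcript segments back to the original voicemail.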
Transcribe audio at scale on io.net: faster-whisper on RTX 4090 for $0.002 per audio hour.
