OpenAI's Whisper is the gold standard for open-source speech-to-text, and running it at scale on GPU cloud is surprisingly affordable. A single RTX 4090 on io.net ($0.18/hr) transcribes audio at 30-50x real-time speed with Whisper large-v3. At roughly 40x, one hour of audio takes about 90 seconds to process, so a single GPU gets through about 960 hours of audio per day (40 × 24) for $4.32.

The trick to scaling Whisper isn't just throwing more GPUs at it. It's choosing the right model size, using faster-whisper (CTranslate2 backend) instead of the vanilla OpenAI implementation, and batching your audio pipeline intelligently.

Choosing the Right Whisper Model

Not every audio file needs the biggest model:

| Model | Parameters | VRAM | Speed (RTX 4090) | WER (English) | Best For |
|---|---|---|---|---|---|
| tiny | 39M | <1GB | 180x real-time | 7.6% | Quick previews, low-quality audio |
| base | 74M | <1GB | 140x real-time | 5.4% | Casual transcription |
| small | 244M | ~2GB | 90x real-time | 3.4% | Good quality, high throughput |
| medium | 769M | ~5GB | 50x real-time | 2.9% | Professional use |
| large-v3 | 1.55B | ~10GB | 30x real-time | 2.0% | Maximum accuracy |

For most production use cases, small or medium deliver the best speed-to-accuracy tradeoff. large-v3 is worth it only when transcription quality is paramount (medical, legal, media production).
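The tradeoff can be expressed as a small lookup: given an accuracy target, pick the fastest model that meets it. This is a minimal sketch; the speed and WER figures are copied from the table above, and `pick_model` is a hypothetical helper, not part of any Whisper library.

```python
# Figures copied from the model table above (RTX 4090, English WER).
MODELS = [
    # (name, speed as a multiple of real time, WER in percent)
    ("tiny", 180, 7.6),
    ("base", 140, 5.4),
    ("small", 90, 3.4),
    ("medium", 50, 2.9),
    ("large-v3", 30, 2.0),
]


def pick_model(max_wer_percent):
    """Return the fastest model whose WER is at or below the target, or None."""
    candidates = [m for m in MODELS if m[2] <= max_wer_percent]
    if not candidates:
        return None
    # Among models that meet the accuracy target, the highest speed wins.
    return max(candidates, key=lambda m: m[1])[0]
```

For example, a 3.5% WER budget selects small, while a 2.0% budget leaves large-v3 as the only option.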

The faster-whisper Advantage

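faster-whisper is a reimplementation of Whisper that runs up to 4x faster than the reference code (roughly 2-3x in the large-v3 numbers below). It uses CTranslate2 under the hood, which supports INT8 quantization and optimized kernels. The API differs slightly from the OpenAI package, but the switch takes only a few lines: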

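from faster_whisper import WhisperModel

# Load large-v3 on the GPU with INT8 weights to cut VRAM use.
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# transcribe() returns a lazy generator; decoding runs as you iterate.
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")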

Performance comparison on RTX 4090:

| Implementation | Model | Speed | VRAM |
|---|---|---|---|
| OpenAI whisper | large-v3 | 30x real-time | 10GB |
| faster-whisper (FP16) | large-v3 | 70x real-time | 5GB |
| faster-whisper (INT8) | large-v3 | 95x real-time | 3GB |

With faster-whisper INT8, a single RTX 4090 transcribes ~2,280 hours of audio per day (95 × 24). That's on the order of a major media company's entire podcast output, for $4.32/day (24 × $0.18).
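The throughput and cost figures follow from two one-line formulas. The sketch below assumes the GPU is busy around the clock and that the benchmark speed holds for your audio; the function names are illustrative.

```python
def audio_hours_per_day(speed_factor, gpus=1):
    """Audio hours that one or more GPUs can transcribe in 24 wall-clock hours."""
    return speed_factor * 24 * gpus


def cost_per_audio_hour(gpu_price_per_hour, speed_factor):
    """GPU rental cost in USD to transcribe one hour of audio."""
    return gpu_price_per_hour / speed_factor
```

With the numbers above, `audio_hours_per_day(95)` gives 2280 and `cost_per_audio_hour(0.18, 95)` comes to about $0.0019.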

Scaling Architecture

For batch processing (podcasts, meeting recordings, archives):

Audio Files → S3/GCS → Job Queue (Redis) → GPU Workers → Transcripts → Storage

Each worker pulls a job from the queue, transcribes the audio, and pushes the result to storage. Scale the number of workers based on queue depth.
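A worker loop for this pipeline might look like the sketch below. It assumes jobs are JSON strings with hypothetical `id` and `audio_uri` fields pushed onto a Redis list, and that `queue` is a `redis.Redis` client (or anything exposing the same `blpop` method). The `transcribe` and `store` callables stand in for your own model call and storage layer.

```python
import json


def run_worker(queue, transcribe, store, queue_name="transcribe:jobs", max_jobs=None):
    """Pull jobs from a Redis-style list and transcribe them until the queue is idle.

    queue      : object with blpop(key, timeout=...), e.g. a redis.Redis client
    transcribe : callable(audio_uri) -> transcript text
    store      : callable(job_id, transcript) that persists the result
    """
    done = 0
    while max_jobs is None or done < max_jobs:
        item = queue.blpop(queue_name, timeout=5)
        if item is None:  # nothing arrived within the timeout
            break
        _key, payload = item
        job = json.loads(payload)
        store(job["id"], transcribe(job["audio_uri"]))
        done += 1
    return done
```

Exiting when the queue stays empty is one possible design choice: it lets an autoscaler release idle GPUs and start new workers when queue depth grows again.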

Sizing (assuming faster-whisper INT8 at ~2,280 audio hours per GPU per day):
- 10K hours/day: 5x RTX 4090 ($21.60/day)
- 100K hours/day: 45x RTX 4090 ($194.40/day)
- 1M hours/day: 450x RTX 4090 ($1,944/day) — or use cheaper GPUs
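The sizing above can be reproduced with a ceiling division. This sketch assumes the 95x INT8 speed and the $0.18/hr price quoted earlier, with every GPU fully utilized; the helper names are illustrative.

```python
import math


def gpus_needed(audio_hours_per_day, speed_factor=95):
    """Minimum GPU count needed to keep up with a daily audio volume."""
    per_gpu = speed_factor * 24  # audio hours one GPU handles per day
    return math.ceil(audio_hours_per_day / per_gpu)


def fleet_cost_per_day(gpu_count, gpu_price_per_hour=0.18):
    """Daily rental cost in USD for a fleet that runs around the clock."""
    return round(gpu_count * gpu_price_per_hour * 24, 2)
```

The strict minimums come out to 5, 44, and 439 GPUs, so the 45 and 450 figures in the list include a small amount of headroom.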

For real-time streaming (live captions, call centers):

Audio Stream → Chunker (30s segments) → GPU Pool → WebSocket → Client

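Real-time needs sub-second processing latency per chunk. At the 50-90x real-time speeds listed for medium and small, a 30-second chunk takes roughly 0.3-0.6 seconds to transcribe, so use those models for speed, dedicate one GPU per 20-40 concurrent streams, and chunk audio into 30-second segments with 2-second overlap for context continuity.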
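The windowing can be sketched as follows. This is an offline illustration that assumes the total duration is known; a live chunker would emit each window as soon as enough audio has arrived. `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0, overlap_seconds=2.0):
    """Yield (start, end) times that cover the audio with overlapping windows."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    step = chunk_seconds - overlap_seconds  # each window starts 28 s after the last
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        yield (start, end)
        if end >= total_seconds:
            break
        start += step
```

For 70 seconds of audio this produces the windows 0-30, 28-58, and 56-70, so each chunk repeats the last 2 seconds of the previous one.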

Sizing:
- 100 concurrent streams: 3-5x RTX 4090 ($12.96-$21.60/day)
- 1,000 concurrent streams: 25-50x RTX 4090 ($108-$216/day)

Cost Comparison: Self-Hosted vs API

| Provider | Price per hour of audio | 10K hours/month |
|---|---|---|
| OpenAI Whisper API | $0.36/hr | $3,600 |
| Google Speech-to-Text | $0.96-$1.44/hr | $9,600-$14,400 |
| AWS Transcribe | $0.72-$1.44/hr | $7,200-$14,400 |
| io.net (faster-whisper, RTX 4090) | $0.0019/hr | $19 |

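Self-hosting on io.net is about 190x cheaper than OpenAI's API and at least 500x cheaper than Google, assuming the GPU is kept fully utilized ($0.18/hr ÷ 95 audio hours per GPU hour ≈ $0.0019). Even accounting for engineering time to set up the pipeline, the ROI is obvious above ~100 hours of audio per month.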

Production Tips

Pre-process audio before sending to GPU. Convert to 16kHz mono WAV, trim silence, normalize volume. This reduces processing time by 10-20% and improves accuracy.
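One way to do this is with ffmpeg. The sketch below builds the command in Python, assuming ffmpeg is installed and on the PATH. The -50 dB silence threshold and the default `loudnorm` targets are assumptions to tune for your audio.

```python
import subprocess


def build_ffmpeg_cmd(src, dst):
    """Build an ffmpeg command that writes 16 kHz mono 16-bit PCM WAV."""
    filters = ",".join([
        # Drop leading silence quieter than -50 dB (threshold is an assumption).
        "silenceremove=start_periods=1:start_threshold=-50dB",
        # EBU R128 loudness normalization with ffmpeg's default targets.
        "loudnorm",
    ])
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", filters,
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate
        "-c:a", "pcm_s16le",
        dst,
    ]


def preprocess(src, dst):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Trimming leading silence shifts every timestamp by the trimmed amount, which matters if transcripts must line up with the original file.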

Use VAD (Voice Activity Detection) to skip silence. faster-whisper includes Silero VAD integration, so silent segments never reach the model. For typical meeting recordings with 30-40% silence, skipping them raises effective throughput by roughly 1.4-1.7x (1 / (1 - silence fraction)).
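Enabling the filter is a matter of two arguments to `transcribe`. In this sketch, `model` is assumed to be a `faster_whisper.WhisperModel` instance, and the 500 ms minimum silence duration is an illustrative value rather than a recommendation.

```python
def transcribe_with_vad(model, path, min_silence_ms=500):
    """Transcribe with faster-whisper's built-in Silero VAD filter enabled.

    model is assumed to be a faster_whisper.WhisperModel instance.
    """
    segments, _info = model.transcribe(
        path,
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": min_silence_ms},
    )
    # segments is a generator; joining the text forces the transcription to run.
    return " ".join(segment.text.strip() for segment in segments)
```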

Batch short files together. If you're transcribing thousands of 30-second voicemails, batch them into longer pseudo-files to amortize model inference overhead. The GPU setup cost per chunk is fixed; longer chunks are more efficient.
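Batching only works if you can map timestamps in the combined audio back to the original clips. The sketch below plans the batches and keeps that mapping. The 10-minute batch limit and 1-second silent gap are assumptions, and both helpers are hypothetical.

```python
def plan_batches(durations, max_batch_seconds=600.0, gap_seconds=1.0):
    """Group clips into batches and record each clip's offset inside its batch.

    durations: list of (clip_id, seconds). Returns a list of batches, each a
    list of (clip_id, start_offset, end_offset).
    """
    batches, current, cursor = [], [], 0.0
    for clip_id, seconds in durations:
        if current and cursor + seconds > max_batch_seconds:
            batches.append(current)
            current, cursor = [], 0.0
        current.append((clip_id, cursor, cursor + seconds))
        cursor += seconds + gap_seconds  # silent gap keeps clips separable
    if current:
        batches.append(current)
    return batches


def clip_for_timestamp(batch, t):
    """Map a timestamp in the concatenated audio back to its source clip."""
    for clip_id, start, end in batch:
        if start <= t < end:
            return clip_id
    return None  # timestamp falls in a gap between clips
```

Because Whisper can condition each window on previously decoded text, unrelated clips in one batch may influence each other. Passing `condition_on_previous_text=False` to faster-whisper's `transcribe` is one way to limit that.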


Transcribe audio at scale on io.net: faster-whisper on RTX 4090 for $0.002 per audio hour.