FAQ: How Do I Run Whisper Speech-to-Text at Scale on Cloud GPUs?

OpenAI's Whisper is the gold standard for open-source speech-to-text, and running it at scale on GPU cloud is surprisingly affordable. A single RTX 4090 on io.net ($0.18/hr) transcribes audio at 30-50x real-time speed with Whisper large-v3 — meaning one hour of audio takes about 90 seconds to process. At that rate, you can transcribe 960 hours of audio per day on a single $0.18/hr GPU.

The trick to scaling Whisper isn't just throwing more GPUs at it. It's choosing the right model size, using faster-whisper (CTranslate2 backend) instead of the vanilla OpenAI implementation, and batching your audio pipeline intelligently.

Choosing the Right Whisper Model

Not every audio file needs the biggest model:

Model	Parameters	VRAM	Speed (RTX 4090)	WER (English)	Best For
tiny	39M	<1GB	180x real-time	7.6%	Quick previews, low-quality audio
base	74M	<1GB	140x real-time	5.4%	Casual transcription
small	244M	~2GB	90x real-time	3.4%	Good quality, high throughput
medium	769M	~5GB	50x real-time	2.9%	Professional use
large-v3	1.55B	~10GB	30x real-time	2.0%	Maximum accuracy

For most production use cases, small or medium deliver the best speed-to-accuracy tradeoff. large-v3 is worth it only when transcription quality is paramount (medical, legal, media production).

The faster-whisper Advantage

Drop-in replacement, 3-4x faster. The faster-whisper library uses CTranslate2 under the hood, which applies INT8 quantization and optimized kernels:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Performance comparison on RTX 4090:

Implementation	Model	Speed	VRAM
OpenAI whisper	large-v3	30x real-time	10GB
faster-whisper (FP16)	large-v3	70x real-time	5GB
faster-whisper (INT8)	large-v3	95x real-time	3GB

With faster-whisper INT8, a single RTX 4090 transcribes ~2,280 hours of audio per day. That's the entire podcast output of a major media company, for $4.32/day.

Scaling Architecture

For batch processing (podcasts, meeting recordings, archives):

Audio Files → S3/GCS → Job Queue (Redis) → GPU Workers → Transcripts → Storage

Each worker pulls audio from the queue, transcribes it, pushes results to storage. Scale workers based on queue depth.

Sizing:
- 10K hours/day: 5x RTX 4090 ($21.60/day)
- 100K hours/day: 45x RTX 4090 ($194.40/day)
- 1M hours/day: 450x RTX 4090 ($1,944/day) — or use cheaper GPUs

For real-time streaming (live captions, call centers):

Audio Stream → Chunker (30s segments) → GPU Pool → WebSocket → Client

Real-time needs sub-second latency. Use small or medium models for speed, dedicate one GPU per 20-40 concurrent streams, and chunk audio into 30-second segments with 2-second overlap for context continuity.

Sizing:
- 100 concurrent streams: 3-5x RTX 4090 ($12.96-$21.60/day)
- 1,000 concurrent streams: 25-50x RTX 4090 ($108-$216/day)

Cost Comparison: Self-Hosted vs API

Provider	Price per hour of audio	10K hours/month
OpenAI Whisper API	$0.36/hr	$3,600
Google Speech-to-Text	$0.96-$1.44/hr	$9,600-$14,400
AWS Transcribe	$0.72-$1.44/hr	$7,200-$14,400
io.net (faster-whisper, RTX 4090)	$0.0019/hr	$19

Self-hosting on io.net is 190x cheaper than OpenAI's API and 500x cheaper than Google. Even accounting for engineering time to set up the pipeline, the ROI is obvious above ~100 hours of audio per month.

Production Tips

Pre-process audio before sending to GPU. Convert to 16kHz mono WAV, trim silence, normalize volume. This reduces processing time by 10-20% and improves accuracy.

Use VAD (Voice Activity Detection) to skip silence. faster-whisper includes Silero VAD integration. Skip silent segments entirely — for typical meeting recordings with 30-40% silence, this nearly doubles effective throughput.

Batch short files together. If you're transcribing thousands of 30-second voicemails, batch them into longer pseudo-files to amortize model inference overhead. The GPU setup cost per chunk is fixed; longer chunks are more efficient.

Transcribe audio at scale on io.net — faster-whisper on RTX 4090 for $0.002 per audio hour. Deploy now