Quick Answer
Yes, io.net is optimized for AI inference workloads, from small-scale API serving to high-throughput production deployments. You can run LLM inference (vLLM, TensorRT-LLM), image generation (Stable Diffusion, ComfyUI), speech recognition (Whisper), and custom models with autoscaling from 1 to 100+ GPUs. RTX 4090 GPUs at $0.18/hr provide excellent price-performance for inference (about 58% of H100 SXM FP16 throughput on Llama 3 8B, 82 vs. 142 tokens/sec, at 8% of the hourly cost), while H100s excel at high-concurrency serving. io.net's per-second billing and fast spin-up (<60 seconds) suit variable traffic patterns: scale up to 20 GPUs during peak hours, scale down to 2 GPUs overnight, and pay only for actual usage. Expect 75% cost savings vs. AWS SageMaker for equivalent inference throughput.
Inference Use Cases Supported
Large Language Model (LLM) Inference:
- Llama 3 8B/70B, Mistral 7B, GPT-style models
- Optimizations: vLLM, TensorRT-LLM, Text Generation Inference (TGI)
- Throughput: 50-500 tokens/sec depending on GPU and model size
- Cost: $0.00008-0.0003 per 1K tokens (vs. $0.0006-0.002 on AWS)
Image Generation:
- Stable Diffusion XL, SDXL Turbo, ControlNet
- ComfyUI and Automatic1111 one-click deployments
- Throughput: 0.6-2.5 images/sec on RTX 4090
- Cost: $0.001-0.003 per image (vs. $0.008-0.02 on Replicate)
Speech-to-Text:
- Whisper Large V3, Faster Whisper
- Real-time transcription and batch processing
- Throughput: 10-50x real-time on H100
- Cost: $0.0001-0.0005 per minute of audio
Computer Vision:
- Object detection (YOLO, DETR), segmentation, classification
- Video understanding and analysis
- Throughput: 30-500 FPS depending on model and GPU
- Cost: $0.00001-0.0001 per frame
Multimodal Models:
- CLIP, BLIP, LLaVA for vision-language tasks
- Text-to-image, image-to-text, visual question answering
- Cost-efficient deployment on RTX 4090 or L40S
GPU Recommendations for Inference
Choose the right GPU based on model size, throughput needs, and budget:
For LLM Inference (Llama 3 8B):
| GPU | Tokens/sec (FP16) | Tokens/sec (INT8) | Cost/hr | Cost per 1K tokens |
|---|---|---|---|---|
| H100 SXM | 142 | 385 | $2.20 | $0.00157 |
| H100 PCIe | 118 | 320 | $1.49 | $0.00129 |
| A100 80GB | 95 | 178 | $1.49 | $0.00169 |
| L40S | 87 | 165 | $0.75 | $0.00086 |
| RTX 4090 | 82 | 156 | $0.18 | $0.00023 ⭐ |
Best value: RTX 4090 at $0.00023 per 1K tokens (7x cheaper than H100)
For Llama 3 70B Inference:
| GPU | Tokens/sec (FP16) | Tokens/sec (INT8) | Cost/hr | Fit in VRAM? |
|---|---|---|---|---|
| H100 SXM | 48 | 125 | $2.20 | ✅ INT8 (80GB); FP16 weights (~140GB) need 2+ GPUs |
| H100 PCIe | 41 | 108 | $1.49 | ✅ INT8 (80GB); FP16 weights (~140GB) need 2+ GPUs |
| A100 80GB | 32 | 74 | $1.49 | ✅ INT8 (80GB); FP16 weights (~140GB) need 2+ GPUs |
| RTX 4090 | N/A | ~12 (quantized) | $0.18 | ⚠️ Requires quantization and multiple GPUs (24GB each) |
Best choice: H100 PCIe at $1.49/hr for production 70B serving
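Whether a model fits is driven mostly by weight size: parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only. It counts weights and ignores KV cache, activations, and runtime overhead, so real requirements are higher. `weight_memory_gb` and `fits` are illustrative helpers, not part of any io.net or vLLM API.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param


def fits(params_billions: float, bits_per_param: int, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM (no KV-cache headroom)."""
    return weight_memory_gb(params_billions, bits_per_param) <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(70, bits)
        print(f"Llama 3 70B at {bits}-bit: ~{size:.0f} GB of weights, "
              f"fits 80GB: {fits(70, bits, 80)}, fits 24GB: {fits(70, bits, 24)}")
```

In practice, leave generous headroom on top of this number for the KV cache, which grows with context length and concurrency.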
For Stable Diffusion XL:
| GPU | Images/sec (512x512) | Images/sec (1024x1024) | Cost/hr | Cost per 1000 images (512x512) |
|---|---|---|---|---|
| H100 | 2.5 | 1.2 | $2.20 | $0.24 |
| RTX 4090 | 1.7 | 0.8 | $0.18 | $0.03 ⭐ |
| A100 | 1.4 | 0.7 | $1.49 | $0.30 |
| RTX 3090 | 0.9 | 0.4 | $0.28 | $0.09 |
Best value: RTX 4090 at $0.03 per 1000 images (8x cheaper than H100)
Inference Optimization Frameworks
io.net provides pre-optimized containers for maximum inference performance:
vLLM (Recommended for LLMs):
# Launch vLLM-optimized Llama 3 8B serving
io launch --gpu RTX4090 --image vllm/vllm-openai:latest
# Inside container:
vllm serve meta-llama/Meta-Llama-3-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --tensor-parallel-size 1
# Query the OpenAI-compatible API endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "Explain quantum computing",
    "max_tokens": 256
  }'
# Throughput: ~82 tokens/sec on RTX 4090
# Latency: ~15ms time-to-first-token
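The same endpoint can be called from application code. The sketch below uses only the Python standard library and assumes a vLLM server is reachable at the base URL you pass in. `build_completion_request` and `complete` are hypothetical helper names, and the response parsing assumes the standard OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> str:
    """Send the request and return the text of the first completion."""
    req = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

For example, `complete("http://localhost:8000", "<served model name>", "Explain quantum computing")` would return the generated text, provided the model name matches the one the server was started with.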
TensorRT-LLM (Maximum Performance):
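# Typically faster than vLLM, but requires converting the model and building an engine
io launch --gpu H100 --image tensorrt-llm:latest
# Convert the Hugging Face checkpoint (script lives in TensorRT-LLM's examples/llama directory)
python convert_checkpoint.py \
  --model_dir /models/Meta-Llama-3-8B \
  --output_dir /models/llama3-ckpt \
  --dtype float16
# Build the TensorRT engine from the converted checkpoint
trtllm-build \
  --checkpoint_dir /models/llama3-ckpt \
  --output_dir /models/llama3-trt \
  --max_batch_size 128
# Serve with Triton (tensorrtllm_backend); the model repo path is a placeholder and must reference /models/llama3-trt
python3 scripts/launch_triton_server.py \
  --world_size 1 \
  --model_repo /models/triton_model_repo
# Throughput: ~180 tokens/sec on H100 (vs. 142 with vLLM)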
Text Generation Inference (TGI):
# Hugging Face's optimized inference server
io launch --gpu A100 --image ghcr.io/huggingface/text-generation-inference:latest
# --gpus all exposes the GPU to the container; --shm-size gives it shared memory
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  --env MODEL_ID=meta-llama/Meta-Llama-3-8B \
  ghcr.io/huggingface/text-generation-inference:latest
# Easy deployment, good performance (80-90% of vLLM)
Triton Inference Server (Multi-Model):
# Serve multiple models on one GPU
io launch --gpu L40S --image nvcr.io/nvidia/tritonserver:latest
# Serve Llama 3 8B + Stable Diffusion + Whisper simultaneously
# Dynamic batching and model routing
Autoscaling for Variable Traffic
io.net's per-second billing makes autoscaling cost-effective:
Example: LLM API with Variable Traffic
# Auto-scale from 2 to 20 replicas (one GPU each) based on ongoing requests per replica
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 20,
        "target_ongoing_requests": 10,
    },
    ray_actor_options={"num_gpus": 1},
)
class LlamaInference:
    def __init__(self):
        # Import inside the replica so vLLM initializes on the GPU worker
        from vllm import LLM
        self.model = LLM("meta-llama/Meta-Llama-3-8B")

    def __call__(self, prompt: str) -> str:
        # generate() returns one RequestOutput per prompt; return its text
        outputs = self.model.generate([prompt])
        return outputs[0].outputs[0].text


# serve.run returns a handle; call it with handle.remote("...").result()
handle = serve.run(LlamaInference.bind())
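As a rough mental model for the autoscaling settings above: Ray Serve tries to keep the number of in-flight requests per replica near the configured target, within the min/max bounds. The sketch below is a simplified illustration of that relationship only. The real autoscaler also applies smoothing and upscale/downscale delays, and `desired_replicas` is a hypothetical helper, not a Ray API.

```python
import math


def desired_replicas(total_ongoing_requests: int,
                     target_ongoing_requests: int = 10,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Replicas needed so each one handles about the target number of requests."""
    needed = math.ceil(total_ongoing_requests / target_ongoing_requests)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for load in (5, 60, 95, 180, 500):
        print(f"{load:>3} in-flight requests -> {desired_replicas(load)} replicas")
```

A lower target keeps per-replica queues short (better latency, more GPUs); a higher target packs more requests onto each GPU (cheaper, higher latency under load).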
Cost Breakdown (RTX 4090 at $0.18/hr):
| Time | Traffic | GPUs Active | Cost/hr | Daily Cost |
|---|---|---|---|---|
| 9am-5pm (peak) | 5000 req/hr | 18 GPUs | $3.24 | $25.92 (8hr) |
| 5pm-midnight | 1000 req/hr | 6 GPUs | $1.08 | $7.56 (7hr) |
| midnight-9am | 200 req/hr | 2 GPUs | $0.36 | $3.24 (9hr) |
| Total | | Avg: 8.5 GPUs | | $36.72/day |
vs. Running 18 GPUs 24/7:
- Fixed 18 GPUs: $0.18 × 18 × 24 = $77.76/day
- Autoscaling: $36.72/day
- Savings: $41.04/day (53%)
vs. AWS SageMaker Autoscaling:
- AWS ml.g5.xlarge (A10G): $1.01/hr × avg 12 instances × 24hr = $291/day
- io.net: $36.72/day
- Savings: $254.28/day (87%)
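The daily figures above follow directly from GPU count, hours, and the hourly rate. The sketch below reproduces that arithmetic so you can substitute your own traffic schedule; the schedule values are copied from the cost breakdown table, and the function names are illustrative.

```python
GPU_HOURLY_RATE = 0.18  # RTX 4090 price quoted in this article, $/hr

# (period, GPUs active, hours per day), taken from the cost breakdown table
SCHEDULE = [
    ("9am-5pm (peak)", 18, 8),
    ("5pm-midnight", 6, 7),
    ("midnight-9am", 2, 9),
]


def autoscaled_daily_cost(schedule, rate):
    """Sum of GPUs x hours x rate over every period of the day."""
    return sum(gpus * hours * rate for _, gpus, hours in schedule)


def fixed_daily_cost(gpus, rate, hours=24):
    """Cost of keeping a fixed number of GPUs running all day."""
    return gpus * hours * rate


def average_gpus(schedule):
    """Time-weighted average number of active GPUs."""
    total_hours = sum(hours for _, _, hours in schedule)
    return sum(gpus * hours for _, gpus, hours in schedule) / total_hours


if __name__ == "__main__":
    auto = autoscaled_daily_cost(SCHEDULE, GPU_HOURLY_RATE)
    fixed = fixed_daily_cost(18, GPU_HOURLY_RATE)
    print(f"Autoscaled: ${auto:.2f}/day, fixed: ${fixed:.2f}/day, "
          f"savings: {100 * (fixed - auto) / fixed:.0f}%, "
          f"average GPUs: {average_gpus(SCHEDULE):.1f}")
```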
Real-World Inference Deployments
Case Study 1: Chatbot Startup (Llama 3 8B)
- Traffic: 50K requests/day (peak 200 req/min, avg 35 req/min)
- Model: Llama 3 8B with vLLM
- Configuration: 4-12x RTX 4090 with autoscaling
- Average usage: 6 GPUs (autoscale based on queue depth)
- Throughput: ~500 tokens/sec aggregate (6 GPUs × 82 tokens/sec)
- Monthly cost: $0.18/hr × 6 GPUs × 730 hrs = $788/month
- AWS equivalent: SageMaker with 6x ml.g5.xlarge = $4,424/month
- Savings: $3,636/month (82%)
Case Study 2: Image Generation SaaS (Stable Diffusion XL)
- Traffic: 100K images/day (variable: 1K/hr night, 8K/hr peak)
- Model: SDXL with ComfyUI
- Configuration: 2-16x RTX 4090 with autoscaling
- Average usage: 8 GPUs
- Throughput: ~14 images/sec aggregate (8 GPUs × 1.7 images/sec)
- Monthly cost: $0.18/hr × 8 GPUs × 730 hrs = $1,051/month
- Replicate equivalent: ~$0.015/image × 3M images/month = $45,000/month
- Savings: $43,949/month (98%)
Case Study 3: Video Transcription Service (Whisper Large V3)
- Traffic: 50,000 minutes of audio/day
- Model: Faster Whisper (optimized)
- Configuration: 4x L40S GPUs
- Throughput: 30x real-time (process 30 min of audio in 1 minute)
- Daily processing: 4 GPUs × 30x × 1440 min/day = 172,800 minutes capacity
- Actual usage: 50K min/day = 29% utilization
- Monthly cost: $0.75/hr × 4 GPUs × 730 hrs = $2,190/month
- AWS Transcribe: $0.024/min × 1.5M min/month = $36,000/month
- Savings: $33,810/month (94%)
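The capacity and utilization numbers in the transcription case study can be checked with a few lines of arithmetic. This is a sketch using the figures quoted above (4 GPUs, 30x real-time, 730 hours per month); the helper names are illustrative, and the model assumes each GPU sustains its real-time factor around the clock.

```python
def daily_capacity_minutes(gpus: int, realtime_factor: float,
                           minutes_per_day: int = 1440) -> float:
    """Audio minutes the fleet can process per day at a given real-time factor."""
    return gpus * realtime_factor * minutes_per_day


def utilization(demand_minutes: float, capacity_minutes: float) -> float:
    """Fraction of capacity consumed by actual demand."""
    return demand_minutes / capacity_minutes


def monthly_cost(hourly_rate: float, gpus: int, hours_per_month: int = 730) -> float:
    """Cost of running a fixed number of GPUs for a month."""
    return hourly_rate * gpus * hours_per_month


if __name__ == "__main__":
    capacity = daily_capacity_minutes(gpus=4, realtime_factor=30)
    print(f"Capacity: {capacity:,.0f} min/day")
    print(f"Utilization at 50K min/day: {utilization(50_000, capacity):.0%}")
    print(f"Monthly cost on 4x L40S: ${monthly_cost(0.75, 4):,.0f}")
```

At 29% utilization there is room to cut cost further by running fewer GPUs or scaling down outside busy hours.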
Latency and Performance Optimization
Latency Breakdown (Llama 3 8B on RTX 4090):
| Metric | vLLM | TensorRT-LLM | Optimization |
|---|---|---|---|
| Time to First Token (TTFT) | 15ms | 8ms | Lower is better for chat |
| Inter-token Latency | 12ms | 7ms | Affects streaming experience |
| End-to-End (256 tokens) | 3.2s | 1.9s | Overall request time |
| Throughput (tokens/sec) | 82 | 135 | Batch processing capacity |
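The end-to-end row can be approximated from the two rows above it: one time-to-first-token, then one inter-token gap for each remaining token. The sketch below is a simple single-request model under that assumption. It ignores queuing, network time, and tokenization, which is presumably why it lands slightly below the table's 3.2s and 1.9s.

```python
def end_to_end_seconds(ttft_ms: float, inter_token_ms: float,
                       output_tokens: int) -> float:
    """Estimate request time: TTFT, then one inter-token gap per remaining token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    total_ms = ttft_ms + inter_token_ms * (output_tokens - 1)
    return total_ms / 1000


if __name__ == "__main__":
    # Figures from the latency table (Llama 3 8B on RTX 4090)
    print(f"vLLM:         {end_to_end_seconds(15, 12, 256):.2f}s")
    print(f"TensorRT-LLM: {end_to_end_seconds(8, 7, 256):.2f}s")
```

Because generation time scales with output length, capping `max_tokens` is often the most direct lever on end-to-end latency.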
Optimizations for Low Latency:
- Enable Flash Attention 2
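# Up to 2x faster attention for transformers. vLLM selects a FlashAttention
# backend automatically on supported GPUs; set the variable below to force it.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
from vllm import LLM
model = LLM("meta-llama/Meta-Llama-3-8B")
- Use INT8 Quantization
# Up to 2x throughput with minimal quality loss. vLLM loads pre-quantized
# checkpoints; the path below is a placeholder for an 8-bit GPTQ export.
model = LLM("/models/llama3-8b-gptq-int8", quantization="gptq")
- Tune Max Batch Size
# Balance latency vs. throughput
model = LLM("meta-llama/Meta-Llama-3-8B", max_num_batched_tokens=8192)
# Lower = lower latency, higher = higher throughput
- Regional Deployment
# Deploy GPUs close to users
io launch --gpu RTX4090 --region us-west # West Coast users
io launch --gpu RTX4090 --region eu-west # European users
# Reduces network latency by 50-150ms
- Use Speculative Decoding
# Draft model generates, main model verifies (1.5-2x speedup).
# vLLM 0.4.x also requires num_speculative_tokens and use_v2_block_manager.
model = LLM("meta-llama/Meta-Llama-3-70B", speculative_model="meta-llama/Meta-Llama-3-8B",
            num_speculative_tokens=5, use_v2_block_manager=True)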
Batch Inference for High Throughput
For non-interactive workloads (batch processing, embeddings), maximize throughput:
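# Process 10,000 prompts in batch
from vllm import LLM, SamplingParams
model = LLM("meta-llama/Meta-Llama-3-8B")
# documents: a list of 10K input strings, loaded elsewhere
prompts = [f"Summarize: {text}" for text in documents]
# Generation limits are passed through SamplingParams, not as keyword arguments
outputs = model.generate(prompts, SamplingParams(max_tokens=128))
# Throughput: ~2,000 prompts/hr on single RTX 4090
# Cost: $0.18/hr ÷ 2000 = $0.00009 per summary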
Batch Inference Pricing:
| Workload | GPU | Throughput | Cost/hr | Cost per 1000 Outputs |
|---|---|---|---|---|
| Text summarization (128 tokens) | RTX 4090 | 2000/hr | $0.18 | $0.09 |
| Embeddings (sentence-transformers) | RTX 4090 | 50K/hr | $0.18 | $0.0036 |
| Image classification (ResNet-50) | RTX 4090 | 100K/hr | $0.18 | $0.0018 |
| SD image generation (512x512) | RTX 4090 | 6,000/hr | $0.18 | $0.03 |
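Each value in the last column is the hourly rate divided by hourly throughput, scaled to 1,000 outputs. The sketch below reproduces the table's figures and shows how to convert a per-second throughput into the same unit; the function names are illustrative.

```python
def per_second_to_per_hour(outputs_per_second: float) -> float:
    """Convert a per-second throughput into outputs per hour."""
    return outputs_per_second * 3600


def cost_per_1000(hourly_rate: float, outputs_per_hour: float) -> float:
    """Dollars per 1,000 outputs at a given hourly GPU rate and throughput."""
    return hourly_rate / outputs_per_hour * 1000


if __name__ == "__main__":
    rate = 0.18  # RTX 4090, $/hr
    workloads = {
        "Text summarization": 2_000,
        "Embeddings": 50_000,
        "Image classification": 100_000,
        "SD image generation": 6_000,
    }
    for name, per_hour in workloads.items():
        print(f"{name}: ${cost_per_1000(rate, per_hour):.4f} per 1,000 outputs")
```

These figures assume the GPU is fully utilized for the whole hour; idle time raises the effective cost per output proportionally.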
Multi-GPU Inference for High Concurrency
Serve large models across multiple GPUs:
# Llama 3 70B across 2x H100 (tensor parallelism)
io launch --gpu H100 --count 2 --network nvlink
# vLLM with tensor parallelism
vllm serve meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 2 \
  --max-model-len 8192
# Throughput: 96 tokens/sec (2x H100)
# Cost: $4.40/hr (2x $2.20)
# vs. 4x RTX 4090 (quantized): ~48 tokens/sec, $0.72/hr
When to use multi-GPU inference:
- Models too large for a single GPU's VRAM (Llama 3 70B in FP16 needs ~140GB for weights alone)
- Extremely high concurrency (100+ simultaneous users)
- Low latency requirements (<10ms TTFT)
When to use multiple single-GPU instances:
- Better price-performance for most workloads
- Easier autoscaling (scale 1 GPU at a time)
- Fault tolerance (one GPU failure doesn't take down entire service)
Monitoring and Observability
Track inference performance in real-time:
# io.net dashboard shows:
# - Requests per second
# - Tokens per second
# - GPU utilization
# - Cost per 1K tokens
# - Latency p50/p95/p99
io dashboard io-llama-inference-cluster
Integrate with monitoring tools:
# Prometheus metrics
from prometheus_client import Counter, Histogram
requests_total = Counter('inference_requests_total', 'Total requests')
latency = Histogram('inference_latency_seconds', 'Request latency')
@latency.time()
def run_inference(prompt):
    requests_total.inc()
    return model.generate(prompt)
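If you collect raw latency samples yourself, the p50/p95/p99 values that dashboards report can be computed with the standard library. This is a minimal sketch for offline analysis; `latency_percentiles` is an illustrative helper, and production systems usually derive percentiles from histogram buckets instead of keeping every sample.

```python
import statistics


def latency_percentiles(samples_s: list) -> dict:
    """Return p50/p95/p99 from a list of request latencies in seconds."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    # Synthetic example: mostly fast requests with a slow tail
    samples = [0.8] * 90 + [1.5] * 8 + [4.0] * 2
    print(latency_percentiles(samples))
```

Tail percentiles (p95, p99) tend to be more useful than averages for spotting queuing problems during traffic spikes.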
Related Questions
What's the cheapest way to run LLM inference?
RTX 4090 at $0.18/hr provides the best price-performance: 82 tokens/sec for Llama 3 8B, or $0.00023 per 1K tokens. This is 7x cheaper than H100 and 20-30x cheaper than AWS SageMaker or OpenAI API ($0.005-0.015 per 1K tokens). For 70B models, quantized inference on 4x RTX 4090 ($0.72/hr) is more cost-effective than 1x H100 ($2.20/hr) if you can tolerate slight quality loss from quantization.
Can io.net handle real-time inference with <100ms latency?
Yes, for time-to-first-token. TTFT is 8-15ms for optimized LLMs on H100/RTX 4090. Total latency (TTFT + generation) for 256-token responses is 1.9-3.2 seconds, so end-to-end time is dominated by output length. For ultra-low end-to-end latency (<50ms), keep responses to a few tokens, use smaller models (Llama 3 8B vs. 70B), enable speculative decoding, and deploy GPUs in the same region as your users. For <10ms TTFT, consider TensorRT-LLM with FP8 on H100.
How does inference pricing compare to OpenAI/Anthropic APIs?
io.net self-hosted inference is 3.5-6.5x cheaper than budget API models and 20-60x cheaper than premium models priced at $0.005-0.015 per 1K tokens, for high-volume workloads. Llama 3 8B on RTX 4090: $0.00023 per 1K tokens. OpenAI GPT-3.5 Turbo: $0.0015 per 1K tokens (6x more expensive). Claude 3 Haiku: $0.0008 per 1K tokens (3.5x more expensive). Break-even point: ~1-5M tokens/month. Below that, API services are cheaper because there is no infrastructure to manage. Above that, self-hosted wins.
Can I run Stable Diffusion inference at scale on io.net?
Yes. RTX 4090 generates SDXL images at 1.7 images/sec (512x512) or 0.8 images/sec (1024x1024) for $0.18/hr. At full utilization this translates to about $0.03 per 1,000 images at 512x512. At realistic utilization with autoscaling, it is 10-50x cheaper than Replicate ($0.015/image) or Stability AI API ($0.02/image). For 100K images/day, use 8-16x RTX 4090 with autoscaling (~$1,000-2,000/month vs. $45,000/month on Replicate).
Do I need to manage inference serving infrastructure myself?
Partially. io.net provides the GPU infrastructure, but you deploy your own inference server (vLLM, TGI, Triton). Use pre-built containers for quick setup: io launch --gpu RTX4090 --image vllm/vllm-openai:latest gives you an OpenAI-compatible API in 60 seconds. For fully managed inference, consider io.net's Enterprise Managed Inference (launching Q3 2026) which handles deployment, monitoring, and autoscaling for you.
Deploy Inference Workloads on io.net
Start serving AI models 70% cheaper than AWS:
- RTX 4090 at $0.18/hr - Best price-performance for inference
- vLLM pre-configured - OpenAI-compatible API in 60 seconds
- Autoscaling - Scale from 1 to 100+ GPUs based on traffic
- Per-second billing - Pay only for active inference time
Last updated: April 2026 | Inference benchmarks based on vLLM 0.4.2 with optimized settings
