Quick Answer

Yes, io.net is optimized for AI inference workloads, from small-scale API serving to high-throughput production deployments. You can run LLM inference (vLLM, TensorRT-LLM), image generation (Stable Diffusion, ComfyUI), speech recognition (Whisper), and custom models with autoscaling from 1 to 100+ GPUs. RTX 4090 GPUs at $0.18/hr provide excellent price-performance for inference (about 58% of H100 SXM throughput on Llama 3 8B at 8% of the hourly cost), while H100s excel at high-concurrency serving. io.net's per-second billing and fast spin-up (<60 seconds) make it well suited to variable traffic patterns: scale up to 20 GPUs during peak hours, scale down to 2 GPUs overnight, and pay only for actual usage. Expect 75% cost savings vs. AWS SageMaker for equivalent inference throughput.

Inference Use Cases Supported

Large Language Model (LLM) Inference:
- Llama 3 8B/70B, Mistral 7B, GPT-style models
- Optimizations: vLLM, TensorRT-LLM, Text Generation Inference (TGI)
- Throughput: 50-500 tokens/sec depending on GPU and model size
- Cost: $0.00008-0.0003 per 1K tokens (vs. $0.0006-0.002 on AWS)

Image Generation:
- Stable Diffusion XL, SDXL Turbo, ControlNet
- ComfyUI and Automatic1111 one-click deployments
- Throughput: 0.8-1.7 images/sec on RTX 4090, up to 2.5 images/sec on H100
- Cost: $0.001-0.003 per image (vs. $0.008-0.02 on Replicate)

Speech-to-Text:
- Whisper Large V3, Faster Whisper
- Real-time transcription and batch processing
- Throughput: 10-50x real-time on H100
- Cost: $0.0001-0.0005 per minute of audio

Computer Vision:
- Object detection (YOLO, DETR), segmentation, classification
- Video understanding and analysis
- Throughput: 30-500 FPS depending on model and GPU
- Cost: $0.00001-0.0001 per frame

Multimodal Models:
- CLIP, BLIP, LLaVA for vision-language tasks
- Text-to-image, image-to-text, visual question answering
- Cost-efficient deployment on RTX 4090 or L40S

GPU Recommendations for Inference

Choose the right GPU based on model size, throughput needs, and budget:

For LLM Inference (Llama 3 8B):

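| GPU | Tokens/sec (FP16) | Tokens/sec (INT8) | Cost/hr | Cost per 1M tokens (FP16) |
| --- | --- | --- | --- | --- |
| H100 SXM | 142 | 385 | $2.20 | $4.30 |
| H100 PCIe | 118 | 320 | $1.49 | $3.51 |
| A100 80GB | 95 | 178 | $1.49 | $4.36 |
| L40S | 87 | 165 | $0.75 | $2.39 |
| RTX 4090 | 82 | 156 | $0.18 | $0.61 |

Cost per 1M tokens = Cost/hr ÷ (tokens/sec × 3,600) × 1,000,000, assuming full utilization.

Best value: RTX 4090 at $0.61 per 1M tokens (7x cheaper than H100 SXM)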
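
Cost per token can be derived from throughput and hourly price. The sketch below shows that arithmetic under two assumptions that are not stated in the table: single-stream FP16 throughput and 100% GPU utilization. The function name is illustrative and not part of any io.net tooling.

```python
def cost_per_million_tokens(tokens_per_sec: float, price_per_hour: float) -> float:
    """Dollars per 1M generated tokens, assuming the GPU is busy the whole hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# (FP16 tokens/sec, $/hr) taken from the Llama 3 8B table above
gpus = {
    "H100 SXM": (142, 2.20),
    "RTX 4090": (82, 0.18),
}

for name, (tokens_per_sec, price) in gpus.items():
    cost = cost_per_million_tokens(tokens_per_sec, price)
    print(f"{name}: ${cost:.2f} per 1M tokens")
```

With these inputs the H100 SXM comes out near $4.30 per 1M tokens and the RTX 4090 near $0.61, roughly a 7x gap. Real utilization below 100% raises both figures proportionally.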

For Llama 3 70B Inference:

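| GPU | Tokens/sec (FP16) | Tokens/sec (INT8) | Cost/hr | Fit in VRAM? |
| --- | --- | --- | --- | --- |
| H100 SXM | 48 | 125 | $2.20 | ✅ Yes with INT8 (80GB) |
| H100 PCIe | 41 | 108 | $1.49 | ✅ Yes with INT8 (80GB) |
| A100 80GB | 32 | 74 | $1.49 | ✅ Yes with INT8 (80GB) |
| RTX 4090 | N/A | ~12 (quantized) | $0.18 | ⚠️ Requires quantization (24GB) |

Note: Llama 3 70B weights take about 140GB in FP16 (70B parameters × 2 bytes), so FP16 serving needs at least two 80GB GPUs with tensor parallelism. Read the FP16 column as per-GPU throughput in such a setup, which is consistent with the 96 tokens/sec quoted for 2x H100 in the multi-GPU section.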

Best choice: H100 PCIe at $1.49/hr for production 70B serving
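
Whether a model fits on a GPU is mostly a question of weight size versus VRAM. The sketch below is a rough estimate only: it counts weights and reserves a fixed fraction of VRAM for the KV cache and activations. The 10% headroom and both function names are assumptions made for illustration.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone (no KV cache, no activations)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


def fits(params_billion: float, bytes_per_param: float,
         vram_gb: float, usable_fraction: float = 0.9) -> bool:
    """True if the weights fit in the usable share of VRAM.

    usable_fraction is an assumed headroom for KV cache and runtime overhead.
    """
    return weight_memory_gb(params_billion, bytes_per_param) <= vram_gb * usable_fraction


print(weight_memory_gb(70, 2))  # Llama 3 70B in FP16 (2 bytes per parameter)
print(fits(70, 2, 80))          # FP16 on one 80GB GPU
print(fits(70, 2, 160))         # FP16 sharded across two 80GB GPUs
print(fits(70, 1, 80))          # INT8 (1 byte per parameter) on one 80GB GPU
```

Under these assumptions the 70B model needs two 80GB GPUs in FP16 and squeezes onto one only when quantized to 8 bits, with little room left for long contexts or large batches.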

For Stable Diffusion XL:

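| GPU | Images/sec (512x512) | Images/sec (1024x1024) | Cost/hr | Cost per 1,000 images (1024x1024) |
| --- | --- | --- | --- | --- |
| H100 | 2.5 | 1.2 | $2.20 | $0.51 |
| RTX 4090 | 1.7 | 0.8 | $0.18 | $0.06 |
| A100 | 1.4 | 0.7 | $1.49 | $0.59 |
| RTX 3090 | 0.9 | 0.4 | $0.28 | $0.19 |

Cost per 1,000 images = Cost/hr ÷ (images/sec × 3,600) × 1,000, assuming full utilization.

Best value: RTX 4090 at about $0.06 per 1,000 images (8x cheaper than H100)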

Inference Optimization Frameworks

io.net provides pre-optimized containers for maximum inference performance:

vLLM (Recommended for LLMs):

# Launch vLLM-optimized Llama 3 8B serving
io launch --gpu RTX4090 --image vllm/vllm-openai:latest

# Inside container:
vllm serve meta-llama/Meta-Llama-3-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --tensor-parallel-size 1

# OpenAI-compatible API endpoint
# ("model" must match the name the server was started with)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "Explain quantum computing",
    "max_tokens": 256
  }'

# Throughput: ~82 tokens/sec on RTX 4090
# Latency: ~15ms time-to-first-token
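
The same request can be sent from application code. This is a minimal sketch using only the Python standard library; it assumes the server exposes the OpenAI-style `/v1/completions` route shown in the curl call, and the helper names are illustrative rather than part of vLLM or io.net.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str,
             max_tokens: int = 256, timeout: float = 60.0) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example (requires a running server; served_model_name must match the
# model name the server was launched with):
# print(complete("http://localhost:8000", served_model_name, "Explain quantum computing"))
```

Because the endpoint follows the OpenAI wire format, an existing OpenAI client library pointed at the same base URL should also work.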

TensorRT-LLM (Maximum Performance):

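# Faster than vLLM (about 1.3x on H100 in the figures below), but requires model conversion
io launch --gpu H100 --image tensorrt-llm:latest

# Convert the Hugging Face checkpoint to TensorRT-LLM format
# (convert_checkpoint.py ships in TensorRT-LLM's examples/llama directory;
# /models/Meta-Llama-3-8B is a local copy of the model)
python convert_checkpoint.py \
  --model_dir /models/Meta-Llama-3-8B \
  --output_dir /models/llama3-ckpt \
  --dtype float16

# Build the TensorRT engine from the converted checkpoint
trtllm-build \
  --checkpoint_dir /models/llama3-ckpt \
  --output_dir /models/llama3-trt \
  --max_batch_size 128

# Serve with TensorRT (triton_server.py stands in for your serving entry point)
mpirun -n 1 python triton_server.py \
  --model /models/llama3-trt

# Throughput: ~180 tokens/sec on H100 (vs. 142 with vLLM)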

Text Generation Inference (TGI):

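# HuggingFace's optimized inference server
io launch --gpu A100 --image ghcr.io/huggingface/text-generation-inference:latest

# --gpus all exposes the GPU to the container; Llama 3 is a gated model,
# so pass a Hugging Face access token
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  --env MODEL_ID=meta-llama/Meta-Llama-3-8B \
  --env HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest

# Easy deployment, good performance (80-90% of vLLM)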

Triton Inference Server (Multi-Model):

# Serve multiple models on one GPU
io launch --gpu L40S --image nvcr.io/nvidia/tritonserver:latest

# Serve Llama 3 8B + Stable Diffusion + Whisper simultaneously
# Dynamic batching and model routing

Autoscaling for Variable Traffic

io.net's per-second billing makes autoscaling cost-effective:

Example: LLM API with Variable Traffic

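# Auto-scale from 2 to 20 GPUs based on in-flight requests per replica

from ray import serve
from starlette.requests import Request

@serve.deployment(
    autoscaling_config={
        "min_replicas": 2,
        "max_replicas": 20,
        "target_ongoing_requests": 10,
    },
    ray_actor_options={"num_gpus": 1},
)
class LlamaInference:
    def __init__(self):
        from vllm import LLM, SamplingParams
        self.model = LLM("meta-llama/Meta-Llama-3-8B")
        self.params = SamplingParams(max_tokens=256)

    async def __call__(self, request: Request) -> str:
        # Expect a JSON body such as {"prompt": "..."}
        prompt = (await request.json())["prompt"]
        # generate() returns one RequestOutput per prompt
        result = self.model.generate([prompt], self.params)
        return result[0].outputs[0].text

serve.run(LlamaInference.bind())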

Cost Breakdown (RTX 4090 at $0.18/hr):

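| Time | Traffic | GPUs Active | Cost/hr | Daily Cost |
| --- | --- | --- | --- | --- |
| 9am-5pm (peak) | 5000 req/hr | 18 GPUs | $3.24 | $25.92 (8hr) |
| 5pm-midnight | 1000 req/hr | 6 GPUs | $1.08 | $7.56 (7hr) |
| midnight-9am | 200 req/hr | 2 GPUs | $0.36 | $3.24 (9hr) |
| Total | | Avg: 8.5 GPUs | | $36.72/day |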

vs. Running 18 GPUs 24/7:
- Fixed 18 GPUs: $0.18 × 18 × 24 = $77.76/day
- Autoscaling: $36.72/day
Savings: $41.04/day (53%)

vs. AWS SageMaker Autoscaling:
- AWS ml.g5.xlarge (A10G): $1.01/hr × avg 12 instances × 24hr = $291/day
- io.net: $36.72/day
Savings: $254.28/day (87%)
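
The daily totals above can be reproduced from the schedule. The sketch below takes the three time windows and GPU counts from the cost breakdown; the function names are illustrative.

```python
PRICE_PER_GPU_HOUR = 0.18  # RTX 4090 price quoted above

# (window, hours in window, GPUs active) from the cost breakdown
schedule = [
    ("9am-5pm", 8, 18),
    ("5pm-midnight", 7, 6),
    ("midnight-9am", 9, 2),
]


def daily_cost(schedule, price_per_gpu_hour):
    """Total daily spend: sum of GPU-hours in each window times the hourly price."""
    return sum(hours * gpus * price_per_gpu_hour for _, hours, gpus in schedule)


def average_gpus(schedule):
    """Time-weighted average number of active GPUs over the day."""
    total_hours = sum(hours for _, hours, _ in schedule)
    gpu_hours = sum(hours * gpus for _, hours, gpus in schedule)
    return gpu_hours / total_hours


autoscaled = daily_cost(schedule, PRICE_PER_GPU_HOUR)
fixed = 18 * 24 * PRICE_PER_GPU_HOUR  # peak capacity kept running all day
savings_pct = (fixed - autoscaled) / fixed * 100

print(f"Autoscaled: ${autoscaled:.2f}/day, average {average_gpus(schedule):.1f} GPUs")
print(f"Fixed 18 GPUs: ${fixed:.2f}/day, savings {savings_pct:.0f}%")
```

Swapping in your own traffic windows shows how much of the saving comes from the overnight trough, which dominates here because it lasts 9 of the 24 hours.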

Real-World Inference Deployments

Case Study 1: Chatbot Startup (Llama 3 8B)

  • Traffic: 50K requests/day (peak 200 req/min, avg 35 req/min)
  • Model: Llama 3 8B with vLLM
  • Configuration: 4-12x RTX 4090 with autoscaling
  • Average usage: 6 GPUs (autoscale based on queue depth)
  • Throughput: ~500 tokens/sec aggregate (6 GPUs × 82 tokens/sec)
  • Monthly cost: $0.18/hr × 6 GPUs × 730 hrs = $788/month
  • AWS equivalent: SageMaker with 6x ml.g5.xlarge = $4,424/month
  • Savings: $3,636/month (82%)

Case Study 2: Image Generation SaaS (Stable Diffusion XL)

  • Traffic: 100K images/day (variable: 1K/hr night, 8K/hr peak)
  • Model: SDXL with ComfyUI
  • Configuration: 2-16x RTX 4090 with autoscaling
  • Average usage: 8 GPUs
  • Throughput: ~14 images/sec aggregate (8 GPUs × 1.7 images/sec)
  • Monthly cost: $0.18/hr × 8 GPUs × 730 hrs = $1,051/month
  • Replicate equivalent: ~$0.015/image × 3M images/month = $45,000/month
  • Savings: $43,949/month (98%)

Case Study 3: Video Transcription Service (Whisper Large V3)

  • Traffic: 50,000 minutes of audio/day
  • Model: Faster Whisper (optimized)
  • Configuration: 4x L40S GPUs
  • Throughput: 30x real-time (process 30 min of audio in 1 minute)
  • Daily processing: 4 GPUs × 30x × 1440 min/day = 172,800 minutes capacity
  • Actual usage: 50K min/day = 29% utilization
  • Monthly cost: $0.75/hr × 4 GPUs × 730 hrs = $2,190/month
  • AWS Transcribe: $0.024/min × 1.5M min/month = $36,000/month
  • Savings: $33,810/month (94%)
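
The capacity and utilization figures in Case Study 3 follow from the real-time factor. This is a sizing sketch, not a measured result: the 70% target utilization in `gpus_needed` is an assumption chosen for illustration, and it sizes for average load only.

```python
import math

MINUTES_PER_DAY = 24 * 60


def daily_capacity_minutes(num_gpus: int, realtime_factor: float) -> float:
    """Minutes of audio the fleet can transcribe per day at full load."""
    return num_gpus * realtime_factor * MINUTES_PER_DAY


def utilization(demand_minutes: float, capacity_minutes: float) -> float:
    """Share of daily capacity actually used."""
    return demand_minutes / capacity_minutes


def gpus_needed(demand_minutes: float, realtime_factor: float,
                target_utilization: float = 0.7) -> int:
    """Smallest GPU count that keeps average utilization at or below the target."""
    per_gpu = realtime_factor * MINUTES_PER_DAY
    return math.ceil(demand_minutes / (per_gpu * target_utilization))


capacity = daily_capacity_minutes(4, 30)
print(capacity, round(utilization(50_000, capacity), 2))
print(gpus_needed(50_000, 30))
```

Average-load sizing suggests fewer GPUs than the case study runs; the extra capacity there leaves room for traffic peaks and redundancy.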

Latency and Performance Optimization

Latency Breakdown (Llama 3 8B on RTX 4090):

| Metric | vLLM | TensorRT-LLM | Notes |
| --- | --- | --- | --- |
| Time to First Token (TTFT) | 15ms | 8ms | Lower is better for chat |
| Inter-token Latency | 12ms | 7ms | Affects streaming experience |
| End-to-End (256 tokens) | 3.2s | 1.9s | Overall request time |
| Throughput (tokens/sec) | 82 | 135 | Batch processing capacity |
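
End-to-end time can be approximated from the first two rows: the first token arrives after TTFT, and each later token adds one inter-token interval. The sketch below is a simplified model that ignores network and request-handling overhead.

```python
def end_to_end_seconds(ttft_ms: float, inter_token_ms: float, output_tokens: int) -> float:
    """Estimated generation time: TTFT plus one interval per remaining token."""
    return (ttft_ms + inter_token_ms * (output_tokens - 1)) / 1000


print(end_to_end_seconds(15, 12, 256))  # vLLM figures from the table
print(end_to_end_seconds(8, 7, 256))    # TensorRT-LLM figures from the table
```

The estimates land slightly under the table's 3.2s and 1.9s, which is expected since the model leaves out everything except token generation. The same formula shows why short responses matter for tight latency budgets: output length, not TTFT, dominates the total.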

Optimizations for Low Latency:

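  1. Enable Flash Attention 2
# 2x faster attention for transformers
# vLLM picks a FlashAttention backend by default on supported GPUs;
# to select it explicitly, set this before launch:
# export VLLM_ATTENTION_BACKEND=FLASH_ATTN
from vllm import LLM
model = LLM("meta-llama/Meta-Llama-3-8B")
  2. Use 8-bit Quantization
# Roughly 2x throughput with minimal quality loss
# vLLM has no "int8" option; pass a supported method such as "fp8"
# (Ada/Hopper GPUs) or load a pre-quantized AWQ/GPTQ checkpoint
model = LLM("meta-llama/Meta-Llama-3-8B", quantization="fp8")
  3. Tune Max Batch Size
# Balance latency vs. throughput
model = LLM("meta-llama/Meta-Llama-3-8B", max_num_batched_tokens=8192)
# Lower = lower latency, higher = higher throughput
  4. Regional Deployment
# Deploy GPUs close to users
io launch --gpu RTX4090 --region us-west  # West Coast users
io launch --gpu RTX4090 --region eu-west  # European users
# Reduces network latency by 50-150ms
  5. Use Speculative Decoding
# Draft model generates, main model verifies (1.5-2x speedup)
# Argument names follow vLLM 0.4.x; newer releases configure this differently
model = LLM(
    "meta-llama/Meta-Llama-3-70B",
    tensor_parallel_size=2,  # 70B in FP16 does not fit on one GPU
    speculative_model="meta-llama/Meta-Llama-3-8B",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)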

Batch Inference for High Throughput

For non-interactive workloads (batch processing, embeddings), maximize throughput:

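# Process 10,000 prompts in batch
from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")
params = SamplingParams(max_tokens=128)

# documents: your list of input texts (10K items in this example)
prompts = [f"Summarize: {text}" for text in documents]
outputs = model.generate(prompts, params)
summaries = [out.outputs[0].text for out in outputs]

# Throughput: ~2,000 prompts/hr on single RTX 4090
# Cost: $0.18/hr ÷ 2000 = $0.00009 per summary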

Batch Inference Pricing:

| Workload | GPU | Throughput | Cost/hr | Cost per 1000 Outputs |
| --- | --- | --- | --- | --- |
| Text summarization (128 tokens) | RTX 4090 | 2,000/hr | $0.18 | $0.09 |
| Embeddings (sentence-transformers) | RTX 4090 | 50K/hr | $0.18 | $0.0036 |
| Image classification (ResNet-50) | RTX 4090 | 100K/hr | $0.18 | $0.0018 |
| SD image generation (512x512) | RTX 4090 | 6,000/hr | $0.18 | $0.03 |

Multi-GPU Inference for High Concurrency

Serve large models across multiple GPUs:

# Llama 3 70B across 2x H100 (tensor parallelism)
io launch --gpu H100 --count 2 --network nvlink

# vLLM with tensor parallelism
vllm serve meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 2 \
  --max-model-len 8192

# Throughput: 96 tokens/sec (2x H100)
# Cost: $4.40/hr (2x $2.20)
# vs. 4x RTX 4090 (quantized): ~48 tokens/sec, $0.72/hr

When to use multi-GPU inference:
- Models too large for a single GPU's VRAM (e.g., Llama 3 70B in FP16 needs ~140GB)
- Extremely high concurrency (100+ simultaneous users)
- Low latency requirements (<10ms TTFT)

When to use multiple single-GPU instances:
- Better price-performance for most workloads
- Easier autoscaling (scale 1 GPU at a time)
- Fault tolerance (one GPU failure doesn't take down entire service)

Monitoring and Observability

Track inference performance in real-time:

# io.net dashboard shows:
# - Requests per second
# - Tokens per second
# - GPU utilization
# - Cost per 1K tokens
# - Latency p50/p95/p99

io dashboard io-llama-inference-cluster

Integrate with monitoring tools:

# Prometheus metrics
from prometheus_client import Counter, Histogram

requests_total = Counter('inference_requests_total', 'Total requests')
latency = Histogram('inference_latency_seconds', 'Request latency')

@latency.time()
def run_inference(prompt):
    requests_total.inc()
    return model.generate(prompt)
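
Prometheus estimates percentiles from histogram buckets. For offline analysis of raw timings, p50/p95/p99 can be computed directly. The sketch below assumes you have collected per-request latencies in a list; the function name is illustrative.

```python
import statistics


def latency_percentiles(samples_seconds):
    """Return p50, p95, and p99 from a list of request latencies in seconds."""
    # quantiles(n=100) returns 99 cut points; cut point k-1 is the k-th percentile
    cuts = statistics.quantiles(samples_seconds, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic example: 101 evenly spaced latencies from 0.00s to 1.00s
samples = [i / 100 for i in range(101)]
print(latency_percentiles(samples))
```

Tail percentiles need enough samples to be meaningful; with only a few hundred requests, p99 is driven by a handful of observations.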

What's the cheapest way to run LLM inference?

RTX 4090 at $0.18/hr provides the best price-performance: 82 tokens/sec for Llama 3 8B works out to about $0.61 per 1M tokens ($0.0006 per 1K) at full utilization. This is 7x cheaper than H100 SXM and roughly 8-25x cheaper than the $0.005-0.015 per 1K tokens cited for AWS SageMaker or OpenAI API. For 70B models, quantized inference on 4x RTX 4090 ($0.72/hr) is more cost-effective than 1x H100 ($2.20/hr) if you can tolerate slight quality loss from quantization.

Can io.net handle real-time inference with <100ms latency?

Yes, for time-to-first-token (TTFT), which is 8-15ms for optimized LLMs on H100/RTX 4090. End-to-end latency (TTFT + generation) for 256-token responses is 1.9-3.2 seconds, so a sub-100ms end-to-end budget only works for very short outputs. To minimize latency, use smaller models (Llama 3 8B vs. 70B), enable speculative decoding, and deploy GPUs in the same region as your users. For <10ms TTFT, consider TensorRT-LLM with FP8 on H100.

How does inference pricing compare to OpenAI/Anthropic APIs?

At high volume, io.net self-hosted inference is cheaper per token, though the margin depends on how busy you keep the GPU. Llama 3 8B on RTX 4090: about $0.0006 per 1K tokens at 82 tokens/sec ($0.18/hr ÷ 295,200 tokens/hr). OpenAI GPT-3.5 Turbo: $0.0015 per 1K tokens (about 2.5x more expensive). Claude 3 Haiku: $0.0008 per 1K tokens (about 1.3x more expensive). Batched serving raises tokens/sec per GPU and widens the gap. Break-even point: ~1-5M tokens/month. Below that, API services are cheaper due to no infrastructure management. Above that, self-hosted wins.

Can I run Stable Diffusion inference at scale on io.net?

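Yes. RTX 4090 generates SDXL images at 1.7 images/sec (512x512) or 0.8 images/sec (1024x1024) for $0.18/hr. At full utilization that works out to roughly $0.03-0.06 per 1,000 images. At the utilization in the case study above ($1,051/month for 3M images) it is about $0.35 per 1,000 images, still roughly 40x cheaper than Replicate ($0.015/image) and over 50x cheaper than Stability AI API ($0.02/image). For 100K images/day, use 8-16x RTX 4090 with autoscaling (~$1,000-2,000/month vs. $45,000/month on Replicate).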

Do I need to manage inference serving infrastructure myself?

Partially. io.net provides the GPU infrastructure, but you deploy your own inference server (vLLM, TGI, Triton). Use pre-built containers for quick setup: io launch --gpu RTX4090 --image vllm/vllm-openai:latest gives you an OpenAI-compatible API in 60 seconds. For fully managed inference, consider io.net's Enterprise Managed Inference (launching Q3 2026) which handles deployment, monitoring, and autoscaling for you.

Deploy Inference Workloads on io.net

Start serving AI models 70% cheaper than AWS:
- RTX 4090 at $0.18/hr - Best price-performance for inference
- vLLM pre-configured - OpenAI-compatible API in 60 seconds
- Autoscaling - Scale from 1 to 100+ GPUs based on traffic
- Per-second billing - Pay only for active inference time



Last updated: April 2026 | Inference benchmarks based on vLLM 0.4.2 with optimized settings