FAQ: Can I Run Inference Workloads on io.net?

Quick Answer

Yes, io.net is optimized for AI inference workloads, from small-scale API serving to high-throughput production deployments. You can run LLM inference (vLLM, TensorRT-LLM), image generation (Stable Diffusion, ComfyUI), speech recognition (Whisper), and custom models with autoscaling from 1 to 100+ GPUs. RTX 4090 GPUs at $0.18/hr provide excellent price-performance for inference (about 58% of H100 SXM throughput on Llama 3 8B at about 8% of the hourly cost), while H100s excel at high-concurrency serving. io.net's per-second billing and fast spin-up (<60 seconds) suit variable traffic patterns: scale to 20 GPUs during peak hours, scale down to 2 GPUs overnight, and pay only for actual usage. Expect 75% cost savings vs. AWS SageMaker for equivalent inference throughput.

Inference Use Cases Supported

Large Language Model (LLM) Inference:
- Llama 3 8B/70B, Mistral 7B, GPT-style models
- Optimizations: vLLM, TensorRT-LLM, Text Generation Inference (TGI)
- Throughput: 50-500 tokens/sec depending on GPU and model size
- Cost: $0.00008-0.0003 per 1K tokens (vs. $0.0006-0.002 on AWS)

Image Generation:
- Stable Diffusion XL, SDXL Turbo, ControlNet
- ComfyUI and Automatic1111 one-click deployments
- Throughput: 0.6-2.5 images/sec on RTX 4090
- Cost: $0.0001-0.0004 per image, depending on utilization (vs. $0.008-0.02 on Replicate)

Speech-to-Text:
- Whisper Large V3, Faster Whisper
- Real-time transcription and batch processing
- Throughput: 10-50x real-time on H100
- Cost: $0.0001-0.0005 per minute of audio

Computer Vision:
- Object detection (YOLO, DETR), segmentation, classification
- Video understanding and analysis
- Throughput: 30-500 FPS depending on model and GPU
- Cost: $0.00001-0.0001 per frame

Multimodal Models:
- CLIP, BLIP, LLaVA for vision-language tasks
- Text-to-image, image-to-text, visual question answering
- Cost-efficient deployment on RTX 4090 or L40S

GPU Recommendations for Inference

Choose the right GPU based on model size, throughput needs, and budget:

For LLM Inference (Llama 3 8B):

GPU	Tokens/sec (FP16)	Tokens/sec (INT8)	Cost/hr	Cost per 1M tokens (FP16)
H100 SXM	142	385	$2.20	$4.30
H100 PCIe	118	320	$1.49	$3.51
A100 80GB	95	178	$1.49	$4.36
L40S	87	165	$0.75	$2.39
RTX 4090	82	156	$0.18	$0.61 ⭐

Best value: RTX 4090 at $0.61 per 1M tokens (7x cheaper than H100 SXM)
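Per-token cost follows directly from hourly price and sustained throughput. A minimal sketch of that arithmetic, assuming single-stream FP16 throughput and 100% utilization; the function name is illustrative and not part of any io.net tooling:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate 1M tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Inputs: hourly price and FP16 tokens/sec from the Llama 3 8B table
rtx4090 = cost_per_million_tokens(0.18, 82)
h100_sxm = cost_per_million_tokens(2.20, 142)

print(f"RTX 4090: ${rtx4090:.2f} per 1M tokens")
print(f"H100 SXM: ${h100_sxm:.2f} per 1M tokens")
print(f"H100 SXM / RTX 4090 cost ratio: {h100_sxm / rtx4090:.1f}x")
```

Batched serving usually pushes aggregate tokens/sec per GPU well above the single-stream figure, so treat these numbers as an upper bound on cost per token.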

For Llama 3 70B Inference:

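GPU	Tokens/sec (FP16)	Tokens/sec (INT8)	Cost/hr	Fit in VRAM?
H100 SXM	48	125	$2.20	✅ INT8 on one card (80GB); FP16 needs 2 cards
H100 PCIe	41	108	$1.49	✅ INT8 on one card (80GB); FP16 needs 2 cards
A100 80GB	32	74	$1.49	✅ INT8 on one card (80GB); FP16 needs 2 cards
RTX 4090	N/A	~12 (quantized)	$0.18	⚠️ Requires quantization and multiple cards (24GB each)

Note: Llama 3 70B weights are about 140GB in FP16 and about 70GB in INT8, so FP16 serving needs at least two 80GB cards with tensor parallelism.

Best choice: H100 PCIe at $1.49/hr for production 70B serving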
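Whether a model fits on a card is mostly a question of weight size: parameter count times bytes per parameter. A rough sketch, ignoring KV cache and activation memory (both need extra headroom); `weights_gb` is an illustrative helper, not a library function:

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "int8", "int4"):
    size = weights_gb(70e9, precision)
    print(f"70B {precision}: ~{size:.0f} GB | "
          f"fits one 80GB card: {size < 80} | fits one 24GB card: {size < 24}")
```

By this estimate a 70B model needs about 140GB in FP16, 70GB in INT8, and 35GB at 4-bit, so a single 24GB card cannot hold it at any of these precisions.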

For Stable Diffusion XL:

GPU	Images/sec (512x512)	Images/sec (1024x1024)	Cost/hr	Cost per 1000 images
H100	2.5	1.2	$2.20	$0.88
RTX 4090	1.7	0.8	$0.18	$0.11 ⭐
A100	1.4	0.7	$1.49	$1.06
RTX 3090	0.9	0.4	$0.28	$0.31

Best value: RTX 4090 at $0.11 per 1000 images (8x cheaper than H100)

Inference Optimization Frameworks

io.net provides pre-optimized containers for maximum inference performance:

vLLM (Recommended for LLMs):

# Launch vLLM-optimized Llama 3 8B serving
io launch --gpu RTX4090 --image vllm/vllm-openai:latest

# Inside container:
# Model name is the Hugging Face repo ID (gated; requires an access token)
vllm serve meta-llama/Meta-Llama-3-8B \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --tensor-parallel-size 1

# OpenAI-compatible API endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B",
    "prompt": "Explain quantum computing",
    "max_tokens": 256
  }'

# Throughput: ~82 tokens/sec on RTX 4090
# Latency: ~15ms time-to-first-token
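
The same endpoint can be called from Python. A minimal sketch using only the standard library; it assumes the server started above is reachable at `http://localhost:8000` and that the model name you pass matches the one the server was launched with. `build_completion_request` and `complete` are illustrative helper names:

```python
import json
import urllib.request

def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> str:
    """Send the request and return the text of the first completion."""
    req = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]

# Usage (needs a running server):
# text = complete("http://localhost:8000", served_model_name, "Explain quantum computing")
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU.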

TensorRT-LLM (Maximum Performance):

# Faster than vLLM (about 1.3x in the H100 figures below), but requires model conversion
io launch --gpu H100 --image tensorrt-llm:latest

# Convert model to TensorRT
python convert_checkpoint.py \
  --model meta-llama/Llama-3-8B \
  --output /models/llama3-trt

# Serve with TensorRT
mpirun -n 1 python triton_server.py \
  --model /models/llama3-trt \
  --max-batch-size 128

# Throughput: ~180 tokens/sec on H100 (vs. 142 with vLLM)

Text Generation Inference (TGI):

# HuggingFace's optimized inference server
io launch --gpu A100 --image ghcr.io/huggingface/text-generation-inference:latest

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  --env MODEL_ID=meta-llama/Meta-Llama-3-8B \
  ghcr.io/huggingface/text-generation-inference:latest

# Easy deployment, good performance (80-90% of vLLM)

Triton Inference Server (Multi-Model):

# Serve multiple models on one GPU
io launch --gpu L40S --image nvcr.io/nvidia/tritonserver:latest

# Serve Llama 3 8B + Stable Diffusion + Whisper simultaneously
# Dynamic batching and model routing

Autoscaling for Variable Traffic

io.net's per-second billing makes autoscaling cost-effective:

Example: LLM API with Variable Traffic

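# Auto-scale from 2 to 20 GPUs based on ongoing requests per replica

from ray import serve
from ray.serve.config import AutoscalingConfig
from starlette.requests import Request

@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        min_replicas=2,
        max_replicas=20,
        # Called target_num_ongoing_requests_per_replica in older Ray releases
        target_ongoing_requests=10,
    ),
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
)
class LlamaInference:
    def __init__(self):
        from vllm import LLM
        self.model = LLM("meta-llama/Meta-Llama-3-8B")

    async def __call__(self, request: Request) -> str:
        # Expects a JSON body of the form {"prompt": "..."}
        prompt = (await request.json())["prompt"]
        # generate() returns a list of RequestOutput; return only the generated text
        return self.model.generate(prompt)[0].outputs[0].text

serve.run(LlamaInference.bind())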

Cost Breakdown (RTX 4090 at $0.18/hr):

Time	Traffic	GPUs Active	Cost/hr	Daily Cost
9am-5pm (peak)	5000 req/hr	18 GPUs	$3.24	$25.92 (8hr)
5pm-midnight	1000 req/hr	6 GPUs	$1.08	$7.56 (7hr)
midnight-9am	200 req/hr	2 GPUs	$0.36	$3.24 (9hr)
Total		Avg: 8.5 GPUs		$36.72/day

vs. Running 18 GPUs 24/7:
- Fixed 18 GPUs: $0.18 × 18 × 24 = $77.76/day
- Autoscaling: $36.72/day
- Savings: $41.04/day (53%)

vs. AWS SageMaker Autoscaling:
- AWS ml.g5.xlarge (A10G): $1.01/hr × avg 12 instances × 24hr = $291/day
- io.net: $36.72/day
- Savings: $254.28/day (87%)
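
The autoscaling comparison reduces to summing GPU-hours per time window. A short sketch that reproduces the arithmetic from the schedule in the cost breakdown; the tuple layout is an illustrative choice:

```python
PRICE_PER_GPU_HOUR = 0.18  # RTX 4090 rate used in the cost breakdown

# (window, hours in window, GPUs active)
schedule = [
    ("9am-5pm (peak)", 8, 18),
    ("5pm-midnight", 7, 6),
    ("midnight-9am", 9, 2),
]

gpu_hours = sum(hours * gpus for _, hours, gpus in schedule)
total_hours = sum(hours for _, hours, _ in schedule)
peak_gpus = max(gpus for _, _, gpus in schedule)

autoscaled_cost = gpu_hours * PRICE_PER_GPU_HOUR
fixed_cost = peak_gpus * total_hours * PRICE_PER_GPU_HOUR  # peak fleet running 24/7

print(f"Average GPUs: {gpu_hours / total_hours:.1f}")
print(f"Autoscaled: ${autoscaled_cost:.2f}/day")
print(f"Fixed at peak: ${fixed_cost:.2f}/day")
print(f"Savings: {1 - autoscaled_cost / fixed_cost:.0%}")
```

Swapping in your own traffic windows and GPU counts gives a quick estimate before committing to an autoscaling policy.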

Real-World Inference Deployments

Case Study 1: Chatbot Startup (Llama 3 8B)

Traffic: 50K requests/day (peak 200 req/min, avg 35 req/min)
Model: Llama 3 8B with vLLM
Configuration: 4-12x RTX 4090 with autoscaling
Average usage: 6 GPUs (autoscale based on queue depth)
Throughput: ~500 tokens/sec aggregate (6 GPUs × 82 tokens/sec)
Monthly cost: $0.18/hr × 6 GPUs × 730 hrs = $788/month
AWS equivalent: SageMaker with 6x ml.g5.xlarge = $4,424/month
Savings: $3,636/month (82%)

Case Study 2: Image Generation SaaS (Stable Diffusion XL)

Traffic: 100K images/day (variable: 1K/hr night, 8K/hr peak)
Model: SDXL with ComfyUI
Configuration: 2-16x RTX 4090 with autoscaling
Average usage: 8 GPUs
Throughput: ~14 images/sec aggregate (8 GPUs × 1.7 images/sec)
Monthly cost: $0.18/hr × 8 GPUs × 730 hrs = $1,051/month
Replicate equivalent: ~$0.015/image × 3M images/month = $45,000/month
Savings: $43,949/month (98%)

Case Study 3: Video Transcription Service (Whisper Large V3)

Traffic: 50,000 minutes of audio/day
Model: Faster Whisper (optimized)
Configuration: 4x L40S GPUs
Throughput: 30x real-time (process 30 min of audio in 1 minute)
Daily processing: 4 GPUs × 30x × 1440 min/day = 172,800 minutes capacity
Actual usage: 50K min/day = 29% utilization
Monthly cost: $0.75/hr × 4 GPUs × 730 hrs = $2,190/month
AWS Transcribe: $0.024/min × 1.5M min/month = $36,000/month
Savings: $33,810/month (94%)
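
The capacity and utilization figures in Case Study 3 can be checked with a few lines. A sketch under the case study's own assumptions (4 GPUs, 30x real-time, 730 billed hours per month, 30-day month); the helper name is illustrative:

```python
def daily_capacity_minutes(num_gpus: int, realtime_factor: float) -> float:
    """Audio minutes a fleet can transcribe per day at a given real-time speed-up."""
    return num_gpus * realtime_factor * 24 * 60

capacity = daily_capacity_minutes(num_gpus=4, realtime_factor=30)
demand = 50_000                  # audio minutes per day
monthly_cost = 0.75 * 4 * 730    # L40S hourly rate x GPUs x hours per month
monthly_minutes = demand * 30

print(f"Capacity: {capacity:,.0f} min/day")
print(f"Utilization: {demand / capacity:.0%}")
print(f"Effective cost: ${monthly_cost / monthly_minutes:.5f} per audio minute")
```

Because the fleet is only about 29% utilized, the effective per-minute cost is higher than it would be at full load; autoscaling down during quiet periods would narrow that gap.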

Latency and Performance Optimization

Latency Breakdown (Llama 3 8B on RTX 4090):

Metric	vLLM	TensorRT-LLM	Optimization
Time to First Token (TTFT)	15ms	8ms	Lower is better for chat
Inter-token Latency	12ms	7ms	Affects streaming experience
End-to-End (256 tokens)	3.2s	1.9s	Overall request time
Throughput (tokens/sec)	82	135	Batch processing capacity
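
End-to-end time for a streamed response is roughly the time to first token plus one inter-token gap for every later token. A small sketch of that model using the TTFT and inter-token figures from the table; it deliberately ignores queueing and network time:

```python
def end_to_end_ms(ttft_ms: float, inter_token_ms: float, num_tokens: int) -> float:
    """First token arrives after TTFT; each later token adds one inter-token gap."""
    return ttft_ms + (num_tokens - 1) * inter_token_ms

for name, ttft, itl in [("vLLM", 15, 12), ("TensorRT-LLM", 8, 7)]:
    total = end_to_end_ms(ttft, itl, 256)
    print(f"{name}: {total:.0f} ms for 256 tokens, "
          f"~{1000 / itl:.0f} tokens/sec single-stream")
```

The estimates land slightly below the table's end-to-end figures, which is consistent with some per-request overhead outside the decode loop.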

Optimizations for Low Latency:

Enable Flash Attention 2

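# 2x faster attention for transformers; vLLM selects its attention backend
# automatically, and VLLM_ATTENTION_BACKEND can request FlashAttention explicitly
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # set before creating the engine
from vllm import LLM
model = LLM("meta-llama/Meta-Llama-3-8B")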

Use INT8 Quantization

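# 2x throughput with minimal quality loss
# vLLM expects a pre-quantized checkpoint (placeholder path below) and, in recent
# versions, reads the quantization method from the checkpoint's config
model = LLM("path/to/llama-3-8b-int8-w8a8")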

Tune Max Batch Size

# Balance latency vs. throughput
model = LLM("meta-llama/Meta-Llama-3-8B", max_num_batched_tokens=8192)
# Lower = lower latency, higher = higher throughput

Regional Deployment

# Deploy GPUs close to users
io launch --gpu RTX4090 --region us-west  # West Coast users
io launch --gpu RTX4090 --region eu-west  # European users
# Reduces network latency by 50-150ms

Use Speculative Decoding

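# Draft model generates, main model verifies (1.5-2x speedup)
# Argument names follow vLLM 0.4.x; newer releases use a speculative_config dict
model = LLM("meta-llama/Meta-Llama-3-70B", speculative_model="meta-llama/Meta-Llama-3-8B",
            num_speculative_tokens=5, use_v2_block_manager=True)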

Batch Inference for High Throughput

For non-interactive workloads (batch processing, embeddings), maximize throughput:

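# Process 10,000 prompts in batch
from vllm import LLM, SamplingParams

model = LLM("meta-llama/Meta-Llama-3-8B")

# `documents` is your own list of input texts
prompts = [f"Summarize: {text}" for text in documents]  # 10K prompts
# Generation limits are passed through SamplingParams, not as keyword arguments
outputs = model.generate(prompts, SamplingParams(max_tokens=128))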

# Throughput: ~2,000 prompts/hr on single RTX 4090
# Cost: $0.18/hr ÷ 2000 = $0.00009 per summary

Batch Inference Pricing:

Workload	GPU	Throughput	Cost/hr	Cost per 1000 Outputs
Text summarization (128 tokens)	RTX 4090	2000/hr	$0.18	$0.09
Embeddings (sentence-transformers)	RTX 4090	50K/hr	$0.18	$0.0036
Image classification (ResNet-50)	RTX 4090	100K/hr	$0.18	$0.0018
SD image generation (512x512)	RTX 4090	6,000/hr	$0.18	$0.03

Multi-GPU Inference for High Concurrency

Serve large models across multiple GPUs:

# Llama 3 70B across 2x H100 (tensor parallelism)
io launch --gpu H100 --count 2 --network nvlink

# vLLM with tensor parallelism
vllm serve meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 2 \
  --max-model-len 8192

# Throughput: 96 tokens/sec (2x H100)
# Cost: $4.40/hr (2x $2.20)
# vs. 4x RTX 4090 (quantized): ~48 tokens/sec, $0.72/hr

When to use multi-GPU inference:
- Models whose weights exceed a single GPU's VRAM (for example, Llama 3 70B in FP16 at about 140GB)
- Extremely high concurrency (100+ simultaneous users)
- Low latency requirements (<10ms TTFT)

When to use multiple single-GPU instances:
- Better price-performance for most workloads
- Easier autoscaling (scale 1 GPU at a time)
- Fault tolerance (one GPU failure doesn't take down entire service)
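
A quick way to see which side of this choice a model falls on is to estimate the smallest tensor-parallel degree that holds its weights. A sketch under two assumptions that are mine rather than io.net's: about 90% of VRAM is usable for weights, and the GPU count is rounded up to a power of two (tensor-parallel size generally has to divide the model's attention-head count, and powers of two are a common safe choice):

```python
import math

def min_tensor_parallel_size(weights_gb: float, vram_gb: float,
                             usable_fraction: float = 0.9) -> int:
    """Smallest power-of-two GPU count whose combined usable VRAM holds the weights.

    usable_fraction is an assumed allowance; KV cache needs room on top of weights.
    """
    needed = math.ceil(weights_gb / (vram_gb * usable_fraction))
    return 1 << max(0, needed - 1).bit_length()  # round up to a power of two

print(min_tensor_parallel_size(140, 80))  # 70B FP16 weights on 80GB cards
print(min_tensor_parallel_size(35, 24))   # 70B 4-bit weights on 24GB cards
print(min_tensor_parallel_size(16, 24))   # 8B FP16 weights on a 24GB card
```

This is a lower bound on weights alone. Long contexts and high concurrency need additional KV-cache memory, which is one reason to provision more cards than the minimum.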

Monitoring and Observability

Track inference performance in real-time:

# io.net dashboard shows:
# - Requests per second
# - Tokens per second
# - GPU utilization
# - Cost per 1K tokens
# - Latency p50/p95/p99

io dashboard io-llama-inference-cluster

Integrate with monitoring tools:

# Prometheus metrics
from prometheus_client import Counter, Histogram

requests_total = Counter('inference_requests_total', 'Total requests')
latency = Histogram('inference_latency_seconds', 'Request latency')

@latency.time()
def run_inference(prompt):
    requests_total.inc()
    return model.generate(prompt)
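
The p50/p95/p99 latency figures are percentiles over per-request timings. If you collect raw timings yourself, a nearest-rank calculation is enough for a first look. The sample data below is made up for illustration, and `percentile` is a hypothetical helper rather than part of any monitoring library:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 14, 13, 250, 16, 15, 14, 13, 15]  # made-up request timings
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

A single slow request dominates p95 and p99 in a small sample like this one, which is why tail percentiles are worth tracking alongside the median.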

What's the cheapest way to run LLM inference?

RTX 4090 at $0.18/hr provides the best price-performance: 82 tokens/sec for Llama 3 8B works out to about $0.61 per 1M tokens ($0.00061 per 1K) at single-stream FP16 throughput. That is about 7x cheaper than H100 SXM on the same basis and roughly 8-25x cheaper than AWS SageMaker or OpenAI API at $0.005-0.015 per 1K tokens. For 70B models, quantized inference on 4x RTX 4090 ($0.72/hr) is more cost-effective than 1x H100 ($2.20/hr) if you can tolerate slight quality loss from quantization.

Can io.net handle real-time inference with <100ms latency?

Yes, for time-to-first-token. TTFT is 8-15ms for optimized LLMs on H100/RTX 4090, so streamed responses start well within 100ms. A complete 256-token response takes 1.9-3.2 seconds, because each additional token adds roughly 7-12ms. To keep end-to-end latency under 50ms, limit responses to a few tokens, use smaller models (Llama 3 8B vs. 70B), enable speculative decoding, and deploy GPUs in the same region as your users. For TTFT under 10ms, consider TensorRT-LLM with FP8 on H100.

How does inference pricing compare to OpenAI/Anthropic APIs?

At the prices listed here, io.net self-hosted inference is cheaper per token for high-volume workloads. Llama 3 8B on RTX 4090: about $0.00061 per 1K tokens (82 tokens/sec at $0.18/hr). OpenAI GPT-3.5 Turbo: $0.0015 per 1K tokens (about 2.5x more expensive). Claude 3 Haiku: $0.0008 per 1K tokens (about 1.3x more expensive). Batched serving typically raises aggregate throughput per GPU, which widens the gap. Break-even point: ~1-5M tokens/month. Below that, API services are cheaper because there is no infrastructure to manage. Above that, self-hosted wins.

Can I run Stable Diffusion inference at scale on io.net?

Yes. RTX 4090 generates SDXL images at 1.7 images/sec (512x512) or 0.8 images/sec (1024x1024) for $0.18/hr. The GPU comparison above puts this at $0.11 per 1,000 images, more than 100x cheaper per image than Replicate ($0.015/image) or Stability AI API ($0.02/image). For 100K images/day, use 8-16x RTX 4090 with autoscaling (~$1,000-2,000/month vs. $45,000/month on Replicate), which is still roughly 20-45x cheaper after accounting for idle capacity.

Do I need to manage inference serving infrastructure myself?

Partially. io.net provides the GPU infrastructure, but you deploy your own inference server (vLLM, TGI, Triton). Use pre-built containers for quick setup: io launch --gpu RTX4090 --image vllm/vllm-openai:latest gives you an OpenAI-compatible API in 60 seconds. For fully managed inference, consider io.net's Enterprise Managed Inference (launching Q3 2026) which handles deployment, monitoring, and autoscaling for you.

Deploy Inference Workloads on io.net

Start serving AI models 70% cheaper than AWS:
- RTX 4090 at $0.18/hr - Best price-performance for inference
- vLLM pre-configured - OpenAI-compatible API in 60 seconds
- Autoscaling - Scale from 1 to 100+ GPUs based on traffic
- Per-second billing - Pay only for active inference time


Last updated: April 2026 | Inference benchmarks based on vLLM 0.4.2 with optimized settings