Mixture of Experts (MoE) architectures have become a dominant design pattern for frontier language models. Mixtral, DeepSeek V3, and Grok-1 all use some variant of sparse expert routing, and GPT-4 is widely reported to do the same. The appeal is straightforward: an MoE model can reach the quality of a dense model with up to 5-10x its active parameter count, while activating only a fraction of its total weights per token.
But hosting MoE models introduces a unique set of infrastructure challenges. A model with 600 billion total parameters but only 40 billion active per forward pass still needs enough memory to store all 600 billion parameters. The expert routing logic creates unpredictable memory access patterns. And the cost math works differently than dense models.
io.net's GPU marketplace is well-suited for MoE workloads because it offers flexible multi-GPU configurations with high-bandwidth interconnects at competitive prices. H100 80GB GPUs at approximately $2.49/hr let you build MoE serving clusters at a fraction of hyperscaler costs.
This guide covers how MoE models work under the hood, what they demand from infrastructure, how to deploy them efficiently, and how to keep costs under control.
How MoE Models Work Under the Hood
Before we talk infrastructure, it helps to understand the mechanics of MoE, because they drive every hosting decision that follows.
The Core Mechanism
A standard transformer layer processes every token through the same feed-forward network (FFN). An MoE layer replaces that single FFN with N parallel "expert" networks and a small router that decides which experts activate for each token.
Only a small number k of the N experts activate per token: 2 of 8 in Mixtral, 4 of 16 in DBRX, 8 of 256 in DeepSeek V3. The rest sit idle in memory for that token. This is what makes MoE "sparse" --- the total parameter count is large, but the compute per token is modest.
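To make the routing step concrete, here is a minimal pure-Python sketch of top-k routing for a single token. The function names and the toy "experts" are illustrative assumptions, not any library's API, and renormalising with a softmax over only the selected logits is one common choice (Mixtral-style); other models weight their experts differently.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k experts of one token."""
    # Pick the k experts with the highest router score.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Renormalise over the selected experts only, so the weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(token)
    for idx, w in route_token(router_logits, top_k):
        y = experts[idx](token)
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Toy example: 4 hypothetical "experts" that just scale the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
selected = route_token([0.1, 2.0, -1.0, 2.0], top_k=2)
output = moe_layer([1.0, 1.0], [0.1, 2.0, -1.0, 2.0], experts)
```

Only the selected experts are ever called, which is where the compute saving comes from; all four still have to be resident somewhere for the call to be possible.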
Key Numbers for Popular MoE Models
| Model | Total Params | Active Params | Experts | Active/Token | VRAM (FP16) |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | ~93 GB |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | ~282 GB |
| DeepSeek V3 | 671B | 37B | 256 | 8 | ~1.34 TB |
| DBRX | 132B | 36B | 16 | 4 | ~264 GB |
| Grok-1 | 314B | ~86B | 8 | 2 | ~628 GB |
The takeaway: MoE models need memory proportional to their total parameter count, but compute proportional to their active parameter count.
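A rough way to see that split in numbers, as a sketch: weight memory scales with total parameters, while per-token compute is approximated here as 2 FLOPs per active parameter. That rule of thumb and the helper name `moe_footprint` are assumptions for illustration; it ignores attention and KV-cache costs.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param=2):
    """Return (weight memory in GB, approximate GFLOPs per generated token)."""
    weight_gb = total_params_b * bytes_per_param   # 1B params at 1 byte each ~ 1 GB
    gflops_per_token = 2 * active_params_b         # ~2 FLOPs per active weight
    return weight_gb, gflops_per_token

# Parameter counts taken from the table above.
for name, total, active in [("Mixtral 8x7B", 46.7, 12.9),
                            ("Mixtral 8x22B", 141, 39),
                            ("DeepSeek V3", 671, 37)]:
    gb, gflops = moe_footprint(total, active)
    print(f"{name}: {gb:.0f} GB of weights, ~{gflops:.0f} GFLOPs/token")
```

DeepSeek V3 needs almost five times the memory of Mixtral 8x22B but slightly less compute per token, which is the pattern that shapes the rest of this guide.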
Why MoE Is Challenging to Host
Memory wall: Mixtral 8x22B needs 282 GB in FP16 --- that fills most of 4 H100 80GB GPUs (320 GB) with weights alone, leaving under 40 GB for KV cache and activations. A dense model of comparable quality (Llama 3.1 70B, about 140 GB in FP16) fits on 2 GPUs.
Irregular memory access: Expert routing sends different tokens to different experts, creating non-sequential memory access patterns that are harder for GPU memory controllers to prefetch and optimize.
Load imbalance: Some experts get activated far more than others. This creates hotspots that bottleneck overall throughput.
Inter-GPU communication: When experts reside on different GPUs (expert parallelism), tokens must traverse the NVLink or InfiniBand fabric to reach the right expert. Bandwidth becomes the critical path.
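Load imbalance is easy to quantify once you log which experts each token was routed to. The sketch below assumes you can export those assignments as a flat list of expert indices; the function name and the max-over-mean metric are illustrative choices, not a standard.

```python
from collections import Counter

def expert_load_stats(assignments, num_experts):
    """assignments: iterable of expert indices, one entry per (token, expert) pair."""
    counts = Counter(assignments)
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean = sum(loads) / num_experts
    # 1.0 means perfectly even load; 2.0 means the hottest expert handles
    # twice the average and sets the pace for the whole layer.
    imbalance = max(loads) / mean if mean else 0.0
    return loads, imbalance

loads, imbalance = expert_load_stats([0, 0, 0, 1, 2, 0, 3, 0], num_experts=4)
print(loads, imbalance)
```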
Infrastructure Requirements
GPU Memory Planning
The first step is calculating how much VRAM you need:
import math

def moe_vram_estimate(total_params_b, precision_bytes, kv_cache_gb, overhead=1.15):
    """Estimate total VRAM (GB) for MoE model hosting."""
    # 1B parameters at 1 byte each is ~1 GB, so this is the weight size in GB.
    model_gb = total_params_b * precision_bytes
    # overhead is a safety margin (e.g. activations, allocator fragmentation).
    total = (model_gb + kv_cache_gb) * overhead
    return total

# Mixtral 8x22B in FP16 with 8K context, batch 8
vram = moe_vram_estimate(141, 2, kv_cache_gb=40)
gpus_needed = math.ceil(vram / 80)  # Round up; truncating vram first can undercount
print(f"Need {vram:.0f} GB VRAM -> {gpus_needed} H100 GPUs")
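The `kv_cache_gb` argument above is a budget you choose. One way to sanity-check it is to compute the cache size from the model shape, as in this sketch. The layer count, KV-head count, and head dimension below are assumed values for a Mixtral-8x22B-like configuration; read the real ones from the model's config before relying on the result.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                bytes_per_value=2):
    """KV-cache size in GB: one key and one value vector per layer per token."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * seq_len * batch_size / 1e9

# Assumed shape: 56 layers, 8 KV heads (grouped-query attention), head_dim 128.
needed = kv_cache_gb(56, 8, 128, seq_len=8192, batch_size=8)
print(f"KV cache for 8 sequences of 8K tokens: {needed:.1f} GB")
```

Treat the result as a lower bound: serving frameworks allocate the cache in blocks and benefit from spare room for additional concurrent sequences, so a larger budget than the bare minimum is reasonable.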
Recommended Configurations on io.net
| MoE Model | Precision | GPU Config | io.net Cost/hr | Monthly (24/7) |
|---|---|---|---|---|
| Mixtral 8x7B | FP16 | 2x H100 80GB | $4.98 | $3,586 |
| Mixtral 8x7B | INT4 (AWQ) | 1x H100 80GB | $2.49 | $1,793 |
| Mixtral 8x22B | FP16 | 4x H100 80GB | $9.96 | $7,171 |
| Mixtral 8x22B | INT4 (AWQ) | 2x H100 80GB | $4.98 | $3,586 |
| DeepSeek V3 | FP8 | 16x H100 80GB (2 nodes) | $39.84 | $28,685 |
| DeepSeek V3 | INT4 | 8x H100 80GB | $19.92 | $14,342 |
Compare those to hyperscaler pricing where 8x H100 instances start at $32+/hr on AWS --- io.net delivers the same NVLink-connected hardware at a fraction of the price.
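The monthly figures in the table follow from the hourly rate and a 30-day month of continuous use. A small sketch for running your own numbers; the `utilization` parameter is a hypothetical addition for clusters that do not run around the clock.

```python
H100_HOURLY_USD = 2.49      # approximate io.net rate quoted in this guide
HOURS_PER_MONTH = 24 * 30   # the table assumes a 30-day month, running 24/7

def cluster_cost(gpu_count, hourly_rate=H100_HOURLY_USD, utilization=1.0):
    """Return (cost per hour, cost per month) in USD for a GPU cluster."""
    per_hour = gpu_count * hourly_rate
    per_month = per_hour * HOURS_PER_MONTH * utilization
    return per_hour, per_month

per_hour, per_month = cluster_cost(4)
print(f"4x H100: ${per_hour:.2f}/hr, ${per_month:,.0f}/month")
```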
NVLink vs. PCIe for MoE
Expert parallelism requires constant inter-GPU communication. The bandwidth gap between NVLink and PCIe is enormous:
| Interconnect | Bandwidth | Mixtral 8x22B Throughput | TTFT |
|---|---|---|---|
| NVLink 4 (H100 SXM) | 900 GB/s bidirectional | ~85 tokens/s | 42ms |
| PCIe Gen5 (H100 PCIe) | 128 GB/s bidirectional | ~35 tokens/s | 108ms |
Always request NVLink-connected GPUs on io.net for MoE workloads. The throughput difference pays for itself.
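For a feel of why the interconnect matters, the sketch below estimates the time one MoE layer spends moving activations between GPUs under expert parallelism. It is a back-of-the-envelope model under loud assumptions: experts are spread evenly, routing is uniform, each routed activation is sent out and its result sent back, and link latency, topology, and compute overlap are all ignored. The per-direction bandwidths are half the bidirectional figures in the table, and `d_model=6144` is an assumed hidden size.

```python
def all_to_all_ms(tokens, d_model, top_k, num_gpus, link_gb_per_s,
                  bytes_per_value=2):
    """Rough per-GPU time for one MoE layer's dispatch + combine traffic, in ms."""
    # With experts spread evenly, (num_gpus - 1) / num_gpus of the routed
    # activations land on a different GPU than the one holding the token.
    remote_fraction = (num_gpus - 1) / num_gpus
    # Each routed activation is sent to its expert and the result is sent back.
    total_bytes = 2 * tokens * top_k * d_model * bytes_per_value * remote_fraction
    per_gpu_bytes = total_bytes / num_gpus
    return per_gpu_bytes / (link_gb_per_s * 1e9) * 1e3

nvlink = all_to_all_ms(tokens=8192, d_model=6144, top_k=2, num_gpus=4, link_gb_per_s=450)
pcie = all_to_all_ms(tokens=8192, d_model=6144, top_k=2, num_gpus=4, link_gb_per_s=64)
print(f"NVLink: {nvlink:.2f} ms/layer, PCIe: {pcie:.2f} ms/layer")
```

Because this cost is paid in every MoE layer, the ratio between the two links compounds across the depth of the model.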
Step-by-Step: Deploying Mixtral 8x22B on io.net
Step 1: Provision Your Cluster
from ionet import Client

client = Client(api_key="your-key")
cluster = client.create_cluster(
    name="mixtral-8x22b-serving",
    gpu_type="H100_SXM",
    gpu_count=4,
    region="us-east",
    image="vllm/vllm-openai:v0.7.2",
    storage_gb=500,
    networking="nvlink"
)
Step 2: Launch vLLM with Tensor Parallelism
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --port 8000
Step 3: Verify and Benchmark
import time

import requests

url = "http://localhost:8000/v1/completions"
payload = {
    "model": "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "prompt": "Explain the benefits of sparse expert models for inference.",
    "max_tokens": 512,
    "temperature": 0.7,
}

start = time.perf_counter()
resp = requests.post(url, json=payload, timeout=300)
resp.raise_for_status()  # Fail loudly if the server is not ready or rejects the request
elapsed = time.perf_counter() - start

# Elapsed time includes prompt processing, so this is end-to-end speed for one request.
tokens = resp.json()["usage"]["completion_tokens"]
print(f"Generated {tokens} tokens in {elapsed:.2f}s = {tokens/elapsed:.0f} tok/s")
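A single request measures latency more than throughput. To see how the server behaves under load, you can issue several requests concurrently, as in this standard-library sketch. `send_completion` and `benchmark` are hypothetical helper names; the response shape (`usage.completion_tokens`) is the same one the script above reads.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def send_completion(url, payload, timeout=300):
    """POST one completion request; return the number of generated tokens."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def benchmark(send, num_requests=32, concurrency=8):
    """Call send() num_requests times with up to `concurrency` calls in flight."""
    def timed(_):
        start = time.perf_counter()
        tokens = send()
        return tokens, time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(num_requests)))
    wall = time.perf_counter() - wall_start

    latencies = sorted(lat for _, lat in results)
    total_tokens = sum(tok for tok, _ in results)
    return {
        "total_tokens": total_tokens,
        "tokens_per_s": total_tokens / wall,            # aggregate throughput
        "p50_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }
```

Against a running server you would call it as `benchmark(lambda: send_completion(url, payload))`, then repeat with different `concurrency` values to find where aggregate throughput stops improving.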
Optimizing MoE Inference Performance
Quantization for MoE
Quantization is even more impactful for MoE models than for dense models. GPU count is driven by total parameters rather than active parameters, so reducing precision directly shrinks the memory footprint that determines your GPU costs.
| Model | FP16 VRAM | INT4 VRAM | GPU Savings | Cost Reduction |
|---|---|---|---|---|
| Mixtral 8x7B | 93 GB (2 GPUs) | 24 GB (1 GPU) | 50% fewer GPUs | 50% |
| Mixtral 8x22B | 282 GB (4 GPUs) | 71 GB (1-2 GPUs) | 50-75% fewer GPUs | 50-75% |
| DeepSeek V3 | 1.34 TB (17 GPUs) | 336 GB (5 GPUs) | 70% fewer GPUs | 70% |
AWQ and GPTQ recipes for MoE models commonly keep the small router (gate) layers in higher precision, so expert routing decisions, a major determinant of output quality, are largely preserved. The quality impact of INT4 quantization on MoE models is typically a 1-2% drop on standard benchmarks --- small enough that most end users will not notice.
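The VRAM columns in the table can be reproduced with a few lines. This sketch counts weights only: it ignores quantization scales and zero-points, KV cache, and activations, so the GPU count it returns is the bare minimum to hold the weights, not a serving recommendation. The helper name is illustrative.

```python
import math

def quantized_footprint(total_params_b, bits_per_weight, gpu_gb=80):
    """Return (weight memory in GB, minimum GPU count to hold the weights)."""
    weight_gb = total_params_b * bits_per_weight / 8
    return weight_gb, math.ceil(weight_gb / gpu_gb)

for name, params_b in [("Mixtral 8x7B", 46.7), ("Mixtral 8x22B", 141), ("DeepSeek V3", 671)]:
    fp16_gb, fp16_gpus = quantized_footprint(params_b, 16)
    int4_gb, int4_gpus = quantized_footprint(params_b, 4)
    print(f"{name}: FP16 {fp16_gb:.0f} GB ({fp16_gpus} GPUs), "
          f"INT4 {int4_gb:.0f} GB ({int4_gpus} GPUs)")
```

When a model only just fits (for example about 71 GB of weights on an 80 GB card), plan for one more GPU so the KV cache has room.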
Expert Offloading
For extremely large models, offload inactive experts to CPU memory and load them on-demand:
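import torch
import deepspeed

# `model` is an already-loaded PyTorch MoE model.
# NOTE: the two moe_experts_* arguments illustrate the idea only. They are not
# part of the documented deepspeed.init_inference options as far as we know,
# so check what your DeepSpeed version supports before relying on them.
model = deepspeed.init_inference(
    model,
    mp_size=4,                        # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    moe_experts_offload=True,         # keep inactive experts in CPU memory
    moe_experts_in_gpu=16,            # number of experts resident on GPU
)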
This trades latency for cost. Useful for batch processing but not for real-time serving.
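The latency cost comes from cache misses: every time a token needs an expert that is not on the GPU, its weights must be copied over first. The sketch below models that with a least-recently-used cache. It is a conceptual illustration, not how any particular framework implements offloading; `ExpertCache` and `load_fn` are hypothetical names, and `load_fn` stands in for the host-to-GPU copy.

```python
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident; evict the least recently used."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical: copies weights host -> GPU
        self.resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)     # mark as most recently used
        else:
            self.misses += 1                         # slow path: load from CPU
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)    # evict least recently used
            self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]

cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1, 0]:
    cache.get(expert_id)
print(f"hits={cache.hits} misses={cache.misses}")
```

Replaying a logged routing trace through a cache like this gives a quick estimate of the hit rate you would get for a given number of GPU-resident experts.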
Expert Load Balancing Strategies
Uneven expert activation wastes GPU resources. Consider:
- Capacity factor tuning: Limit the maximum tokens per expert to prevent hotspots
- Expert replication: Duplicate frequently-used experts across multiple GPUs
- Dynamic routing thresholds: Adjust router confidence thresholds to spread load more evenly
- Auxiliary balance loss: During fine-tuning, add a loss term that penalizes imbalanced expert usage
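Two of these strategies reduce to short formulas. The sketch below shows a capacity limit per expert and a Switch-Transformer-style auxiliary balance loss. The default `capacity_factor=1.25` is an assumed example value, and real training code would compute the loss on tensors inside the framework rather than on Python lists.

```python
import math

def expert_capacity(num_tokens, num_experts, top_k, capacity_factor=1.25):
    """Maximum (token, expert) assignments one expert accepts per batch."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def balance_loss(dispatch_fraction, mean_router_prob):
    """Auxiliary loss of the form N * sum_i f_i * P_i.

    dispatch_fraction[i]: share of tokens sent to expert i (f_i).
    mean_router_prob[i]:  average router probability for expert i (P_i).
    Equals 1.0 when both are uniform and grows as load concentrates.
    """
    n = len(dispatch_fraction)
    return n * sum(f * p for f, p in zip(dispatch_fraction, mean_router_prob))

print(expert_capacity(num_tokens=1024, num_experts=8, top_k=2))
print(balance_loss([0.25] * 4, [0.25] * 4))
print(balance_loss([0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]))
```

A capacity factor of 1.0 allows exactly the even share per expert; values above 1.0 leave slack so fewer tokens are dropped or rerouted when routing is uneven.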
Cost Comparison: MoE vs. Dense Models
Quality-Adjusted Cost Analysis
| Model Type | Model | GPUs (io.net) | Cost/hr | MMLU Score | Cost per Quality Point |
|---|---|---|---|---|---|
| Dense | Llama 3.1 70B | 2x H100 | $4.98 | 82.0 | $0.061 |
| MoE (FP16) | Mixtral 8x22B | 4x H100 | $9.96 | 77.8 | $0.128 |
| MoE (INT4) | Mixtral 8x22B | 2x H100 | $4.98 | 76.5 | $0.065 |
| Large MoE (INT4) | DeepSeek V3 | 8x H100 | $19.92 | 87.1 | $0.229 |
Quantized MoE models are competitive with dense models on a cost-per-quality basis. The real advantage of MoE emerges at the frontier, where DeepSeek V3 achieves quality levels no dense model can match at similar serving cost.
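The last column is simply hourly cost divided by MMLU score. A sketch that recomputes it for the first three rows, so you can substitute your own benchmark or pricing; the function name is illustrative.

```python
def cost_per_quality_point(cost_per_hour, benchmark_score):
    """Hourly serving cost divided by benchmark score (lower is better)."""
    return cost_per_hour / benchmark_score

rows = [("Llama 3.1 70B", 4.98, 82.0),
        ("Mixtral 8x22B FP16", 9.96, 77.8),
        ("Mixtral 8x22B INT4", 4.98, 76.5)]
for name, cost, mmlu in rows:
    print(f"{name}: ${cost_per_quality_point(cost, mmlu):.3f} per MMLU point per hour")
```

This metric says nothing about throughput. If two deployments cost the same per hour but one serves more tokens per second, cost per million tokens is the fairer comparison.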
When MoE Wins on Cost
MoE delivers cost advantages when:
- You need quality above what 70B dense models can provide
- Your workload is throughput-oriented (high batch sizes)
- You can use INT4 quantization (most production use cases)
- You have NVLink-connected GPUs (io.net provides this)
When Dense Wins
Dense models are more cost-effective when:
- 70B-class quality is sufficient for your application
- Latency is the dominant constraint (dense models have simpler computation graphs)
- You prefer single-GPU simplicity (7B-13B dense models on one A100)
Expert Parallelism vs. Tensor Parallelism for MoE
| Strategy | Latency | Throughput | Communication | Best For |
|---|---|---|---|---|
| Tensor Parallelism only | Lower | Lower | All-reduce per layer | Real-time chat |
| Expert Parallelism only | Higher | Higher | All-to-all per layer | Batch processing |
| TP + EP (hybrid) | Balanced | Balanced | Both patterns | Production serving |
For most io.net deployments with 2-4 NVLink-connected H100s, pure tensor parallelism provides the best balance of latency and throughput.
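Under expert parallelism, each GPU owns a subset of the experts, and a token pays a communication hop for every selected expert that lives elsewhere. The sketch below uses a simple contiguous placement to show the idea; real frameworks choose their own layouts, and both function names are illustrative.

```python
def place_experts(num_experts, num_gpus):
    """Contiguous expert-parallel placement: expert index -> GPU rank."""
    if num_experts % num_gpus:
        raise ValueError("num_experts must be divisible by num_gpus")
    per_gpu = num_experts // num_gpus
    return {e: e // per_gpu for e in range(num_experts)}

def remote_hops(selected_experts, home_gpu, placement):
    """Count selected experts that live on a GPU other than the token's own."""
    return sum(1 for e in selected_experts if placement[e] != home_gpu)

placement = place_experts(num_experts=8, num_gpus=4)
print(placement)
print(remote_hops([1, 6], home_gpu=0, placement=placement))
```

With tensor parallelism, by contrast, every expert is sliced across all GPUs, so there are no per-token hops but every layer ends with an all-reduce.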
Advanced: Building a MoE Serving Pipeline
Auto-Scaling MoE Inference
# Kubernetes HPA configuration for MoE serving
# Requires a custom metrics adapter that exposes gpu_utilization per pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: moe-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mixtral-serving
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"
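To predict how this configuration will react, it helps to know the core rule the autoscaler applies: desired replicas = ceil(current replicas x current metric / target metric), clamped to the configured bounds. The sketch below implements only that rule; the real controller also applies a tolerance band and stabilization windows, which are omitted here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=8):
    """Core HPA rule: ceil(current * currentMetric / targetMetric), clamped."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# Average GPU utilization of 90 against a target of 75, with 2 replicas running.
print(desired_replicas(current_replicas=2, current_metric=90, target_metric=75))
```

Each new MoE replica has to load hundreds of gigabytes of weights before it can serve traffic, so scale-up is slow; keep `minReplicas` high enough to absorb bursts while new pods start.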
Model Warm-Up and Health Checks
MoE models benefit from warm-up requests that exercise all expert paths:
import itertools

import requests

def warmup_moe_model(endpoint, model, num_requests=50):
    """Send diverse prompts so a broad range of experts is exercised."""
    diverse_prompts = [
        "Explain quantum computing",
        "Write Python code for sorting",
        "Translate English to French: Hello world",
        "Summarize the history of the internet",
        # ... diverse topics to activate different expert paths
    ]
    # Cycle through the prompts so num_requests is honored even when the list is short.
    for prompt in itertools.islice(itertools.cycle(diverse_prompts), num_requests):
        resp = requests.post(f"{endpoint}/v1/completions", json={
            "model": model, "prompt": prompt, "max_tokens": 10
        }, timeout=60)
        resp.raise_for_status()
    print(f"Warm-up complete: sent {num_requests} requests")

Frequently Asked Questions
Which MoE models can I host on io.net?
Any MoE model available on HuggingFace or as custom weights. Popular choices: Mixtral 8x7B, Mixtral 8x22B, DeepSeek V2/V3, DBRX, and custom fine-tuned variants.
How many GPUs for Mixtral 8x22B?
FP16: 4x H100 80GB ($9.96/hr on io.net). INT4: 1-2x H100 80GB ($2.49-$4.98/hr). INT4 typically scores within 1-2% of full precision on standard benchmarks.
Is MoE inference faster than dense?
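Per token, an MoE model does less compute than a dense model with the same total parameter count, so decoding is typically faster than for a dense model of that size. End-to-end throughput still depends on memory bandwidth, expert routing, and inter-GPU communication. In favorable cases (INT4, large batches, NVLink), optimized MoE serving can be 30-50% more cost-efficient than a dense model of comparable accuracy, but the margin varies: in the quality-adjusted table above, quantized Mixtral 8x22B is roughly at parity with Llama 3.1 70B.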
Can I fine-tune MoE models on io.net?
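Yes. Plan for at least 2x the inference GPU memory for parameter-efficient fine-tuning. Full fine-tuning needs considerably more, because gradients and optimizer state must be stored for every parameter. LoRA and QLoRA reduce memory requirements significantly.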
What serving framework is best?
vLLM 0.7+ has excellent MoE support. TensorRT-LLM offers optimized CUDA kernels for MoE. SGLang handles structured generation well with MoE architectures.
How does expert routing affect latency?
The routing decision adds less than 1ms. The latency impact comes from inter-GPU communication when selected experts are on different GPUs. NVLink on io.net minimizes this overhead.
Should I use MoE or dense for my use case?
MoE wins when you need frontier quality at high throughput. Dense wins when latency dominates and 70B-class quality suffices.
What about training custom MoE models?
Training MoE from scratch requires careful expert initialization and load balancing. Most teams start with a pre-trained MoE and fine-tune. Megatron-LM and DeepSpeed both support MoE training on io.net clusters.
Getting Started
- Start with Mixtral 8x7B in INT4 on a single H100 ($2.49/hr) to validate your pipeline
- Scale to Mixtral 8x22B or DeepSeek V3 as quality requirements grow
- Implement quantization early --- the GPU savings are dramatic for MoE
- Always use NVLink-connected GPUs for multi-GPU MoE deployments
- Monitor expert utilization and rebalance if hotspots appear
MoE architectures are the future of efficient large-scale AI. The hosting challenge is real, but io.net's combination of affordable H100s and flexible NVLink configurations makes it tractable for any team.
Deploy your first MoE model on io.net today. Get started at io.net.