Mixture of Experts (MoE) architectures have become a dominant design pattern for frontier language models. Mixtral, DeepSeek V3, and Grok all use some variant of sparse expert routing, and GPT-4 is widely reported to do the same. The appeal is straightforward: MoE models can reach the quality of dense models with roughly 5-10x their active parameter count, while activating only a fraction of their total weights per token.

But hosting MoE models introduces its own set of infrastructure challenges. A model with 600 billion total parameters but only 40 billion active per forward pass still needs enough memory to store all 600 billion parameters. The expert routing logic creates unpredictable memory access patterns. And the cost math works differently than it does for dense models.

io.net's GPU marketplace is well-suited for MoE workloads because it offers flexible multi-GPU configurations with high-bandwidth interconnects at competitive prices. H100 80GB GPUs at approximately $2.49/hr let you build MoE serving clusters at a fraction of hyperscaler costs.

This guide covers how MoE models work under the hood, what they demand from infrastructure, how to deploy them efficiently, and how to keep costs under control.

How MoE Models Work Under the Hood

Before we talk infrastructure, it helps to understand the mechanics of MoE, because they drive every hosting decision that follows.

The Core Mechanism

A standard transformer layer processes every token through the same feed-forward network (FFN). An MoE layer replaces that single FFN with N parallel "expert" networks and a small router that decides which experts activate for each token.

Typically, only a few of the N experts activate per token: 2 of 8 in Mixtral, 8 of 256 in DeepSeek V3. The rest sit idle in memory. This is what makes MoE "sparse": the total parameter count is large, but the compute per token is modest.
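The routing step can be sketched in a few lines of plain Python. This is a toy illustration, not any model's actual implementation: the function names, the eight stand-in "experts", and the logits are all made up for the example. It assumes the common top-k scheme in which the router's probabilities for the selected experts are renormalized to sum to 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=2):
    """Return [(expert_index, gate_weight), ...] for one token."""
    probs = softmax(router_logits)
    # Keep the top_k experts with the highest router probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected gate weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, router_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route_token(router_logits, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup: 8 stand-in "experts" that scale the input by (index + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

print(route_token(logits))  # experts 1 and 4 have the highest logits
print(moe_layer([1.0, 2.0], logits, experts))
```

Only two of the eight expert functions are called for this token, which is the source of the compute savings. All eight still have to exist in memory, which is the source of the hosting cost.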

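| Model | Total Params | Active Params | Experts | Active/Token | VRAM (FP16) |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | ~93 GB |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | ~282 GB |
| DeepSeek V3 | 671B | 37B | 256 | 8 | ~1.34 TB |
| DBRX | 132B | 36B | 16 | 4 | ~264 GB |
| Grok-1 | 314B | ~86B | 8 | 2 | ~628 GB |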

The takeaway: MoE models need memory proportional to their total parameter count, but compute proportional to their active parameter count.

Why MoE Is Challenging to Host

Memory wall: Mixtral 8x22B needs 282 GB in FP16. That is 4 H100 80GB GPUs just for weights, before accounting for KV cache. A dense model of comparable quality (Llama 70B, about 140 GB in FP16) fits on 2 GPUs.

Irregular memory access: Expert routing sends different tokens to different experts, creating non-sequential memory access patterns that are harder for GPU memory controllers to prefetch and optimize.

Load imbalance: Some experts get activated far more than others. This creates hotspots that bottleneck overall throughput.

Inter-GPU communication: When experts reside on different GPUs (expert parallelism), tokens must traverse the NVLink or InfiniBand fabric to reach the right expert. Bandwidth becomes the critical path.
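To get a feel for why bandwidth becomes the critical path, here is a rough back-of-the-envelope sketch of all-to-all traffic for a single MoE layer under expert parallelism. It rests on several assumptions that are not from the text above: routing is uniform, experts are spread evenly across GPUs, activations are FP16, and the hidden size of 6144 is an assumed Mixtral 8x22B-like value to check against the model's config. The function names are hypothetical, and the timing is an idealized lower bound that ignores latency and contention.

```python
def all_to_all_bytes(tokens, hidden_size, top_k, num_gpus, bytes_per_value=2):
    """Approximate bytes crossing the GPU fabric for ONE MoE layer.

    Assumes experts are spread evenly and routing is uniform, so a fraction
    (num_gpus - 1) / num_gpus of expert calls land on a remote GPU.
    """
    remote_fraction = (num_gpus - 1) / num_gpus
    one_way = tokens * top_k * hidden_size * bytes_per_value * remote_fraction
    return 2 * one_way  # dispatch to the expert, then combine the result back


def transfer_ms(num_bytes, bandwidth_gb_s):
    """Idealized transfer time in milliseconds (ignores latency and contention)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Hypothetical batch: 4096 tokens in flight, hidden size 6144 (assumed),
# top-2 routing, experts spread over 4 GPUs.
layer_bytes = all_to_all_bytes(tokens=4096, hidden_size=6144, top_k=2, num_gpus=4)
for name, bandwidth in [("NVLink 4", 900), ("PCIe Gen5", 128)]:
    ms = transfer_ms(layer_bytes, bandwidth)
    print(f"{name}: {layer_bytes / 1e6:.0f} MB per layer, ~{ms:.2f} ms per layer")
```

The per-layer cost repeats for every MoE layer in the model, so even small per-layer differences between interconnects add up over a full forward pass.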

Infrastructure Requirements

GPU Memory Planning

The first step is calculating how much VRAM you need:

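import math

def moe_vram_estimate(total_params_b, precision_bytes, kv_cache_gb, overhead=1.15):
    """Estimate total VRAM (GB) for MoE model hosting.

    All experts must stay resident, so weights scale with TOTAL parameters.
    """
    model_gb = total_params_b * precision_bytes  # billions of params x bytes per param = GB
    total = (model_gb + kv_cache_gb) * overhead  # overhead: headroom for activations, fragmentation
    return total

# Mixtral 8x22B in FP16 with 8K context, batch 8
vram = moe_vram_estimate(141, 2, kv_cache_gb=40)
gpus_needed = math.ceil(vram / 80)  # Ceiling division over 80 GB GPUs
# With these conservative inputs this prints 370 GB -> 5 GPUs;
# the 4x H100 configuration below runs with far less headroom.
print(f"Need {vram:.0f} GB VRAM -> {gpus_needed} H100 GPUs")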
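The `kv_cache_gb` input above has to come from somewhere. A minimal sketch of the usual estimate follows, assuming an FP16 cache and a Mixtral 8x22B-like attention shape (56 layers, 8 key/value heads, head dimension 128). Those shape values are assumptions to verify against the model's `config.json`, and `kv_cache_gb_estimate` is a hypothetical helper name.

```python
def kv_cache_gb_estimate(num_layers, num_kv_heads, head_dim,
                         context_len, batch_size, bytes_per_value=2):
    """KV cache size in GB: keys and values for every layer, token, and sequence."""
    # Factor of 2 = one key vector plus one value vector per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_len * batch_size / 1e9


# Assumed Mixtral 8x22B-like shape, 8K context, batch of 8 sequences.
kv_gb = kv_cache_gb_estimate(num_layers=56, num_kv_heads=8, head_dim=128,
                             context_len=8192, batch_size=8)
print(f"KV cache: {kv_gb:.1f} GB")
```

Under these assumptions the cache comes to about 15 GB, so the 40 GB passed to the estimator above is a conservative budget that leaves room for longer contexts or larger batches.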

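| MoE Model | Precision | GPU Config | io.net Cost/hr | Monthly (24/7) |
|---|---|---|---|---|
| Mixtral 8x7B | FP16 | 2x H100 80GB | $4.98 | $3,586 |
| Mixtral 8x7B | INT4 (AWQ) | 1x H100 80GB | $2.49 | $1,793 |
| Mixtral 8x22B | FP16 | 4x H100 80GB | $9.96 | $7,171 |
| Mixtral 8x22B | INT4 (AWQ) | 2x H100 80GB | $4.98 | $3,586 |
| DeepSeek V3 | FP8 | 16x H100 80GB (2 nodes) | $39.84 | $28,685 |
| DeepSeek V3 | INT4 | 8x H100 80GB | $19.92 | $14,342 |

DeepSeek V3 weights alone take about 671 GB in FP8 and about 336 GB in INT4. That exceeds the 640 GB of an 8x H100 node and the 320 GB of a 4x H100 node respectively, so those rows use the next standard cluster size up.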

Compare those to hyperscaler pricing, where 8x H100 instances start at $32+/hr on AWS. io.net delivers the same class of NVLink-connected hardware at a fraction of the price.
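The monthly figures follow directly from the per-GPU hourly rate. As a small sketch, assuming the approximate $2.49/hr H100 price quoted in this guide and a 30-day month of continuous use (the helper name is hypothetical):

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, i.e. running 24/7 for a 30-day month


def cluster_cost(gpu_count, price_per_gpu_hour=2.49):
    """Return (hourly, monthly) cost in dollars for an always-on cluster."""
    hourly = gpu_count * price_per_gpu_hour
    return hourly, hourly * HOURS_PER_MONTH


for gpus in (1, 2, 4):
    hourly, monthly = cluster_cost(gpus)
    print(f"{gpus}x H100: ${hourly:.2f}/hr, ${monthly:,.0f}/month")
```

Swapping in a different hourly price or a lower utilization (for example, 8 hours a day) is a one-line change, which is useful when comparing an always-on deployment against scale-to-zero batch jobs.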

Expert parallelism requires constant inter-GPU communication. The bandwidth gap between NVLink and PCIe is enormous:

| Interconnect | Bandwidth | Mixtral 8x22B Throughput | TTFT |
|---|---|---|---|
| NVLink 4 (H100 SXM) | 900 GB/s bidirectional | ~85 tokens/s | 42 ms |
| PCIe Gen5 (H100 PCIe) | 128 GB/s bidirectional | ~35 tokens/s | 108 ms |

Always request NVLink-connected GPUs on io.net for MoE workloads. The throughput difference pays for itself.

Step-by-Step: Deploying Mixtral 8x22B on io.net

Step 1: Provision Your Cluster

from ionet import Client

client = Client(api_key="your-key")
cluster = client.create_cluster(
    name="mixtral-8x22b-serving",
    gpu_type="H100_SXM",
    gpu_count=4,
    region="us-east",
    image="vllm/vllm-openai:v0.7.2",
    storage_gb=500,
    networking="nvlink"
)

Step 2: Launch vLLM with Tensor Parallelism

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --port 8000

Step 3: Verify and Benchmark

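import time

import requests

url = "http://localhost:8000/v1/completions"
payload = {
    "model": "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "prompt": "Explain the benefits of sparse expert models for inference.",
    "max_tokens": 512,
    "temperature": 0.7,
}

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
resp = requests.post(url, json=payload, timeout=300)
resp.raise_for_status()  # fail loudly if the server is not ready or rejects the request
elapsed = time.perf_counter() - start

tokens = resp.json()["usage"]["completion_tokens"]
# Single-request, end-to-end rate (includes prefill), not peak batched throughput
print(f"Generated {tokens} tokens in {elapsed:.2f}s = {tokens/elapsed:.0f} tok/s")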

Optimizing MoE Inference Performance

Quantization for MoE

Quantization is even more impactful for MoE models than for dense models, because every expert must stay resident in memory whether or not it is active. Reducing precision directly shrinks the memory footprint that drives your GPU costs.

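| Model | FP16 VRAM (min. GPUs) | INT4 VRAM (min. GPUs) | GPU Savings | Cost Reduction |
|---|---|---|---|---|
| Mixtral 8x7B | 93 GB (2 GPUs) | 24 GB (1 GPU) | 50% fewer GPUs | 50% |
| Mixtral 8x22B | 282 GB (4 GPUs) | 71 GB (1 GPU) | 75% fewer GPUs | 75% |
| DeepSeek V3 | 1.34 TB (17 GPUs) | 336 GB (5 GPUs) | 70% fewer GPUs | 70% |

GPU counts in this table cover weights only, on 80 GB GPUs. KV cache and activations can push a real deployment to the next size up, which is why Mixtral 8x22B in INT4 is listed on 2 GPUs elsewhere in this guide.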

AWQ and GPTQ quantization compress the expert weights while largely preserving expert routing decisions, which strongly influence output quality. The quality impact of INT4 quantization on MoE models is typically 1-2% on standard benchmarks, a gap most end users will not notice, though it is worth confirming on your own evaluation set.
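The VRAM and GPU-count columns in the table above can be reproduced with simple arithmetic. The sketch below counts weights only and assumes exactly 16 or 4 bits per parameter. Real INT4 checkpoints are somewhat larger, because quantization scales and any layers left unquantized add overhead. The helper names are hypothetical.

```python
import math


def weights_gb(total_params_b, bits_per_weight):
    """Weight footprint in GB for a model with total_params_b billion parameters."""
    return total_params_b * bits_per_weight / 8


def min_gpus(footprint_gb, gpu_gb=80):
    """Smallest number of GPUs whose combined memory holds the weights alone."""
    return math.ceil(footprint_gb / gpu_gb)


models = [("Mixtral 8x7B", 46.7), ("Mixtral 8x22B", 141), ("DeepSeek V3", 671)]
for name, params_b in models:
    fp16 = weights_gb(params_b, 16)
    int4 = weights_gb(params_b, 4)
    print(f"{name}: FP16 {fp16:.0f} GB ({min_gpus(fp16)} GPUs), "
          f"INT4 {int4:.0f} GB ({min_gpus(int4)} GPUs)")
```

Because GPU counts move in whole units, the savings are stepwise: a 4x reduction in bytes does not always translate into a 4x reduction in GPUs.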

Expert Offloading

For extremely large models, offload inactive experts to CPU memory and load them on-demand:

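import deepspeed
import torch

# `model` is an already-loaded torch.nn.Module (for example, a Hugging Face MoE checkpoint)
model = deepspeed.init_inference(
    model,
    mp_size=4,                        # tensor-parallel degree across 4 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    # Expert-offload settings: confirm these keyword names against the
    # DeepSpeed version you have installed before relying on them.
    moe_experts_offload=True,         # keep inactive experts in CPU memory
    moe_experts_in_gpu=16,            # number of experts held on GPU at once
)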

This trades latency for cost. Useful for batch processing but not for real-time serving.

Expert Load Balancing Strategies

Uneven expert activation wastes GPU resources. Consider:

  1. Capacity factor tuning: Limit the maximum tokens per expert to prevent hotspots
  2. Expert replication: Duplicate frequently-used experts across multiple GPUs
  3. Dynamic routing thresholds: Adjust router confidence thresholds to spread load more evenly
  4. Auxiliary balance loss: During fine-tuning, add a loss term that penalizes imbalanced expert usage
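Two of the strategies above reduce to short formulas. The sketch below shows a per-expert capacity limit and a Switch-Transformer-style auxiliary balance loss, N * sum(f_i * P_i), where f_i is the fraction of routed tokens sent to expert i and P_i is the mean router probability for expert i. It is written in plain Python for readability. The function names and the default capacity factor of 1.25 are illustrative assumptions, and a real fine-tuning setup would compute the loss on tensors inside the training framework.

```python
import math


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor=1.25):
    """Maximum number of token slots each expert accepts per batch."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def load_balance_loss(router_probs, assignments, num_experts):
    """Auxiliary loss: equals 1.0 when perfectly balanced, grows as load concentrates.

    router_probs: one list of per-expert probabilities per token
    assignments:  one list of chosen expert indices per token
    """
    num_tokens = len(router_probs)
    total_assigned = sum(len(chosen) for chosen in assignments)

    # f[i]: fraction of all routed token slots that went to expert i
    f = [0.0] * num_experts
    for chosen in assignments:
        for e in chosen:
            f[e] += 1.0 / total_assigned

    # p[i]: mean router probability given to expert i across tokens
    p = [sum(tok[e] for tok in router_probs) / num_tokens for e in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balance_loss([[0.25] * 4] * 4, [[0], [1], [2], [3]], num_experts=4)
skewed = load_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [[0]] * 4, num_experts=4)
print(f"balanced={balanced:.2f}, skewed={skewed:.2f}")
print("capacity per expert:", expert_capacity(4096, 8, 2))
```

The loss is added to the training objective with a small weight, so the router is nudged toward even usage without overriding the main task loss.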

Cost Comparison: MoE vs. Dense Models

Quality-Adjusted Cost Analysis

| Model Type | Model | GPUs (io.net) | Cost/hr | MMLU Score | Cost per Quality Point |
|---|---|---|---|---|---|
| Dense | Llama 3.1 70B | 2x H100 | $4.98 | 82.0 | $0.061 |
| MoE (FP16) | Mixtral 8x22B | 4x H100 | $9.96 | 77.8 | $0.128 |
| MoE (INT4) | Mixtral 8x22B | 2x H100 | $4.98 | 76.5 | $0.065 |
| Large MoE (INT4) | DeepSeek V3 | 8x H100 | $19.92 | 87.1 | $0.229 |

Quantized MoE models are competitive with dense models on a cost-per-quality basis: INT4 Mixtral 8x22B lands within half a cent of Llama 3.1 70B per MMLU point. The real advantage of MoE emerges at the frontier, where DeepSeek V3 reaches quality levels that 70B-class dense models do not.
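The "cost per quality point" column is simply the hourly cost divided by the benchmark score. A minimal sketch using the Llama and Mixtral rows from the table follows. The helper name is hypothetical, and the metric is a rough heuristic rather than a standard measure.

```python
def cost_per_quality_point(cost_per_hour, benchmark_score):
    """Hourly serving cost divided by benchmark score (lower is better)."""
    return cost_per_hour / benchmark_score


rows = [
    ("Llama 3.1 70B (dense)", 4.98, 82.0),
    ("Mixtral 8x22B (FP16)", 9.96, 77.8),
    ("Mixtral 8x22B (INT4)", 4.98, 76.5),
]
for name, cost, mmlu in rows:
    print(f"{name}: ${cost_per_quality_point(cost, mmlu):.3f} per MMLU point per hour")
```

The metric ignores throughput, so two deployments with the same ratio can still differ widely in cost per generated token. Pair it with a measured tokens-per-second figure before making a decision.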

When MoE Wins on Cost

MoE delivers cost advantages when:

  • You need quality above what 70B dense models can provide
  • Your workload is throughput-oriented (high batch sizes)
  • You can use INT4 quantization (most production use cases)
  • You have NVLink-connected GPUs (io.net provides this)

When Dense Wins

Dense models are more cost-effective when:

  • 70B-class quality is sufficient for your application
  • Latency is the dominant constraint (dense models have simpler computation graphs)
  • You prefer single-GPU simplicity (7B-13B dense models on one A100)

Expert Parallelism vs. Tensor Parallelism for MoE

| Strategy | Latency | Throughput | Communication | Best For |
|---|---|---|---|---|
| Tensor Parallelism only | Lower | Lower | All-reduce per layer | Real-time chat |
| Expert Parallelism only | Higher | Higher | All-to-all per layer | Batch processing |
| TP + EP (hybrid) | Balanced | Balanced | Both patterns | Production serving |

For most io.net deployments with 2-4 NVLink-connected H100s, pure tensor parallelism provides the best balance of latency and throughput.
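To make the difference concrete: under expert parallelism each GPU holds whole experts, so a token only visits the GPUs that own its selected experts. Under tensor parallelism every expert is sliced across all GPUs, so every GPU takes part in every token. The sketch below illustrates the expert-parallel case with a simple round-robin placement. The placement policy and function names are assumptions made for the example, not how any particular framework assigns experts.

```python
def place_experts(num_experts, num_gpus):
    """Round-robin placement: returns {gpu_index: [expert indices]}."""
    placement = {gpu: [] for gpu in range(num_gpus)}
    for expert in range(num_experts):
        placement[expert % num_gpus].append(expert)
    return placement


def gpus_touched(selected_experts, num_gpus):
    """GPUs a token must visit under the round-robin placement above."""
    return sorted({expert % num_gpus for expert in selected_experts})


placement = place_experts(num_experts=8, num_gpus=4)
print(placement)
# A token routed to experts 1 and 4 only needs the GPUs that own them.
print(gpus_touched([1, 4], num_gpus=4))
```

With only 2-4 GPUs and top-2 routing, most tokens touch a large share of the GPUs anyway, which is one reason pure tensor parallelism remains a reasonable default at that scale.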

Advanced: Building a MoE Serving Pipeline

Auto-Scaling MoE Inference

# Kubernetes HPA configuration for MoE serving
# Note: gpu_utilization is a custom per-pod metric, so the cluster needs a
# custom metrics adapter that exposes it to the HPA.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: moe-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mixtral-serving
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "75"

Model Warm-Up and Health Checks

MoE models benefit from warm-up requests that exercise all expert paths:

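import itertools

import requests

def warmup_moe_model(endpoint, model, num_requests=50):
    """Send diverse prompts so that a broad range of experts gets exercised."""
    diverse_prompts = [
        "Explain quantum computing",
        "Write Python code for sorting",
        "Translate English to French: Hello world",
        "Summarize the history of the internet",
        # ... diverse topics to activate different expert paths
    ]
    # Cycle through the prompts so that num_requests requests are actually sent,
    # even when the prompt list is shorter than num_requests.
    for prompt in itertools.islice(itertools.cycle(diverse_prompts), num_requests):
        resp = requests.post(
            f"{endpoint}/v1/completions",
            json={"model": model, "prompt": prompt, "max_tokens": 10},
            timeout=60,
        )
        resp.raise_for_status()
    print(f"Warm-up complete: sent {num_requests} requests")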

Frequently Asked Questions

Which MoE models can I host on io.net?

Any MoE model available on HuggingFace or as custom weights. Popular choices: Mixtral 8x7B, Mixtral 8x22B, DeepSeek V2/V3, DBRX, and custom fine-tuned variants.

How many GPUs for Mixtral 8x22B?

FP16: 4x H100 80GB ($9.96/hr on io.net). INT4: 1-2x H100 80GB ($2.49-$4.98/hr). INT4 typically stays within 1-2% of FP16 on standard benchmarks.

Is MoE inference faster than dense?

Per token, MoE activates fewer parameters than a dense model of the same total size, so per-token compute is lower. Total throughput still depends on memory bandwidth and expert routing overhead. A well-optimized MoE deployment on io.net can be 30-50% more cost-efficient than a dense model of comparable accuracy, depending on batch size and quantization.

Can I fine-tune MoE models on io.net?

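Yes. Plan for at least 2x the GPU memory compared to inference. Full fine-tuning needs considerably more, because gradients and optimizer states are stored alongside the weights. LoRA and QLoRA reduce memory requirements significantly.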

What serving framework is best?

vLLM 0.7+ has excellent MoE support. TensorRT-LLM offers optimized CUDA kernels for MoE. SGLang handles structured generation well with MoE architectures.

How does expert routing affect latency?

The routing decision adds less than 1ms. The latency impact comes from inter-GPU communication when selected experts are on different GPUs. NVLink on io.net minimizes this overhead.

Should I use MoE or dense for my use case?

MoE wins when you need frontier quality at high throughput. Dense wins when latency dominates and 70B-class quality suffices.

What about training custom MoE models?

Training MoE from scratch requires careful expert initialization and load balancing. Most teams start with a pre-trained MoE and fine-tune. Megatron-LM and DeepSpeed both support MoE training on io.net clusters.

Getting Started

  1. Start with Mixtral 8x7B in INT4 on a single H100 ($2.49/hr) to validate your pipeline
  2. Scale to Mixtral 8x22B or DeepSeek V3 as quality requirements grow
  3. Implement quantization early --- the GPU savings are dramatic for MoE
  4. Always use NVLink-connected GPUs for multi-GPU MoE deployments
  5. Monitor expert utilization and rebalance if hotspots appear
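For the monitoring step, one simple signal is the ratio of the busiest expert's load to the average load. The sketch below assumes you can log which expert each routed token went to. How to obtain that log depends on your serving framework and is not covered here. The function name is hypothetical, and any alert threshold is a starting point to tune.

```python
from collections import Counter


def expert_load_report(assignments, num_experts):
    """Return (per-expert token counts, imbalance ratio = max load / mean load)."""
    counts = Counter(assignments)
    loads = [counts.get(expert, 0) for expert in range(num_experts)]
    mean_load = sum(loads) / num_experts
    imbalance = max(loads) / mean_load if mean_load else 0.0
    return loads, imbalance


# Hypothetical log of expert choices for 8 routed tokens across 4 experts.
loads, imbalance = expert_load_report([0, 0, 0, 1, 2, 3, 0, 0], num_experts=4)
print(f"loads={loads}, imbalance={imbalance:.2f}")
```

A ratio near 1.0 means load is spread evenly. A ratio that stays well above 1.0 over time suggests a hotspot worth addressing with expert replication or capacity tuning.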

MoE architectures are the future of efficient large-scale AI. The hosting challenge is real, but io.net's combination of affordable H100s and flexible NVLink configurations makes it tractable for any team.


Deploy your first MoE model on io.net today. Get started at io.net.