Mixture of Experts (MoE) models such as Mixtral, DeepSeek-V2, and Switch Transformer (and, reportedly, the architecture behind GPT-4) are changing how people think about model scaling. They activate only a fraction of their parameters per token, so a 45B-parameter MoE model can run almost as fast as a dense 12B model while delivering quality closer to a dense 45B model.

But MoE infrastructure has quirks that catch people off guard. The models are huge on disk (all parameters exist even if only some activate), memory requirements are unusual (you need enough VRAM for all expert weights even though most are idle per token), and training requires careful load balancing across GPUs.

Understanding MoE Memory Requirements

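Here's the unintuitive part: a model with 8 experts of 7B parameters each needs memory for all 56B parameters, even though any single token passes through only 2 experts (14B activated). The router and shared layers add another 2-5B. In practice, experts usually replace only the feed-forward blocks while the attention layers are shared, which is why Mixtral 8x7B totals about 47B parameters rather than 56B.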

Mixtral 8x7B as a concrete example:
- Total parameters: ~47B
- Active parameters per token: ~13B
- Model weights in FP16: ~94GB
- Minimum VRAM for inference (weights only): ~24GB (4-bit), ~47GB (8-bit), ~94GB (FP16); KV cache and activations need extra room
- Minimum VRAM for training: 200GB+ (weights + optimizer states + activations)
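
The weight figures above follow from simple arithmetic: total parameters times bytes per parameter. The sketch below reproduces that calculation for the Mixtral numbers in the list. It counts weights only, so KV cache, activations, and any quantization metadata are deliberately left out, and the helper name `weight_memory_gb` is made up for this example.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params_billion: float, precision: str) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters x bytes per parameter = gigabytes
    return total_params_billion * BYTES_PER_PARAM[precision]


TOTAL_B = 47.0   # Mixtral 8x7B total parameters, in billions
ACTIVE_B = 13.0  # parameters used per token, in billions

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(TOTAL_B, precision):.1f} GB")

# VRAM is sized by total parameters; per-token compute follows the active ones.
print(f"active fraction per token: {ACTIVE_B / TOTAL_B:.0%}")
```

For 47B parameters this gives 94 GB in FP16, 47 GB in INT8, and 23.5 GB in INT4, so quantization changes the hardware tier while the active-parameter count does not.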

This means inference on Mixtral requires one of the following:
- 2x A100 80GB with tensor parallelism, FP16 ($2.98/hr total on io.net)
- 1x H100 80GB at 4-bit quantization ($2.20/hr on io.net)
- 4x RTX 4090 (24GB each) with expert parallelism and GPTQ quantization ($0.72/hr total on io.net; cheapest but more complex)

Training Mixtral-scale models requires 8x A100 80GB minimum, realistically 16-32 GPUs for reasonable batch sizes.

MoE Training: What's Different

Standard dense model training distributes data across GPUs (data parallelism). MoE models add expert parallelism — different experts live on different GPUs, and tokens get routed to the right GPU based on the gating network's decision.
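
To make the routing step concrete, here is a minimal sketch of top-2 gating for a single token in plain Python. It keeps the two largest router logits and renormalises them with a softmax, which is one common scheme; other routers normalise differently. The logits and the expert-to-GPU layout are invented for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for one token.

    Keeps the k largest logits and renormalises them so the
    selected experts' weights sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


# Hypothetical layout: 8 experts spread over 4 GPUs, two experts per GPU.
EXPERT_TO_GPU = {expert: expert // 2 for expert in range(8)}

logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router output
for expert, weight in top_k_route(logits):
    print(f"expert {expert} on GPU {EXPERT_TO_GPU[expert]}, weight {weight:.3f}")
```

The token's output is the weighted sum of the two selected experts' outputs, so with this layout the token has to visit whichever GPUs host those experts.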

Key challenges:

1. Expert load balancing
If all tokens route to the same 2 experts while 6 others sit idle, you've wasted 75% of your GPU capacity. The router must distribute work evenly. This is handled by auxiliary load-balancing losses during training, but it requires monitoring and tuning.
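
One widely used form of the auxiliary loss comes from Switch Transformer: for N experts it computes N times the sum over experts of f_i * P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The sketch below implements that formula for top-1 routing in plain Python. The example inputs are invented, and real training code would scale the result by a small coefficient before adding it to the main loss.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i(f_i * P_i).

    router_probs: per-token list of probabilities over experts.
    assignments:  per-token index of the expert the token was sent to.
    """
    num_tokens = len(assignments)
    total = 0.0
    for i in range(num_experts):
        # f_i: fraction of tokens dispatched to expert i
        f_i = sum(1 for a in assignments if a == i) / num_tokens
        # P_i: mean router probability assigned to expert i
        p_i = sum(p[i] for p in router_probs) / num_tokens
        total += f_i * p_i
    return num_experts * total


# Perfectly balanced: four tokens, four experts, uniform probabilities.
balanced = load_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], 4)

# Collapsed: every token goes to expert 0 with high confidence.
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)

print(f"balanced: {balanced:.2f}, collapsed: {collapsed:.2f}")
```

The balanced case evaluates to 1.0, the minimum under uniform routing, while the collapsed case comes out at about 3.88. Logging this value per layer is a cheap way to spot routing collapse early.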

2. All-to-all communication
After the router decides which expert handles which token, tokens need to physically move to the GPU holding that expert. This all-to-all communication pattern is more complex than the AllReduce used in dense training. NVLink or InfiniBand is not optional — PCIe bandwidth will bottleneck you badly.
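
A small simulation helps show why this pattern stresses the interconnect. The sketch below builds the send-count matrix for one dispatch step and estimates the bytes that leave their source GPU. The token placement, expert layout, and hidden size of 4096 are illustrative assumptions, and the helper names are made up for this example; it does not call any real communication library.

```python
def dispatch_counts(token_gpu, token_experts, expert_to_gpu, num_gpus):
    """counts[src][dst] = token copies GPU src must send to GPU dst."""
    counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src, experts in zip(token_gpu, token_experts):
        for expert in experts:
            counts[src][expert_to_gpu[expert]] += 1
    return counts


def cross_gpu_bytes(counts, hidden_size, bytes_per_value=2):
    """Bytes leaving their source GPU in one dispatch.

    The combine step that returns expert outputs moves the same volume again.
    """
    off_diagonal = sum(
        count
        for src, row in enumerate(counts)
        for dst, count in enumerate(row)
        if src != dst
    )
    return off_diagonal * hidden_size * bytes_per_value


# Hypothetical setup: 2 GPUs, experts 0-1 on GPU 0 and experts 2-3 on GPU 1.
expert_to_gpu = {0: 0, 1: 0, 2: 1, 3: 1}
token_gpu = [0, 0, 1, 1]                          # GPU where each token starts
token_experts = [(0, 2), (1, 3), (0, 1), (2, 3)]  # top-2 experts per token

counts = dispatch_counts(token_gpu, token_experts, expert_to_gpu, num_gpus=2)
print(counts)
print(cross_gpu_bytes(counts, hidden_size=4096), "bytes cross the interconnect")
```

Here half of the token copies cross GPUs, and that happens in every MoE layer on both the forward and backward pass, which is why the link bandwidth matters so much.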

3. Memory fragmentation
With 8 experts, each carrying its own optimizer state, and dynamic activation patterns, memory allocation is less predictable than in dense models. Expect to need 10-20% more memory headroom than the theoretical minimum.

Cost and Performance Estimates

Mixtral 8x7B inference:
| Config | Monthly cost (io.net, 24/7) | Tokens/sec | Use case |
|--------|---------------------------|-----------|----------|
| 2x A100 80GB | $2,145 | 85 | Production API, FP16 |
| 1x H100 SXM | $1,584 | 140 | High-throughput API |
| 4x RTX 4090 (GPTQ) | $518 | 55 | Cost-optimized, quantized |
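
The monthly figures in this table are the hourly prices quoted earlier multiplied by 720 hours (30 days at 24 hours). The sketch below reproduces them and also derives a cost per million tokens. That second number assumes the tokens/sec column is sustained throughput at full utilization, which is an assumption of this example rather than a measured result.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 h

# name: (hourly price in USD, tokens per second), taken from the text and table
CONFIGS = {
    "2x A100 80GB": (2.98, 85),
    "1x H100 SXM": (2.20, 140),
    "4x RTX 4090 (GPTQ)": (0.72, 55),
}


def monthly_cost(hourly_usd):
    """Cost of running around the clock for a 30-day month."""
    return hourly_usd * HOURS_PER_MONTH


def usd_per_million_tokens(hourly_usd, tokens_per_sec):
    """Serving cost per one million generated tokens at full utilization."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


for name, (hourly, tps) in CONFIGS.items():
    print(f"{name}: ${monthly_cost(hourly):,.0f}/month, "
          f"${usd_per_million_tokens(hourly, tps):.2f} per 1M tokens")
```

Under that assumption the three configurations come out at roughly $9.74, $4.37, and $3.64 per million tokens, so the H100 and RTX 4090 setups are much closer per token than their monthly prices suggest.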

Mixtral 8x7B fine-tuning (LoRA):
| Config | Time per epoch | Total cost |
|--------|---------------|-----------|
| 4x A100 80GB | 18 hours | $107 |
| 8x H100 SXM | 6 hours | $106 |

Training a custom MoE (8x3B experts, 24B total):
| Config | Time (100K steps) | Total cost |
|--------|-------------------|-----------|
| 8x A100 80GB (NVLink) | 72 hours | $860 |
| 16x A100 80GB (2 nodes) | 40 hours | $955 |
| 8x H100 SXM (NVLink) | 30 hours | $528 |
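
To compare the rows above on equal terms, it helps to compute job cost as GPUs times per-GPU rate times hours, and to check how well the two-node run scales. The sketch below does both. The per-GPU rates are derived from the hourly prices quoted earlier ($2.98 for two A100s, $2.20 for one H100), and the function names are invented for this example.

```python
A100_HOURLY = 2.98 / 2  # per-GPU rate implied by the 2x A100 price
H100_HOURLY = 2.20

# (label, number of GPUs, per-GPU hourly rate, wall-clock hours)
RUNS = [
    ("8x A100 80GB (NVLink)", 8, A100_HOURLY, 72),
    ("16x A100 80GB (2 nodes)", 16, A100_HOURLY, 40),
    ("8x H100 SXM (NVLink)", 8, H100_HOURLY, 30),
]


def job_cost(num_gpus, hourly_per_gpu, hours):
    """Total rental cost for one training run."""
    return num_gpus * hourly_per_gpu * hours


def scaling_efficiency(base_gpus, base_hours, gpus, hours):
    """Achieved speedup divided by the ideal (linear) speedup."""
    return (base_hours / hours) / (gpus / base_gpus)


for label, gpus, rate, hours in RUNS:
    print(f"{label}: {gpus * hours} GPU-hours, ${job_cost(gpus, rate, hours):,.2f}")

efficiency = scaling_efficiency(8, 72, 16, 40)
print(f"16-GPU run scaling efficiency: {efficiency:.0%}")
```

This gives $858.24, $953.60, and $528.00, within a few dollars of the rounded totals in the table. Doubling the A100 count cuts wall-clock time by a factor of 1.8, about 90% scaling efficiency, so the extra cost of the two-node run is the price of cross-node communication.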

Framework Support

Megablocks (recommended for training custom MoE):
Efficient MoE training with block-sparse operations. Handles expert parallelism and load balancing natively. Works with PyTorch.

DeepSpeed-MoE:
Microsoft's implementation. Good for scaling to many nodes with ZeRO optimization. Handles expert parallelism, data parallelism, and pipeline parallelism simultaneously.

vLLM (for inference):
Supports Mixtral and other MoE architectures out of the box. Handles expert routing and tensor parallelism across multiple GPUs automatically. Recommended for production serving.
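
As a rough starting point, serving Mixtral with vLLM's offline `LLM` API might look like the sketch below. The model id is the public instruct checkpoint on the Hugging Face Hub; the sampling values and the helper names `build_engine_args` and `generate` are placeholders chosen for this example. Calling `generate` needs vLLM installed, enough GPU memory, and a weight download, so the import is kept inside the function.

```python
def build_engine_args(num_gpus: int) -> dict:
    """Keyword arguments for vllm.LLM (values here are illustrative)."""
    return {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "tensor_parallel_size": num_gpus,  # shard the model across this many GPUs
        "dtype": "float16",
    }


def generate(prompts, num_gpus=2):
    """Run a batch of prompts through vLLM and return the generated text."""
    # Imported lazily so build_engine_args works on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_args(num_gpus))
    params = SamplingParams(temperature=0.7, max_tokens=128)
    return [output.outputs[0].text for output in llm.generate(prompts, params)]


# Example call on a 2-GPU node (not executed here):
#   print(generate(["Explain expert parallelism in one sentence."]))
print(build_engine_args(2))
```

Setting `tensor_parallel_size` to the number of GPUs on the node matches the 2x A100 configuration described earlier; a quantized checkpoint would need a different model id and settings.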

HuggingFace Transformers:
Supports Mixtral inference and LoRA fine-tuning. For training from scratch, you'll need Megablocks or DeepSpeed-MoE.
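
For LoRA fine-tuning, a minimal setup with Transformers and the PEFT library could look like the following sketch. The rank, alpha, and dropout values are placeholder hyperparameters rather than tuned recommendations, and targeting only the attention projections (which leaves the experts and router frozen) is one possible choice among several. `device_map="auto"` additionally requires the accelerate package, and loading the model needs the full weights to fit across the available GPUs.

```python
def lora_settings() -> dict:
    """Illustrative LoRA hyperparameters; tune these for a real run."""
    return {
        "r": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.05,
        # Attention projections only, so expert and router weights stay frozen.
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "task_type": "CAUSAL_LM",
    }


def load_mixtral_with_lora(model_id="mistralai/Mixtral-8x7B-v0.1"):
    """Load the base model sharded across GPUs and attach LoRA adapters."""
    # Imported lazily so lora_settings works without the heavy dependencies.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = get_peft_model(model, LoraConfig(**lora_settings()))
    model.print_trainable_parameters()  # reports how few weights are trained
    return model


print(lora_settings())
```

The returned model can then be passed to a standard training loop or the Transformers `Trainer`; only the adapter weights receive gradients.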

When MoE Makes Sense

Pick MoE over dense models when:
- You need 70B-level quality at 13B-level inference speed
- Your training budget supports the higher total parameter count
- You have multi-GPU infrastructure with fast interconnects
- Inference cost matters more than training cost (fewer active parameters per token means cheaper serving)

Stick with dense models when:
- You're training on a single GPU or a very small node (expert-parallel overhead rarely pays off below 4 GPUs)
- Your model is under 13B parameters (MoE benefits diminish at small scale)
- Simple deployment is a priority (dense models are easier to serve)


Train and serve MoE models on io.net: multi-GPU clusters with NVLink from $11.92/hr.