The NVIDIA H100 comes in two form factors: SXM (a socketed module that mounts on an NVLink-connected HGX baseboard) and PCIe (a standard PCI Express add-in card). The SXM version has a TDP of up to 700W and a 900 GB/s NVLink interconnect, and is designed for multi-GPU training in dense server configurations. The PCIe version has a 350W TDP and no NVLink switch fabric (only an optional bridge that links pairs of cards), and is best suited to single-GPU inference or installation in standard servers.
For multi-GPU training, SXM is roughly 30-60% faster in the benchmarks below, thanks to NVLink's high-bandwidth GPU-to-GPU communication (900 GB/s bidirectional per GPU vs. about 128 GB/s bidirectional, or 64 GB/s per direction, for PCIe 5.0 x16). However, PCIe is about 30% cheaper to rent, more widely available, and perfectly adequate for inference, fine-tuning smaller models, or single-GPU workloads.
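To make the interconnect gap concrete, the following is a rough, bandwidth-only sketch of how long one ring all-reduce of gradients would take over each link. The 1 GB payload is purely illustrative, and the per-direction rates (450 GB/s for NVLink, taken as half of the 900 GB/s bidirectional figure, and 64 GB/s for PCIe 5.0 x16) are assumptions; real collectives add latency, protocol overhead, and overlap with compute, so treat the result as a lower bound rather than a prediction.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound for one ring all-reduce.

    In a ring all-reduce each GPU sends (and receives) 2 * (N - 1) / N
    times the payload. Latency and protocol overhead are ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


GB = 1e9
payload = 1 * GB  # assumption: 1 GB of gradients per synchronization

# Assumption: per-direction bandwidth is what limits each ring step.
nvlink = ring_allreduce_seconds(payload, 8, 450 * GB)
pcie = ring_allreduce_seconds(payload, 8, 64 * GB)

print(f"NVLink: {nvlink * 1e3:.1f} ms")
print(f"PCIe:   {pcie * 1e3:.1f} ms")
print(f"PCIe takes {pcie / nvlink:.1f}x longer per all-reduce")
```

The end-to-end training gap is much smaller than this raw ratio because communication is only part of each step and can partly overlap with computation.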
Technical Specifications Comparison
| Specification | H100 SXM | H100 PCIe |
|---|---|---|
| Form Factor | Proprietary SXM5 module | Standard PCIe Gen 5 x16 card |
| TDP (Power) | 700W | 350W |
| GPU Memory | 80GB HBM3 | 80GB HBM2e |
| Memory Bandwidth | 3.35 TB/s | 2.0 TB/s |
| NVLink | 900 GB/s (18 links) | No switch fabric; optional NVLink bridge (600 GB/s, pairs of cards only) |
| PCIe Interface | PCIe Gen 5 x16 | PCIe Gen 5 x16 |
| Compute (FP16 Tensor Core, with sparsity) | 1,979 TFLOPS | 1,513 TFLOPS |
| Compute (INT8 Tensor Core, with sparsity) | 3,958 TOPS | 3,026 TOPS |
| Physical Integration | Requires HGX H100 server | Standard dual-slot card |
| Price (io.net) | $2.20/hour | $1.49/hour |
| Availability | Limited (data center only) | More widely available |
Performance Differences
Multi-GPU Training (8x H100):
Llama 3 70B Full Fine-Tuning (10K samples, 3 epochs)
SXM Configuration:
- 8x H100 SXM with NVLink (900 GB/s interconnect)
- Training time: 24 hours
- Throughput: 12,500 tokens/second
- Cost: $17.60/hour × 24 = $422
- GPU utilization: 92%
PCIe Configuration:
- 8x H100 PCIe without NVLink (PCIe 5.0 x16, about 64 GB/s per direction)
- Training time: 38 hours (58% longer)
- Throughput: 7,900 tokens/second
- Cost: $11.92/hour × 38 = $453
- GPU utilization: 78% (communication bottleneck)
Winner: SXM (faster completion, lower total cost despite higher hourly rate)
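The cost comparison above is simple arithmetic: hourly rate per GPU times GPU count times wall-clock hours. A minimal sketch follows, using the rates and training times quoted in this section; the function name `job_cost` is just an illustrative helper.

```python
def job_cost(rate_per_gpu_hour: float, num_gpus: int, hours: float) -> float:
    """Total rental cost in dollars for a job of the given length."""
    return rate_per_gpu_hour * num_gpus * hours


# Figures taken from the fine-tuning example in this section.
sxm_cost = job_cost(2.20, num_gpus=8, hours=24)
pcie_cost = job_cost(1.49, num_gpus=8, hours=38)

print(f"SXM:  ${sxm_cost:,.2f}")
print(f"PCIe: ${pcie_cost:,.2f}")
print(f"SXM saves ${pcie_cost - sxm_cost:,.2f} and finishes {38 - 24} hours sooner")
```

Swapping in your own measured training times is the quickest way to check whether the same conclusion holds for a different model or dataset.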
Single-GPU Inference:
Llama 3 8B Inference Serving (vLLM)
SXM Configuration:
- 1x H100 SXM
- Throughput: 85 tokens/second
- Latency (TTFT): 18ms
- Cost: $2.20/hour
- Power: 700W
PCIe Configuration:
- 1x H100 PCIe
- Throughput: 82 tokens/second (4% slower)
- Latency (TTFT): 19ms
- Cost: $1.49/hour (32% cheaper)
- Power: 350W
Winner: PCIe (nearly identical performance, significantly cheaper)
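One way to compare the two inference setups on equal footing is cost per million generated tokens and energy per token. The sketch below uses the throughput and price figures quoted above; the energy figure assumes each card draws its full TDP, which is an upper bound rather than a measurement.

```python
def usd_per_million_tokens(rate_per_hour: float, tokens_per_second: float) -> float:
    """Rental cost of generating one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return rate_per_hour / tokens_per_hour * 1_000_000


def joules_per_token_at_tdp(tdp_watts: float, tokens_per_second: float) -> float:
    """Upper-bound energy per token, assuming the GPU runs at full TDP."""
    return tdp_watts / tokens_per_second


sxm_usd = usd_per_million_tokens(2.20, 85)
pcie_usd = usd_per_million_tokens(1.49, 82)
sxm_joules = joules_per_token_at_tdp(700, 85)
pcie_joules = joules_per_token_at_tdp(350, 82)

print(f"SXM:  ${sxm_usd:.2f} per 1M tokens, up to {sxm_joules:.1f} J/token")
print(f"PCIe: ${pcie_usd:.2f} per 1M tokens, up to {pcie_joules:.1f} J/token")
```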
When to Use Each
Choose H100 SXM When:
- Multi-GPU Training (4+ GPUs)
  - Large language model training (70B+ parameters)
  - Distributed training with high inter-GPU communication
  - Model parallelism (tensor parallelism, pipeline parallelism)
  - Example: Training GPT-style models, Stable Diffusion XL fine-tuning
- Maximum Performance Required
  - Time-sensitive research deadlines
  - Production training pipelines where speed matters
  - Benchmarking and competitive ML competitions
- High-Throughput Batch Inference
  - Serving large models (70B+) with high concurrency
  - Batch processing millions of requests per day
  - Example: Enterprise LLM API serving at scale
Choose H100 PCIe When:
- Single-GPU Workloads
  - Inference serving for 7B-13B models
  - Fine-tuning smaller models (<30B parameters)
  - Development and experimentation
  - Example: Hosting Llama 3 8B API endpoint
- Cost Optimization
  - Budget-constrained projects
  - Long-running inference workloads (cost adds up)
  - Example: 24/7 chatbot serving ($1.49/hr × 730 hrs = $1,088/month vs. $1,606/month for SXM)
- Standard Server Infrastructure
  - Deploying in existing PCIe servers
  - Edge deployments without HGX chassis
  - Easier hardware sourcing and flexibility
- Power-Constrained Environments
  - Colocation facilities with power limits
  - Sustainability initiatives (50% lower power consumption)
Architecture Differences
H100 SXM Architecture:
HGX H100 Server Chassis:
┌─────────────────────────────────┐
│ 8x H100 SXM modules │
│ ├─ NVLink Switch (900 GB/s) │
│ ├─ All-to-all interconnect │
│ └─ Shared power/cooling │
│ │
│ Requirements: │
│ - HGX baseboard (~$50K) │
│ - 8x 700W GPUs (5.6kW GPU power) │
│ - High-airflow or liquid cooling │
│ - Specialized chassis │
└─────────────────────────────────┘
H100 PCIe Architecture:
Standard Server:
┌─────────────────────────────────┐
│ 8x H100 PCIe cards │
│ ├─ PCIe 5.0 x16 per GPU │
│ ├─ No direct GPU interconnect │
│ └─ Air-cooled │
│ │
│ Requirements: │
│ - Standard server chassis │
│ - 8x 350W GPUs (2.8kW GPU power) │
│ - Air cooling (fans) │
│ - Commodity hardware │
└─────────────────────────────────┘
Real-World Benchmarks
Multi-GPU Training Performance:
| Workload | Metric | 8x SXM | 8x PCIe | SXM Advantage |
|---|---|---|---|---|
| Llama 3 70B training | Samples/sec | 185 | 117 | +58% |
| Stable Diffusion XL | Images/sec | 420 | 290 | +45% |
| GPT-NeoX 20B | Tokens/sec | 28,500 | 19,200 | +48% |
| BERT fine-tuning | Steps/sec | 145 | 112 | +29% |
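The "SXM Advantage" column is the ratio of the two throughput figures minus one. A short sketch that recomputes it from the table's raw numbers, which is handy when you substitute your own measurements:

```python
# (SXM throughput, PCIe throughput) copied from the table above.
benchmarks = {
    "Llama 3 70B training": (185, 117),
    "Stable Diffusion XL": (420, 290),
    "GPT-NeoX 20B": (28_500, 19_200),
    "BERT fine-tuning": (145, 112),
}


def advantage_pct(sxm: float, pcie: float) -> float:
    """Relative SXM speedup over PCIe, as a percentage."""
    return (sxm / pcie - 1) * 100


advantages = {name: advantage_pct(s, p) for name, (s, p) in benchmarks.items()}

for name, pct in advantages.items():
    print(f"{name:<22} +{pct:.0f}%")
```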
Single-GPU Inference Performance:
| Model | Metric | SXM | PCIe | Difference |
|---|---|---|---|---|
| Llama 3 8B | Tokens/sec | 85 | 82 | +4% |
| Llama 3 70B | Tokens/sec | 38 | 36 | +6% |
| Mistral 7B | Tokens/sec | 92 | 89 | +3% |
| SD 1.5 | Images/sec | 12 | 11.5 | +4% |
Verdict: SXM provides only a minimal advantage for single-GPU workloads (not worth the 48% price premium).
Memory Bandwidth Impact
SXM: 3.35 TB/s HBM3 Bandwidth
- Faster weight loading from VRAM
- Better performance on memory-bound operations
- Advantage for inference with large batch sizes
PCIe: 2.0 TB/s HBM2e Bandwidth
- Sufficient for most workloads
- Bottleneck only in extreme cases (very large batches)
Real-World Impact:
- Inference: 3-5% performance difference
- Training: 8-12% difference (memory-bound operations)
- Fine-tuning: 5-8% difference
Cost Analysis: 30-Day Deployment
Inference Workload (24/7 Llama 3 8B serving):
SXM: $2.20/hour × 720 hours = $1,584
PCIe: $1.49/hour × 720 hours = $1,073
Savings: $511/month (32% cheaper)
Performance difference: 4% (negligible for inference)
Winner: PCIe (massive cost savings, minimal performance loss)
Training Workload (8x GPU, 200 hours/month):
SXM: $17.60/hour × 200 hours = $3,520
PCIe: $11.92/hour × 315 hours = $3,755
(PCIe needs about 58% longer to complete the same training: 200 hours × 1.58 ≈ 315 hours)
Winner: SXM (faster time-to-result, lower total cost despite higher hourly rate)
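The training result generalizes to a simple break-even rule: PCIe is cheaper overall only while its slowdown factor stays below the ratio of the hourly rates. The sketch below works this out from the prices quoted in this section; `cheaper_option` is a hypothetical helper, and the slowdown factor is something you would need to measure for your own workload.

```python
def break_even_slowdown(sxm_rate: float, pcie_rate: float) -> float:
    """Slowdown factor at which PCIe and SXM cost the same for one job."""
    return sxm_rate / pcie_rate


def cheaper_option(sxm_rate: float, pcie_rate: float, pcie_slowdown: float) -> str:
    """Return which form factor gives the lower total cost for a fixed job.

    pcie_slowdown is PCIe wall-clock time divided by SXM wall-clock time.
    """
    return "SXM" if pcie_slowdown > break_even_slowdown(sxm_rate, pcie_rate) else "PCIe"


threshold = break_even_slowdown(2.20, 1.49)
print(f"Break-even slowdown: {threshold:.2f}x")
print("8x training (1.58x slower on PCIe):", cheaper_option(2.20, 1.49, 1.58))
print("Single-GPU inference (1.04x slower):", cheaper_option(2.20, 1.49, 1.04))
```

At these prices, a PCIe job that takes more than about 1.48 times as long as the SXM run ends up costing more in total.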
Availability on io.net
| GPU Type | Availability | Avg. Wait Time | Typical io.net Price |
|---|---|---|---|
| H100 SXM | Limited | 0-5 minutes | $2.20/hour |
| H100 PCIe | Good | Instant | $1.49/hour |
| A100 SXM | Excellent | Instant | $1.10/hour |
| A100 PCIe | Excellent | Instant | $0.73/hour |
Recommendation Matrix
| Your Workload | Recommended | Reason |
|---|---|---|
| Single-GPU inference | PCIe | 32% cheaper, <5% slower |
| Multi-GPU training (8+ GPUs) | SXM | 45-60% faster, worth premium |
| Fine-tuning 7B-13B | PCIe | Sufficient performance, lower cost |
| Fine-tuning 70B+ | SXM | Faster convergence, better ROI |
| 24/7 inference serving | PCIe | Cost savings compound over time |
| Time-sensitive research | SXM | Maximum speed, results matter more than cost |
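The matrix above can be expressed as a small decision helper. This is only a sketch of the table's rows, not an official sizing tool: the function name, the argument names, and the fallback to PCIe for cases the table does not cover are all assumptions made for illustration.

```python
def recommend(workload: str, num_gpus: int = 1, model_params_b: float = 8,
              time_sensitive: bool = False) -> str:
    """Map a workload description to "SXM" or "PCIe" per the matrix above.

    workload: "inference", "training", or "fine-tuning"
    model_params_b: model size in billions of parameters
    """
    if workload not in {"inference", "training", "fine-tuning"}:
        raise ValueError(f"unknown workload: {workload!r}")
    if time_sensitive:
        return "SXM"  # results matter more than cost
    if workload == "training" and num_gpus >= 8:
        return "SXM"  # NVLink pays off in communication-heavy training
    if workload == "fine-tuning" and model_params_b >= 70:
        return "SXM"
    # Assumption: anything not listed in the matrix defaults to the cheaper card.
    return "PCIe"


print(recommend("inference"))
print(recommend("training", num_gpus=8, model_params_b=70))
print(recommend("fine-tuning", model_params_b=13))
```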
Deploy H100 SXM or PCIe on io.net, typically within minutes, with 60-70% cost savings vs. AWS.
