What Is Mixed Precision Training and How Does It Speed Up GPU Workloads?

Mixed precision training combines 16-bit (FP16 or BF16) and 32-bit (FP32) floating-point numbers during model training. The core idea: do most of the heavy math in lower precision for speed, but keep a master copy of the weights in full precision for accuracy. It's one of those rare optimizations that's almost entirely upside: up to 2x faster training on tensor-core GPUs, roughly half the memory for 16-bit tensors, and negligible quality loss.

Nearly every serious training run in 2026 uses mixed precision. If yours doesn't, you're leaving performance on the table.

How It Actually Works

During the forward and backward passes, the model's weights and activations are computed in FP16 or BF16. These smaller numbers move through the GPU's tensor cores faster and take up half the memory. But gradients can be very small, small enough that FP16 rounds them to zero (the "underflow" problem). So with FP16, PyTorch's GradScaler multiplies the loss by a large number before backpropagation, which keeps gradients in a representable range, then divides them back down before the optimizer step.
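To make the underflow problem concrete, here is a minimal sketch using only Python's standard library, which can round-trip values through IEEE half precision. The helper name to_fp16, the gradient value, and the scale of 2**16 are assumptions chosen for illustration. Scaling the loss scales every gradient by the same factor (chain rule), so the sketch scales the gradient directly.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 (IEEE half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8        # illustrative gradient, below FP16's smallest subnormal (~6e-8)
scale = 65536.0    # 2**16, an assumed loss-scale value

unscaled_fp16 = to_fp16(grad)          # stored directly in FP16: underflows to zero
scaled_fp16 = to_fp16(grad * scale)    # scaled first: lands in FP16's normal range
recovered = scaled_fp16 / scale        # unscaled again in higher precision

print("without scaling:", unscaled_fp16)
print("with scaling:   ", recovered)
```

The recovered value is not bit-exact, since FP16 still rounds the scaled number, but it is no longer zero.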

The optimizer itself (AdamW, SGD, etc.) keeps a full FP32 copy of the weights. This master copy accumulates the tiny gradient updates accurately over thousands of steps. After each step, the FP32 weights are cast back to FP16 for the next forward pass.
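The value of the FP32 master copy shows up when many small updates are added to a weight. The sketch below is a toy, not how PyTorch stores weights: it emulates FP16 and FP32 rounding with the standard library, and the helper names, the starting weight of 1.0, and the update size of 1e-4 are all assumed for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = 1e-4        # illustrative per-step weight update (assumed value)
w_fp16_only = 1.0    # weight stored and updated only in FP16
w_master = 1.0       # FP32 master copy

for _ in range(1000):
    # Near 1.0, FP16 values are about 0.001 apart, so adding 1e-4 rounds back to 1.0.
    w_fp16_only = to_fp16(w_fp16_only + update)
    # FP32 has enough mantissa bits to keep each small update.
    w_master = to_fp32(w_master + update)

# Cast the master weight down to FP16 for the next forward pass.
w_for_forward = to_fp16(w_master)

print("FP16-only weight:", w_fp16_only)
print("FP32 master:     ", w_master)
print("FP16 cast:       ", w_for_forward)
```

The FP16-only weight never moves, while the master copy ends up close to 1.1 and its FP16 cast carries that progress into the forward pass.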

The Performance Gains

On NVIDIA GPUs with tensor cores (anything from V100 onward), peak 16-bit throughput is roughly double the FP32/TF32 figure:

GPU	FP32/TF32 Peak	16-bit Tensor Core Peak	Theoretical Speedup
RTX 4090	83 TFLOPS	165 TFLOPS (FP16)	1.9x
A100 80GB	156 TFLOPS (TF32)	312 TFLOPS (FP16)	2.0x
H100 SXM	495 TFLOPS (TF32)	990 TFLOPS (FP16)	2.0x

Practical training speedups are typically 1.5-1.8x (not the full 2x theoretical) because data loading, communication, and other operations don't benefit from lower precision.
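The gap between theoretical and practical speedup follows from Amdahl's law: only the share of step time spent in tensor-core math gets faster. Here is a small sketch of that arithmetic. The function name and the fractions 70%, 80%, and 90% are assumptions for illustration, not measurements.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float = 2.0) -> float:
    """Amdahl's law: whole-step speedup when only part of the step gets faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)

# Assumed shares of step time spent in math that benefits from 16-bit precision.
for fraction in (0.7, 0.8, 0.9):
    print(f"{fraction:.0%} accelerated -> {overall_speedup(fraction):.2f}x overall")
```

With a 2x kernel speedup, these fractions give roughly 1.54x, 1.67x, and 1.82x overall, which brackets the typical range quoted above.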

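Memory savings are more consistent: storing a 7B-parameter model's weights takes 28GB in FP32 but 14GB in FP16, and activations shrink similarly, freeing room for larger batches or bigger models. (The FP32 master copy and optimizer state still cost memory, so total training memory falls by less than half.)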
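The 28GB and 14GB figures are simple bytes-per-parameter arithmetic for the weights alone. A quick sketch, assuming 1 GB = 1e9 bytes and a hypothetical helper name:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 7e9                             # a 7B-parameter model
fp32_gb = weight_memory_gb(params, 4)    # FP32: 4 bytes per parameter
fp16_gb = weight_memory_gb(params, 2)    # FP16/BF16: 2 bytes per parameter

print(f"FP32 weights: {fp32_gb:.0f} GB")
print(f"FP16 weights: {fp16_gb:.0f} GB")
```

This counts weights only. Gradients, optimizer state, and activations add to the total in a real training run.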

BF16 vs FP16: Which to Use

Both formats use 16 bits (one sign bit plus exponent and mantissa bits), but they split them differently:

FP16: 5 exponent bits, 10 mantissa bits. Higher precision, smaller dynamic range. Requires loss scaling to avoid gradient underflow.
BF16: 8 exponent bits, 7 mantissa bits. Same range as FP32, lower precision. No loss scaling needed. Slightly noisier but more robust.
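The range and precision trade-off follows directly from the bit counts. The sketch below derives the largest finite value, the smallest normal value, and the machine epsilon from the exponent and mantissa widths. It assumes an IEEE-style layout, with the top exponent code reserved for inf/NaN, and the function name is hypothetical.

```python
def format_limits(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive basic limits of an IEEE-style binary float format from its bit widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    return {
        "max": (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias,  # largest finite value
        "min_normal": 2.0 ** (1 - bias),                   # smallest normal value
        "epsilon": 2.0 ** -mantissa_bits,                  # gap between 1.0 and the next value
    }

fp16 = format_limits(exponent_bits=5, mantissa_bits=10)
bf16 = format_limits(exponent_bits=8, mantissa_bits=7)

for name, limits in (("FP16", fp16), ("BF16", bf16)):
    print(name, limits)
```

FP16 tops out at 65504 with an epsilon of 2**-10, while BF16 reaches about 3.4e38 with a coarser epsilon of 2**-7: more range, less precision.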

Use BF16 if your GPU supports it (A100, H100, RTX 30- and 40-series). It's simpler (no scaler needed) and more numerically stable. Use FP16 only on older GPUs (V100, T4, RTX 20-series) that lack BF16 support.
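The selection rule can be written as a small helper. This is a sketch: pick_amp_dtype and detect_amp_dtype are hypothetical names, and the CPU fallback to float32 is an assumption. detect_amp_dtype relies on torch.cuda.is_bf16_supported(), whose result on older GPUs can vary by PyTorch version, so treat it as a hint and not a guarantee.

```python
def pick_amp_dtype(cuda_available: bool, bf16_supported: bool) -> str:
    """Return the autocast dtype name suggested by the rule above (hypothetical helper)."""
    if not cuda_available:
        return "float32"  # assumed fallback: no mixed precision without a CUDA GPU
    return "bfloat16" if bf16_supported else "float16"

def detect_amp_dtype() -> str:
    """Query the local GPU through PyTorch; requires torch to be installed."""
    import torch  # imported lazily so pick_amp_dtype works without torch
    has_cuda = torch.cuda.is_available()
    has_bf16 = has_cuda and torch.cuda.is_bf16_supported()
    return pick_amp_dtype(has_cuda, has_bf16)

print(pick_amp_dtype(cuda_available=True, bf16_supported=True))
print(pick_amp_dtype(cuda_available=True, bf16_supported=False))
```

The returned name maps onto a PyTorch dtype with getattr(torch, name), and remember that the float16 branch also needs a GradScaler.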

Implementation in PyTorch

It's a few lines of code. Seriously:

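# BF16 on A100/H100 (recommended)
# Assumes model, criterion, optimizer, inputs, and target already exist.
import torch

optimizer.zero_grad()  # clear gradients from the previous step
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(inputs)
    loss = criterion(output, target)
loss.backward()  # backward runs outside the autocast context
optimizer.step()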

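# FP16 on older GPUs (needs GradScaler)
scaler = torch.amp.GradScaler("cuda")  # create once, before the training loop
optimizer.zero_grad()  # clear gradients from the previous step
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(inputs)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # backpropagate the scaled loss
scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
scaler.update()                # adjusts the scale factor for the next step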

With HuggingFace Trainer, it's even simpler:

from transformers import TrainingArguments

training_args = TrainingArguments(
    bf16=True,  # or fp16=True for older GPUs
    # ... other args
)

When Mixed Precision Doesn't Work

There are edge cases where full precision is necessary:

Very small learning rates with FP16 (gradient underflow despite scaling)
Certain loss functions that produce very large or very small intermediate values
Accumulation-heavy operations where rounding errors compound (rare in practice)
Some GAN training setups where discriminator/generator balance is delicate

If you hit NaN losses after enabling mixed precision, try switching from FP16 to BF16 first. If that's not available, lower the initial scale factor (GradScaler's init_scale) or exclude specific layers from autocasting.
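For intuition about how the scale factor interacts with overflowing gradients, here is a toy model of dynamic loss scaling. Its default values mirror those documented for PyTorch's GradScaler (init_scale=65536, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000), but the ToyScaler class and its overflow check are simplifications invented for illustration, not PyTorch's implementation.

```python
import math

FP16_MAX = 65504.0  # largest finite FP16 value

class ToyScaler:
    """Simplified dynamic loss scaler operating on a single scalar gradient."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step(self, grad: float) -> bool:
        """Return True if the optimizer step would run, False if it is skipped."""
        scaled = grad * self.scale
        if not math.isfinite(scaled) or abs(scaled) > FP16_MAX:
            # Overflow: skip this step and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True

scaler = ToyScaler()
skipped = 0
while not scaler.step(4.0):  # illustrative gradient that overflows at the initial scale
    skipped += 1

print("skipped steps:", skipped, "final scale:", scaler.scale)
```

In this toy run, a high initial scale costs a few skipped steps before the scale settles. Starting from a lower scale avoids those early overflows.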
