Quantization lowers a model's numerical precision, from 16-bit floats down to 8-bit or even 4-bit integers, cutting weight memory by 50-75% and raising inference throughput by roughly 20-100% depending on the method and hardware. A Llama 3 70B model that needs about 140GB of VRAM in FP16 (two A100 80GBs) fits on a single A100 80GB after 4-bit quantization, dropping from $2.98/hr to $1.49/hr on io.net. The quality tradeoff is smaller than most people expect: often under 1% on standard benchmarks.
If you're running inference at any meaningful scale and not quantizing, you're almost certainly overspending.
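The memory figures follow from simple arithmetic: parameter count times bits per weight. A minimal sketch, assuming a rounded 70 billion parameters and counting weights only (KV cache, activations, and quantization metadata such as scales are ignored); the function name is illustrative:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # Llama 3 70B, rounded (assumption)

for label, bits in [("FP16", 16), ("INT8 / FP8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    saving = 1 - bits / 16  # reduction relative to 16-bit weights
    print(f"{label:>10}: {gb:6.1f} GB ({saving:.0%} smaller than FP16)")
```

Real deployments need headroom on top of these numbers, so treat them as a lower bound when picking a GPU.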
Quantization Methods Ranked
There are several approaches, and they're not all equal. Here's what works in practice:
GPTQ (Post-Training, 4-bit)
The most battle-tested 4-bit method. Calibrates on a small dataset (128-256 examples), takes 1-4 hours for a 70B model, and produces consistently good results.
- Memory reduction: 75% (FP16 → 4-bit)
- Quality loss: 0.5-1.5% on perplexity benchmarks
- Inference speed: 40-60% faster than FP16
- Best for: Production serving on consumer GPUs
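GPTQ's calibration-driven weight updates are too involved for a short example, but the storage layout it targets (low-bit integers plus one scale and offset per group of weights) can be sketched with plain round-to-nearest. This is a simplified illustration under that assumption, not GPTQ itself; the function names and the random test weights are made up:

```python
import random

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** bits - 1                     # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0          # fall back to 1.0 for a constant group
    q = [round((w - lo) / scale) for w in weights]   # integers in [0, qmax]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    """Map the stored integers back to approximate float weights."""
    return [qi * scale + lo for qi in q]

random.seed(0)
group = [random.gauss(0.0, 0.02) for _ in range(128)]  # one group of 128 weights
q, scale, lo = quantize_group(group)
restored = dequantize_group(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.5f}, max abs error={max_err:.5f}")
```

Smaller groups track the local weight range more closely and reduce error, at the cost of storing more scales.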
AWQ (Activation-Aware, 4-bit)
Newer than GPTQ, with slightly better quality in many comparisons. It uses activation statistics to identify the "salient" weight channels that matter most and protects them by rescaling before quantization, so all weights can still be stored in 4-bit.
- Memory reduction: 75%
- Quality loss: 0.3-1.0% (typically better than GPTQ)
- Inference speed: Similar to GPTQ, slightly faster with TensorRT-LLM
- Best for: When quality matters and you need 4-bit
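The AWQ paper protects salient channels by scaling them up before quantization and scaling the matching activations down, which leaves the exact output unchanged but shrinks the rounding error where activations are large. A toy sketch of that idea follows; the weights, activations, and scale factor are made-up values, and this is not AutoAWQ's implementation:

```python
def quantize_row(weights, bits=4):
    """Symmetric round-to-nearest quantization of one row, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1               # 7 for signed 4-bit
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = [0.20, -0.52, 0.97, 0.34]   # one output row of a weight matrix
x = [10.0, 0.1, 0.1, 0.1]       # channel 0 carries a much larger activation
s = [2.0, 1.0, 1.0, 1.0]        # hypothetical scale: boost only the salient channel

w_scaled = [wi * si for wi, si in zip(w, s)]
x_scaled = [xi / si for xi, si in zip(x, s)]   # keeps w.x mathematically unchanged

exact = dot(w, x)
err_plain = abs(dot(quantize_row(w), x) - exact)
err_scaled = abs(dot(quantize_row(w_scaled), x_scaled) - exact)
print(f"plain: {err_plain:.4f}  scaled: {err_scaled:.4f}")
```

In this example the output error drops by roughly an order of magnitude. Real implementations search for the per-channel scales on calibration data rather than fixing them by hand.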
GGUF (llama.cpp format, 2-8 bit)
Designed for CPU and hybrid CPU-GPU inference. Supports a wide range of quantization levels (Q2, Q3, Q4, Q5, Q6, Q8). Popular for local inference but also useful on GPUs with limited VRAM.
- Memory reduction: 50-87.5% depending on quant level
- Quality loss: Varies widely by level — Q4 is comparable to GPTQ, Q2 is noticeably degraded
- Best for: Flexibility, running on mixed hardware
INT8 (8-bit, conservative)
Halves memory with essentially zero quality loss. If you're nervous about quantization, this is the safe entry point.
- Memory reduction: 50%
- Quality loss: <0.1% (practically none)
- Inference speed: 20-30% faster than FP16
- Best for: When quality is non-negotiable but you still need memory savings
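To see why 8-bit loses so little, here is a minimal sketch of symmetric absmax quantization, one common INT8 scheme. Production libraries typically add per-channel scales and outlier handling; the sample weights below are illustrative:

```python
def quantize_int8(values):
    """Symmetric absmax quantization to signed 8-bit integers."""
    absmax = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = absmax / 127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the stored integers."""
    return [qi * scale for qi in q]

weights = [0.021, -0.340, 0.115, 0.0, -0.007, 0.298]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max abs error={max_err:.5f}")
```

With 255 usable levels, the worst-case rounding error is half of one step, which is under 0.4% of the largest weight magnitude.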
FP8 (H100 native)
Hardware-accelerated 8-bit floating point on H100 GPUs. It gives the same memory savings as INT8 but up to roughly 2x FP16 throughput, because the H100's tensor cores execute FP8 math natively.
- Memory reduction: 50%
- Quality loss: <0.2%
- Inference speed: 80-100% faster than FP16 on H100
- Best for: H100 inference at maximum throughput
Real Cost Impact
Here's where it gets concrete:
Llama 3 70B inference comparison:
| Precision | GPU Needed | io.net Cost | Tokens/sec | Cost per 1M tokens |
|---|---|---|---|---|
| FP16 | 2x A100 80GB | $2.98/hr | 180 | $4.60 |
| INT8 | 1x A100 80GB | $1.49/hr | 240 | $1.72 |
| GPTQ 4-bit | 1x RTX 4090 | $0.18/hr | 48 | $1.04 |
| AWQ 4-bit | 1x A100 40GB | $1.20/hr | 190 | $1.75 |
| FP8 | 1x H100 | $2.20/hr | 420 | $1.46 |
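On these numbers, the 4-bit GPTQ model on a single $0.18/hr RTX 4090 has the lowest cost per token, though also the lowest throughput at 48 tokens/sec. Keep in mind that 70B weights at 4 bits take roughly 35GB, more than the RTX 4090's 24GB of VRAM, so fitting the model on one card requires partial CPU offload or a lower-bit quant. If you need higher throughput per GPU, FP8 on H100 is the sweet spot.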
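The cost-per-token column is just the hourly price divided by hourly token output. A short sketch that recomputes it from the table's prices and throughputs (results may differ from the table by a cent due to rounding; the function name is illustrative):

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars per 1M generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# (hourly price in USD, tokens/sec) taken from the comparison table
configs = {
    "FP16, 2x A100 80GB": (2.98, 180),
    "INT8, 1x A100 80GB": (1.49, 240),
    "GPTQ 4-bit, 1x RTX 4090": (0.18, 48),
    "AWQ 4-bit, 1x A100 40GB": (1.20, 190),
    "FP8, 1x H100": (2.20, 420),
}

for name, (price, tps) in configs.items():
    print(f"{name:<26} ${cost_per_million_tokens(price, tps):.2f} per 1M tokens")
```

Plugging in your own measured throughput matters more than the list price: these figures assume the GPU is kept busy for the whole hour.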
How to Quantize Your Model
GPTQ quantization with AutoGPTQ:
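from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-70B"  # gated repo on Hugging Face

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # one scale/zero point per 128 weights
    desc_act=False,  # faster inference, slightly worse perplexity
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibrate on sample data: a list of tokenized examples.
# Placeholder text; replace with 128-256 samples from your own domain.
calibration_texts = ["Quantization reduces the numerical precision of model weights."]
calibration_dataset = [tokenizer(text) for text in calibration_texts]

model.quantize(calibration_dataset)
model.save_quantized("./llama3-70b-gptq-4bit")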
Time: 2-4 hours on a single A100 80GB for a 70B model. Cost on io.net: $3-$6.
AWQ quantization with AutoAWQ:
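from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B"  # gated repo on Hugging Face
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights, group size 128, GEMM kernel layout
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)  # uses the library's default calibration set

model.save_quantized("./llama3-70b-awq-4bit")
tokenizer.save_pretrained("./llama3-70b-awq-4bit")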
Or just download pre-quantized models. Community members such as TheBloke have published GPTQ/AWQ/GGUF versions of most popular models on Hugging Face, so there is no need to quantize yourself unless you're working with a custom model.
Quality Benchmarks: How Much Do You Lose?
On Llama 3 70B across common benchmarks:
| Method | MMLU | HellaSwag | ARC-Challenge | Avg. change (points) |
|---|---|---|---|---|
| FP16 (baseline) | 79.5 | 87.8 | 68.3 | n/a |
| INT8 | 79.4 | 87.7 | 68.1 | -0.1 |
| GPTQ 4-bit | 78.8 | 87.0 | 67.5 | -0.8 |
| AWQ 4-bit | 79.0 | 87.3 | 67.8 | -0.5 |
| GPTQ 3-bit | 76.2 | 85.1 | 65.0 | -3.1 |
4-bit quantization costs less than one point of accuracy on each of these benchmarks. For most applications (chatbots, document analysis, coding assistance, content generation) this is imperceptible to end users. 3-bit loses roughly three points, which is noticeable, and is only worth it for extreme cost optimization.
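The average column can be reproduced from the per-benchmark scores as a mean difference in points; the relative drop in percent comes out slightly larger. A small sketch using the table's numbers (the helper names are illustrative):

```python
scores = {
    "FP16": (79.5, 87.8, 68.3),        # MMLU, HellaSwag, ARC-Challenge
    "INT8": (79.4, 87.7, 68.1),
    "GPTQ 4-bit": (78.8, 87.0, 67.5),
    "AWQ 4-bit": (79.0, 87.3, 67.8),
    "GPTQ 3-bit": (76.2, 85.1, 65.0),
}
baseline = scores["FP16"]

def mean(xs):
    return sum(xs) / len(xs)

def avg_point_change(method):
    """Mean score difference versus FP16, in benchmark points."""
    return mean(scores[method]) - mean(baseline)

def avg_relative_change(method):
    """Mean per-benchmark change versus FP16, in percent."""
    return mean([(s - b) / b for s, b in zip(scores[method], baseline)]) * 100

for method in scores:
    if method != "FP16":
        print(f"{method:<11} {avg_point_change(method):+.1f} pts  "
              f"{avg_relative_change(method):+.1f}%")
```

Running the same comparison on your own evaluation set is the more reliable check, since degradation varies by task.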
Run quantized models on io.net: serve Llama 70B on a single GPU from $0.18/hr.
