FAQ: What Are Model Quantization Techniques and How Do They Cut Inference Costs?

Quantization reduces a model's numerical precision, from 16-bit floats down to 8-bit or even 4-bit values, cutting memory requirements by 50-75% and raising inference throughput by roughly 20-100% depending on the method and hardware. A Llama 3 70B model that needs about 140GB of VRAM in FP16 (two A100 80GBs) fits on a single A100 80GB after 4-bit quantization, which cuts the io.net cost from $2.98/hr to $1.49/hr. The quality tradeoff is smaller than most people expect: often under 1% on standard benchmarks.

If you're running inference at any meaningful scale and not quantizing, you're almost certainly overspending.
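The memory figures above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight. Here is a minimal sketch, assuming weights dominate memory (it ignores the KV cache, activations, and the small per-group scale overhead that 4-bit formats add). The function name is illustrative, not from any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9  # nominal parameter count for a 70B model

for label, bits in [("FP16", 16), ("INT8/FP8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(PARAMS_70B, bits)
    saving = 1 - bits / 16  # fraction saved relative to FP16
    print(f"{label:>8}: {gb:6.1f} GB  ({saving:.0%} smaller than FP16)")
```

At 16 bits this gives the 140GB quoted above. At 4 bits the weights come to about 35GB, which is why the model fits on one 80GB card with room left for the KV cache.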

Quantization Methods Ranked

There are several approaches, and they're not all equal. Here's what works in practice:

GPTQ (Post-Training, 4-bit)
The most battle-tested 4-bit method. Calibrates on a small dataset (128-256 examples), takes 1-4 hours for a 70B model, and produces consistently good results.
- Memory reduction: 75% (FP16 → 4-bit)
- Quality loss: 0.5-1.5% on perplexity benchmarks
- Inference speed: 40-60% faster than FP16
- Best for: Production serving on consumer GPUs

AWQ (Activation-Aware, 4-bit)
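Newer than GPTQ, with slightly better quality in many comparisons. It uses activation statistics to identify the "salient" weights that matter most, then protects them by rescaling their channels before quantization, so every weight can still be stored in 4 bits without a mixed-precision format.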
- Memory reduction: 75%
- Quality loss: 0.3-1.0% (typically better than GPTQ)
- Inference speed: Similar to GPTQ, slightly faster with TensorRT-LLM
- Best for: When quality matters and you need 4-bit
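Both 4-bit methods above store weights as small integers plus a scale and zero-point per group of weights (the group size of 128 appears in the configs later in this article). The sketch below shows only that storage scheme, using plain round-to-nearest on toy data. It is not GPTQ or AWQ themselves: GPTQ additionally adjusts the remaining weights to compensate for rounding error, and AWQ rescales salient channels before rounding. All function names are illustrative.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Returns integer codes plus the (scale, zero_point) needed to dequantize.
    """
    qmax = 2 ** bits - 1                 # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0      # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_groupwise(weights, group_size=128, bits=4):
    """Quantize a flat weight list in independent groups of `group_size`."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weights, not from a real model
groups = quantize_groupwise(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.5f}")
```

Smaller groups track local weight ranges more closely and reduce rounding error, at the cost of storing more scales and zero-points.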

GGUF (llama.cpp format, 2-8 bit)
Designed for CPU and hybrid CPU-GPU inference. Supports a wide range of quantization levels (Q2, Q3, Q4, Q5, Q6, Q8). Popular for local inference but also useful on GPUs with limited VRAM.
- Memory reduction: 50-87.5% depending on quant level
- Quality loss: Varies widely by level — Q4 is comparable to GPTQ, Q2 is noticeably degraded
- Best for: Flexibility, running on mixed hardware

INT8 (8-bit, conservative)
Halves memory with essentially zero quality loss. If you're nervous about quantization, this is the safe entry point.
- Memory reduction: 50%
- Quality loss: <0.1% (practically none)
- Inference speed: 20-30% faster than FP16
- Best for: When quality is non-negotiable but you still need memory savings

FP8 (H100 native)
Hardware-accelerated 8-bit on H100 GPUs. Same memory savings as INT8, but with up to roughly 2x the throughput of FP16, because the H100's tensor cores support FP8 natively.
- Memory reduction: 50%
- Quality loss: <0.2%
- Inference speed: 80-100% faster than FP16 on H100
- Best for: H100 inference at maximum throughput

Real Cost Impact

Here's where it gets concrete:

Llama 3 70B inference comparison:

Precision	GPU Needed	io.net Cost	Tokens/sec	Cost per 1M tokens
FP16	2x A100 80GB	$2.98/hr	180	$4.60
INT8	1x A100 80GB	$1.49/hr	240	$1.72
GPTQ 4-bit	1x RTX 4090	$0.18/hr	48	$1.04
AWQ 4-bit	1x A100 40GB	$1.20/hr	190	$1.75
FP8	1x H100	$2.20/hr	420	$1.45

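On cost per token, the 4-bit GPTQ model on a single $0.18/hr RTX 4090 comes out cheapest at $1.04 per 1M tokens, though at 48 tokens/sec it is also the slowest option in the table. One caveat: 4-bit weights for a 70B model come to roughly 35GB, more than the RTX 4090's 24GB of VRAM, so that setup depends on partial CPU offload or a lower-bit quant. If you need higher throughput per GPU, FP8 on H100 is the sweet spot.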
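The cost-per-token column can be reproduced from the hourly rate and throughput. This is a minimal sketch, assuming the GPU is fully utilized at the quoted tokens/sec for the whole billed hour (real deployments with idle time will cost more per token). The function name is illustrative.

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """USD to generate 1M tokens on hourly-billed hardware at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# (hourly rate in USD, tokens/sec) copied from the comparison table
configs = {
    "FP16, 2x A100 80GB": (2.98, 180),
    "INT8, 1x A100 80GB": (1.49, 240),
    "GPTQ 4-bit, 1x RTX 4090": (0.18, 48),
    "AWQ 4-bit, 1x A100 40GB": (1.20, 190),
    "FP8, 1x H100": (2.20, 420),
}

for name, (rate, tps) in configs.items():
    print(f"{name:<26} ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```

The results agree with the table to within a cent of rounding. Swapping in your own measured throughput is the quickest way to compare configurations for your workload.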

How to Quantize Your Model

GPTQ quantization with AutoGPTQ:

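from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B"  # Hugging Face repo ID of the base model

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # one scale/zero-point per 128 weights
    desc_act=False   # skip activation-order quantization: faster inference, slightly lower accuracy
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibrate on sample data: a list of tokenized examples (128-256 in practice).
# The single sentence below is a placeholder; use text from your own domain.
calibration_texts = ["Quantization reduces the numerical precision of model weights."]
calibration_dataset = [tokenizer(text) for text in calibration_texts]

model.quantize(calibration_dataset)
model.save_quantized("./llama3-70b-gptq-4bit")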

Time: 2-4 hours on a single A100 80GB for a 70B model. Cost on io.net: $3-$6.

AWQ quantization with AutoAWQ:

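from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B"  # Hugging Face repo ID of the base model

model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# quantize() runs its own calibration pass, using the library's default
# calibration data unless you supply your own
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
)
model.save_quantized("./llama3-70b-awq-4bit")
tokenizer.save_pretrained("./llama3-70b-awq-4bit")  # keep the tokenizer with the weights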

Or skip this step and download a pre-quantized model. TheBloke and other community members publish GPTQ, AWQ, and GGUF versions of most popular models on Hugging Face, so there is no need to quantize yourself unless you're working with a custom model.

Quality Benchmarks: How Much Do You Lose?

On Llama 3 70B across common benchmarks:

Method	MMLU	HellaSwag	ARC-Challenge	Avg degradation
FP16 (baseline)	79.5	87.8	68.3	—
INT8	79.4	87.7	68.1	-0.1%
GPTQ 4-bit	78.8	87.0	67.5	-0.8%
AWQ 4-bit	79.0	87.3	67.8	-0.5%
GPTQ 3-bit	76.2	85.1	65.0	-3.2%

4-bit quantization costs less than 1% accuracy. For most applications — chatbots, document analysis, coding assistance, content generation — this is imperceptible to end users. 3-bit starts to show noticeable degradation and is only worth it for extreme cost optimization.
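The table does not say how "Avg degradation" is computed, so the sketch below tries two plausible readings: the mean change in absolute benchmark points, and the mean change relative to each FP16 score. The scores are copied from the table; the function names are illustrative.

```python
BASELINE = {"MMLU": 79.5, "HellaSwag": 87.8, "ARC-Challenge": 68.3}  # FP16 row

SCORES = {
    "INT8":       {"MMLU": 79.4, "HellaSwag": 87.7, "ARC-Challenge": 68.1},
    "GPTQ 4-bit": {"MMLU": 78.8, "HellaSwag": 87.0, "ARC-Challenge": 67.5},
    "AWQ 4-bit":  {"MMLU": 79.0, "HellaSwag": 87.3, "ARC-Challenge": 67.8},
    "GPTQ 3-bit": {"MMLU": 76.2, "HellaSwag": 85.1, "ARC-Challenge": 65.0},
}


def mean_point_change(scores, baseline=BASELINE):
    """Average change versus baseline, in absolute benchmark points."""
    return sum(scores[k] - baseline[k] for k in baseline) / len(baseline)


def mean_relative_change(scores, baseline=BASELINE):
    """Average change versus baseline, as a percentage of each baseline score."""
    return 100 * sum((scores[k] - baseline[k]) / baseline[k] for k in baseline) / len(baseline)


for method, scores in SCORES.items():
    print(f"{method:<11} {mean_point_change(scores):+.2f} pts  "
          f"{mean_relative_change(scores):+.2f}%")
```

The point-based reading reproduces the INT8 and 4-bit rows of the table, and both 4-bit methods stay under 1% on the relative reading as well. The 3-bit row works out to about -3.1 points rather than the -3.2% listed, which may reflect rounding in the underlying scores.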

Run quantized models on io.net: serve Llama 70B on a single GPU from $0.18/hr.