Quantization
Quantization is the process of converting an LLM's weights from high-precision floating-point numbers (typically 16-bit bfloat or float) to lower-precision integers or floats (8-bit, 4-bit, sometimes 2-bit), shrinking memory footprint and speeding up inference with only a small hit to quality. Modern open-source deployment — llama.cpp, Ollama, vLLM, GPTQ, AWQ — runs almost entirely on quantized models.
Why It Matters
A 70B-parameter model in native 16-bit format needs ~140 GB of GPU memory — out of reach for a single consumer card. The same model in 4-bit quantization takes ~35 GB and fits on two consumer GPUs, a single workstation card, or even a Mac Studio. Quantization is what made local LLMs practical: Llama 3 70B and Mixtral 8×22B run on roughly $3,000 of hardware because of it, and it is what makes very large models like DeepSeek-V3 feasible to self-host at all. For builders, it's the difference between "we can't afford to self-host" and "we can serve our own models."
How It Works
Precision levels:
- FP32 (32-bit float): Training default. 4 bytes per weight. Rarely used for inference.
- FP16 / BF16 (16-bit): Inference default. 2 bytes per weight.
- INT8 (8-bit integer): Half the memory, near-identical quality.
- INT4 / FP4 (4-bit): Quarter the memory, small quality hit (usually 1–3% on benchmarks).
- INT2 (2-bit): Eighth the memory, noticeable quality loss on most tasks.
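The precision levels above translate directly into memory: bits per weight × parameter count. A quick sketch (weights only; it ignores the KV cache, activations, and the scale metadata that makes real quantized files slightly larger):

```python
# Rough memory needed to hold a model's weights at each precision level.
BITS_PER_WEIGHT = {"fp32": 32, "fp16/bf16": 16, "int8": 8, "int4": 4, "int2": 2}

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"70B @ {name:>9}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

This reproduces the numbers used throughout: 140 GB for a 70B model at 16-bit, 35 GB at 4-bit.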
Quantization process:
- Calibration: The model is run over a small dataset to observe activation ranges.
- Scale and zero-point calculation: For each weight tensor, compute scaling factors that map the original range to the integer range.
- Weight conversion: Each weight is quantized, stored as an integer plus per-group or per-channel scale factors.
- Dequantization at inference: At compute time, weights are expanded back to floating-point just before the matrix multiply.
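The steps above can be sketched in a few lines of NumPy. This is a minimal per-tensor asymmetric scheme for illustration only; real quantizers use calibrated ranges and per-group scales:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 8):
    """Map the observed range [w.min(), w.max()] onto the integer grid."""
    qmin, qmax = 0, 2**bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)          # step 2: scale
    zero_point = round(-w.min() / scale)                  # step 2: zero point
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point                           # step 3: stored form

def dequantize(q, scale, zero_point):
    """Step 4: expand back to float just before the matrix multiply."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize(w)
w_hat = dequantize(q, s, z)
print("max abs error:", np.abs(w - w_hat).max())  # small, on the order of scale/2
```

Note that the weights are stored as integers plus the scale and zero point, which is where the memory saving comes from; the float values are reconstructed only transiently during compute.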
Quantization Methods
RTN (Round to Nearest): The simplest method: round each weight to the nearest quantized value, with no calibration. Fast, lowest quality; the baseline every other method improves on.
GPTQ: Group-wise post-training quantization that minimizes reconstruction error. Open-source standard for 4-bit.
AWQ (Activation-aware Weight Quantization): Preserves weights that handle large activations; quantizes the rest more aggressively. Very popular for 4-bit LLMs.
GGUF (Q4_K_M, Q5_K_M, etc.): llama.cpp's family of block-wise quantization formats. K_M variants balance size and accuracy. Q4_K_M is the default local inference format.
SmoothQuant: Moves activation outliers into weights so both can be quantized cleanly. Enables INT8 without much accuracy loss on large models.
QAT (Quantization-Aware Training): Trains the model with quantization in the loop. Best quality but requires re-training.
FP8: A hardware-native 8-bit float format supported by H100/H200 GPUs. Unlike INT8, it works for training as well as inference, with comparable speed benefits.
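To see why block-wise scales (as in GGUF-style formats) beat a single per-tensor scale, here is an illustrative NumPy comparison. The symmetric round-to-nearest scheme and `group_size=32` are assumptions for the sketch, not the actual GGUF file layout:

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Symmetric round-to-nearest over one block of weights; returns the
    dequantized reconstruction so we can measure the error directly."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def groupwise_quantize(w, bits=4, group_size=32):
    """Give each group of `group_size` weights its own scale factor."""
    groups = w.reshape(-1, group_size)
    out = np.stack([rtn_quantize(g, bits) for g in groups])
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[7] = 12.0                                      # one outlier blows up a shared scale

err_tensor = np.abs(w - rtn_quantize(w)).mean()
err_group = np.abs(w - groupwise_quantize(w)).mean()
print(f"per-tensor 4-bit error: {err_tensor:.4f}")
print(f"per-group  4-bit error: {err_group:.4f}")  # much smaller
```

A single outlier forces a coarse scale on the whole tensor; with per-group scales, only the outlier's block pays that price. Methods like AWQ and SmoothQuant are, at heart, more principled ways of handling exactly these outliers.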
Trade-offs
Quality vs compression: The lower the precision, the more the model degrades. 8-bit is almost free; 4-bit is a good default; 2-bit hurts visibly.
Task sensitivity: Math, code, and long reasoning are hit harder by quantization than chat or summarization.
Speed vs memory: Quantization saves memory but doesn't always speed up inference on GPUs with plenty of compute. On memory-bound hardware (consumer GPUs, Apple Silicon), it's a huge speedup.
Calibration data quality: Bad calibration can silently ruin quantized models. Use representative prompts.
Which to Use When
Running on consumer GPU (8–24 GB): 4-bit GGUF (Q4_K_M) or AWQ.
Running on H100/H200: FP8 or INT8 with SmoothQuant.
Edge / mobile: Aggressive 4-bit or 2-bit GGUF; accept quality loss.
Benchmarking research: Keep FP16/BF16 as the reference; quantize only for deployment comparison.
High-stakes production: 8-bit or 16-bit. The marginal cost is worth the quality guarantee.
Common Mistakes
Comparing benchmarks across precisions without noting it: A quantized model's MMLU score isn't directly comparable to the same model's FP16 score.
Ignoring perplexity drift: Even if benchmarks look fine, quantization can degrade specific skills (math especially). Test on your real workload.
Too aggressive too fast: Jumping from FP16 straight to INT2 without testing in between hides where quality broke.
Using unrepresentative calibration data: A model calibrated only on English prompts can lose quality on Korean prompts.
Not measuring end-to-end latency: Quantization affects both memory load and compute. Sometimes throughput doesn't improve because the bottleneck was elsewhere.