Quantization
Quantization is the process of converting an LLM's weights from high-precision floating-point numbers (typically 16-bit bfloat or float) to lower-precision integers or floats (8-bit, 4-bit, sometimes 2-bit), shrinking memory footprint and speeding up inference with only a small hit to quality. Modern open-source deployment — llama.cpp, Ollama, vLLM, GPTQ, AWQ — runs almost entirely on quantized models.
Why It Matters
A 70B-parameter model in native 16-bit format needs ~140 GB of GPU memory — out of reach for a single consumer card. The same model in 4-bit quantization takes ~35 GB and fits on two consumer GPUs, a single workstation card, or even a Mac Studio. Quantization is what made local LLMs practical: Llama 3 70B and Mixtral 8×22B run on roughly $3,000 of hardware because of it, and it is what makes very large models like DeepSeek-V3 feasible to self-host at all. For builders, it's the difference between "we can't afford to self-host" and "we can serve our own models."
How It Works
Precision levels:
- FP32 (32-bit float): Training default. 4 bytes per weight. Rarely used for inference.
- FP16 / BF16 (16-bit): Inference default. 2 bytes per weight.
- INT8 (8-bit integer): Half the memory, near-identical quality.
- INT4 / FP4 (4-bit): Quarter the memory, small quality hit (usually 1–3% on benchmarks).
- INT2 (2-bit): Eighth the memory, noticeable quality loss on most tasks.
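The precision levels above translate directly into memory: bits per weight × parameter count. A quick sketch (weights only; it ignores the KV cache, activations, and the scale metadata that makes real quantized files slightly larger):

```python
# Rough memory needed to hold a model's weights at each precision level.
BITS_PER_WEIGHT = {"fp32": 32, "fp16/bf16": 16, "int8": 8, "int4": 4, "int2": 2}

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"70B @ {name:>9}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

This reproduces the numbers used throughout: 140 GB for a 70B model at 16-bit, 35 GB at 4-bit.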
Quantization process:
- Calibration: The model is run over a small dataset to observe activation ranges.
- Scale and zero-point calculation: For each weight tensor, compute scaling factors that map the original range to the integer range.
- Weight conversion: Each weight is quantized, stored as an integer plus per-group or per-channel scale factors.
- Dequantization at inference: At compute time, weights are expanded back to floating-point just before the matrix multiply.
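The steps above can be sketched in a few lines of NumPy. This is a minimal per-tensor asymmetric scheme for illustration only; real quantizers use calibrated ranges and per-group scales:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int = 8):
    """Map the observed range [w.min(), w.max()] onto the integer grid."""
    qmin, qmax = 0, 2**bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)          # step 2: scale
    zero_point = round(-w.min() / scale)                  # step 2: zero point
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point                           # step 3: stored form

def dequantize(q, scale, zero_point):
    """Step 4: expand back to float just before the matrix multiply."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize(w)
w_hat = dequantize(q, s, z)
print("max abs error:", np.abs(w - w_hat).max())  # small, on the order of scale/2
```

Note that the weights are stored as integers plus the scale and zero point, which is where the memory saving comes from; the float values are reconstructed only transiently during compute.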
Quantization Methods
RTN (Round to Nearest): The simplest method: round each weight to the nearest quantized value, with no calibration. Fast, lowest quality; the baseline every other method improves on.
GPTQ: Group-wise post-training quantization that minimizes reconstruction error. Open-source standard for 4-bit.
AWQ (Activation-aware Weight Quantization): Preserves weights that handle large activations; quantizes the rest more aggressively. Very popular for 4-bit LLMs.
GGUF (Q4_K_M, Q5_K_M, etc.): llama.cpp's family of block-wise quantization formats. K_M variants balance size and accuracy. Q4_K_M is the default local inference format.
SmoothQuant: Moves activation outliers into weights so both can be quantized cleanly. Enables INT8 without much accuracy loss on large models.
QAT (Quantization-Aware Training): Trains the model with quantization in the loop. Best quality but requires re-training.
FP8: A hardware-native 8-bit float format supported by H100/H200 GPUs. Unlike INT8, it works for training as well as inference, with comparable speed benefits.
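To see why block-wise scales (as in GGUF-style formats) beat a single per-tensor scale, here is an illustrative NumPy comparison. The symmetric round-to-nearest scheme and `group_size=32` are assumptions for the sketch, not the actual GGUF file layout:

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Symmetric round-to-nearest over one block of weights; returns the
    dequantized reconstruction so we can measure the error directly."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def groupwise_quantize(w, bits=4, group_size=32):
    """Give each group of `group_size` weights its own scale factor."""
    groups = w.reshape(-1, group_size)
    out = np.stack([rtn_quantize(g, bits) for g in groups])
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[7] = 12.0                                      # one outlier blows up a shared scale

err_tensor = np.abs(w - rtn_quantize(w)).mean()
err_group = np.abs(w - groupwise_quantize(w)).mean()
print(f"per-tensor 4-bit error: {err_tensor:.4f}")
print(f"per-group  4-bit error: {err_group:.4f}")  # much smaller
```

A single outlier forces a coarse scale on the whole tensor; with per-group scales, only the outlier's block pays that price. Methods like AWQ and SmoothQuant are, at heart, more principled ways of handling exactly these outliers.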
Trade-offs
Quality vs compression: The lower the precision, the more the model degrades. 8-bit is almost free; 4-bit is a good default; 2-bit hurts visibly.
Task sensitivity: Math, code, and long reasoning are hit harder by quantization than chat or summarization.
Speed vs memory: Quantization saves memory but doesn't always speed up inference on GPUs with plenty of compute. On memory-bound hardware (consumer GPUs, Apple Silicon), it's a huge speedup.
Calibration data quality: Bad calibration can silently ruin quantized models. Use representative prompts.
Which to Use When
Running on consumer GPU (8–24 GB): 4-bit GGUF (Q4_K_M) or AWQ.
Running on H100/H200: FP8 or INT8 with SmoothQuant.
Edge / mobile: Aggressive 4-bit or 2-bit GGUF; accept quality loss.
Benchmarking research: Keep FP16/BF16 as the reference; quantize only for deployment comparison.
High-stakes production: 8-bit or 16-bit. The marginal cost is worth the quality guarantee.
Common Mistakes
Comparing benchmarks across precisions without noting it: A quantized model's MMLU score isn't directly comparable to the same model's FP16 score.
Ignoring perplexity drift: Even if benchmarks look fine, quantization can degrade specific skills (math especially). Test on your real workload.
Too aggressive too fast: Jumping from FP16 straight to INT2 without testing in between hides where quality broke.
Using unrepresentative calibration data: A model calibrated only on English prompts can lose quality on Korean prompts.
Not measuring end-to-end latency: Quantization affects both memory load and compute. Sometimes throughput doesn't improve because the bottleneck was elsewhere.