Quantization | Notion

1. The Core Idea

Neural network weights are normally stored in fp16 (16-bit floats) or bfloat16. That's 2 bytes per parameter. A 7B model = ~14 GB just for weights at inference time.

Quantization = store weights in lower precision (8-bit, 4-bit, sometimes 2-bit). Trade a small amount of model quality for big wins in:

Memory (4× reduction going fp16 → int4)
Serving cost (smaller model fits on cheaper hardware, or more requests per GPU)
Latency (less memory bandwidth needed per forward pass — often the actual bottleneck in inference)

The interview one-liner:

"Quantization reduces the bit-width of model weights — typically fp16 down to int8 or int4 — trading a small quality hit for roughly 2× to 4× memory and bandwidth savings. It's primarily a serving optimization."

2. What "8-bit" or "4-bit" Actually Means

Each weight, instead of being a 16-bit float, gets mapped to one of 256 values (int8) or one of 16 values (int4). You store the integer plus a small scale/zero-point per group of weights so you can reconstruct an approximate float at compute time.

fp16 weight: 0.0234567... (16 bits per param) int8 weight: 47 × scale (8 bits + tiny shared scale) int4 weight: 11 × scale (4 bits + tiny shared scale)

The math at compute time either dequantizes back to fp16 on the fly, or runs directly in the lower precision if hardware supports it.

3. Two Flavors That Matter

Post-Training Quantization (PTQ)

Take an already-trained model, quantize the weights, done
No retraining needed
Most common in practice
Methods: GPTQ, AWQ, GGUF (these are names — know that they exist, don't need depth)