1. The Core Idea

Neural network weights are normally stored in fp16 (16-bit floats) or bfloat16. That's 2 bytes per parameter. A 7B model = ~14 GB just for weights at inference time.

Quantization = store weights in lower precision (8-bit, 4-bit, sometimes 2-bit). Trade a small amount of model quality for big wins in:

The interview one-liner:

"Quantization reduces the bit-width of model weights — typically fp16 down to int8 or int4 — trading a small quality hit for roughly 2× to 4× memory and bandwidth savings. It's primarily a serving optimization."


2. What "8-bit" or "4-bit" Actually Means

Each weight, instead of being a 16-bit float, gets mapped to one of 256 values (int8) or one of 16 values (int4). You store the integer plus a small scale/zero-point per group of weights so you can reconstruct an approximate float at compute time.

fp16 weight: 0.0234567... (16 bits per param) int8 weight: 47 × scale (8 bits + tiny shared scale) int4 weight: 11 × scale (4 bits + tiny shared scale)

The math at compute time either dequantizes back to fp16 on the fly, or runs directly in the lower precision if hardware supports it.


3. Two Flavors That Matter

Post-Training Quantization (PTQ)