Neural network weights are normally stored in fp16 (16-bit floats) or bfloat16. That's 2 bytes per parameter. A 7B model = ~14 GB just for weights at inference time.
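The 14 GB figure is just parameter count times bytes per parameter. A minimal sketch of that arithmetic follows, with 8-bit and 4-bit rows included for comparison. The function name `weight_memory_gb` and the lookup table are illustrative, 1 GB is taken as 1e9 bytes, and the estimate covers weights only (no activations, KV cache, or per-group scales).

```python
# Bytes needed to store one weight at each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB, assuming 1 GB = 1e9 bytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp16", "int8", "int4"):
    print(f"7B model @ {dtype}: ~{weight_memory_gb(7e9, dtype):.1f} GB")
```

Real deployments land slightly above the 8-bit and 4-bit numbers because the shared scales and zero-points also take space.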
Quantization = store weights in lower precision (8-bit, 4-bit, sometimes 2-bit). Trade a small amount of model quality for big wins in memory footprint and memory bandwidth.
The interview one-liner:
"Quantization reduces the bit-width of model weights — typically fp16 down to int8 or int4 — trading a small quality hit for roughly 2× to 4× memory and bandwidth savings. It's primarily a serving optimization."
Each weight, instead of being a 16-bit float, gets mapped to one of 256 values (int8) or one of 16 values (int4). You store the integer plus a small scale/zero-point per group of weights so you can reconstruct an approximate float at compute time.
fp16 weight: 0.0234567...   (16 bits per param)
int8 weight: 47 × scale     (8 bits + tiny shared scale)
int4 weight: 11 × scale     (4 bits + tiny shared scale)
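The integer-plus-scale mapping can be sketched for a single group of weights as below. This is one common affine (scale plus zero-point) scheme, not the only one: the helper names `quantize_group` and `dequantize_group`, the sample weights, and the choice to force 0.0 into the representable range are all assumptions made for illustration.

```python
def quantize_group(weights, bits=8):
    """Map one group of floats to unsigned ints in [0, 2**bits - 1].

    Returns (ints, scale, zero_point). The range is widened to include 0.0
    so the zero-point always lands inside the integer range.
    """
    qmax = 2 ** bits - 1                      # 255 for 8-bit, 15 for 4-bit
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)           # integer that represents 0.0
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Reconstruct approximate floats from the stored integers."""
    return [(qi - zero_point) * scale for qi in q]


group = [0.0234567, -0.51, 0.12, 0.33, -0.07, 0.48, -0.29, 0.05]
for bits in (8, 4):
    q, scale, zp = quantize_group(group, bits)
    approx = dequantize_group(q, scale, zp)
    worst = max(abs(a - b) for a, b in zip(group, approx))
    print(f"{bits}-bit: ints={q} scale={scale:.5f} worst error={worst:.5f}")
```

Only the integers are stored per weight. The scale and zero-point are stored once per group, which is why their overhead stays small. The reconstruction error is bounded by roughly one quantization step (the scale), so fewer bits means a larger step and a coarser approximation.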
At compute time, the kernel either dequantizes the weights back to fp16 on the fly or runs the math directly in the lower precision if the hardware supports it.
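The two compute paths can be contrasted on a single dot product. This is a toy sketch under stated assumptions: symmetric quantization (zero-point of 0), hypothetical function names, and, for the integer path, activations that have also been quantized with their own scale. Real kernels do this over whole matrices on the GPU.

```python
def dot_dequant_on_the_fly(q_weights, w_scale, activations):
    """Weight-only path: rebuild float weights, then do a float dot product."""
    return sum((qw * w_scale) * x for qw, x in zip(q_weights, activations))


def dot_integer(q_weights, w_scale, q_acts, a_scale):
    """Low-precision path: accumulate integer products, rescale once at the end."""
    acc = sum(qw * qa for qw, qa in zip(q_weights, q_acts))  # integer accumulator
    return acc * w_scale * a_scale


q_weights, w_scale = [47, -12, 3], 0.0005   # stored ints plus shared scale
activations = [1.0, 2.0, -0.5]              # float activations
q_acts, a_scale = [2, 4, -1], 0.5           # the same activations, quantized

print(dot_dequant_on_the_fly(q_weights, w_scale, activations))
print(dot_integer(q_weights, w_scale, q_acts, a_scale))
```

Both paths compute the same quantity here because the sample activations are exactly representable at the chosen scale. The first path still saves memory and bandwidth, since the weights travel as small integers; the second also performs the multiply-accumulate work in integer arithmetic.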