What Is Quantization in AI?
Quantization shrinks AI models by reducing numerical precision. Here's how it works, what formats exist, and how to choose the right tradeoff between size, speed, and quality.
AI models are built from billions of numbers called parameters. Each parameter is stored using a certain amount of memory, and when you multiply billions of parameters by even a few bytes each, the total gets large fast. A 70-billion-parameter model stored at standard precision needs about 140GB of memory, far more than the 24GB most GPUs have.
Quantization solves this by storing each parameter with fewer bits. Instead of using 16 bits per number, you can use 8, 4, or even 2. The model gets smaller, loads faster, and runs on hardware that couldn’t otherwise handle it.
The tradeoff is precision. With fewer bits, each number is slightly less accurate, which means the model’s outputs are slightly less precise too. But the loss is smaller than you’d expect. For most practical tasks, a quantized model performs nearly as well as the original while using a fraction of the memory.
How Model Weights Are Stored
Every parameter in a neural network is a number representing the strength of a connection between neurons. To store billions of these numbers, the model uses a specific numerical format, and that format determines how much memory the model needs.
FP32 (32-bit floating point) uses 32 bits per number. High precision, but a 7B parameter model needs 28GB just for the weights. This was the standard for training, but it’s wasteful for inference.
FP16 / BF16 (16-bit) cuts storage in half. A 7B model drops to about 14GB. BF16 (bfloat16) keeps the same range as FP32 but with less precision, which works well for neural networks because they tolerate small rounding errors. Most models are distributed in 16-bit format.
INT8 (8-bit integer) maps floating-point weights to 256 discrete values. A 7B model fits in about 7GB. Quality loss is minimal for most tasks.
INT4 (4-bit integer) maps weights to just 16 discrete values. A 7B model fits in about 4GB. Quality loss becomes noticeable on tasks requiring nuance, but remains acceptable for many practical applications.
The pattern: each halving of bit-width roughly halves memory and improves inference speed, while incrementally degrading output quality.
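The memory figures above follow from one formula: parameters times bits per parameter, divided by eight to get bytes. A minimal sketch (weights only; real model files add overhead for metadata, embeddings, and per-group quantization scales):

```python
# Estimate weight-only memory for a model at different bit widths.
# These are back-of-envelope numbers, not exact file sizes.

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Bytes for weights alone: parameters * bits / 8, converted to GB."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Running this reproduces the progression above: 28 GB at FP32, 14 GB at 16-bit, 7 GB at INT8, and 3.5 GB at INT4.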
Post-Training Quantization vs. Quantization-Aware Training
There are two fundamentally different approaches to quantization.
Post-training quantization (PTQ) takes a fully trained model and converts its weights to lower precision after the fact. No retraining. You start with a 16-bit model and produce a 4-bit version. This is fast, cheap, and by far the most common approach. Every quantized model you download from Hugging Face or run through Ollama uses some form of PTQ.
The downside of PTQ is that the model never learned to cope with reduced precision. Some weights are more sensitive to rounding than others. Naive rounding can amplify errors through the network.
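The core of naive PTQ is a simple round-trip: scale the floating-point weights into an integer range, round, and scale back. A toy sketch of symmetric 8-bit quantization with one scale per tensor (real schemes such as GPTQ, AWQ, and GGUF's K-quants use per-group scales and error compensation; this only shows the rounding error PTQ has to manage):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # One scale per tensor: map the largest-magnitude weight to +/-127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean absolute rounding error: {err:.6f}")
```

Each weight lands within half a quantization step of its original value; it is these small per-weight errors that can compound through the network's layers.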
Quantization-aware training (QAT) simulates quantization during training or fine-tuning, so the model learns to produce good outputs even with reduced-precision weights. PyTorch's published benchmarks show QAT recovering up to 96% of the accuracy lost to quantization, compared to PTQ. The result is a model that performs better at low bit-widths.
The downside of QAT is cost. It requires retraining (or fine-tuning), which needs compute, data, and engineering effort. For most users, PTQ is the practical choice because someone else has already done the quantization and published the result.
Quantization Formats
Three formats dominate the ecosystem, each with different tradeoffs.
GGUF
GGUF (GPT-Generated Unified Format) is the most widely used format for local inference. It was created for llama.cpp and supports mixed-precision quantization with many granularity options.
The naming convention tells you the precision: Q8_0 is 8-bit, Q6_K is 6-bit with K-quant optimization, Q5_K_M is 5-bit medium, Q4_K_M is 4-bit medium, Q2_K is 2-bit. K-quant variants use different precision for different layers based on their sensitivity, which preserves quality better than uniform quantization.
GGUF runs everywhere. It works on CPUs, Apple Silicon, NVIDIA GPUs, and AMD GPUs through llama.cpp and Ollama. When you run ollama run llama3.2, you’re running a GGUF model. The format supports CPU-GPU split inference, where part of the model runs on the GPU and the rest on the CPU, which lets you run models that don’t fully fit in VRAM.
GPTQ
GPTQ (Generative Pre-trained Transformer Quantization) uses a calibration-based approach. It quantizes weights one at a time, using a small calibration dataset to measure and compensate for the error each quantized weight introduces. The remaining un-quantized weights are adjusted to offset the error, preserving overall model behavior.
GPTQ produces 4-bit models that retain roughly 95-96% of full-precision quality. It’s popular for GPU inference and integrates well with vLLM for production serving. The calibration step means GPTQ quantization takes longer than simpler methods, but it’s a one-time cost.
AWQ
AWQ (Activation-Aware Weight Quantization) takes a different approach to deciding which weights matter. Instead of looking at weight magnitudes, it analyzes activation patterns to determine how much each weight actually affects the model's output during real inference. The key insight: protecting just 1% of the most important weights (identified by activation magnitude) while aggressively quantizing the remaining 99% reduces quantization error significantly.
AWQ achieves similar or slightly better quality than GPTQ at 4-bit precision, with faster quantization. It’s supported by vLLM and other serving frameworks.
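The activation-aware insight can be illustrated with a toy experiment. This is a sketch of the idea only, not the AWQ algorithm (real AWQ rescales salient channels rather than storing them in mixed precision): weights that multiply large activations dominate the output error, so keeping the top ~1% of them at full precision while 4-bit quantizing the rest cuts the error bound substantially.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)      # toy weight vector
x = rng.lognormal(0, 1.5, size=4096)    # activations with a few large outliers

def quant4(v):
    # Symmetric 4-bit round-trip: map the largest weight to +/-7.
    scale = np.abs(v).max() / 7.0
    return np.clip(np.round(v / scale), -7, 7) * scale

def weighted_error(w_hat):
    # Sum of |x_i * error_i|: an upper bound on the output perturbation.
    return np.sum(np.abs(x * (w_hat - w)))

naive = quant4(w)

salience = np.abs(x)                    # activation-aware importance
keep = np.argsort(salience)[-41:]       # top ~1% of 4096 weights
protected = quant4(w)
protected[keep] = w[keep]               # those stay full precision

err_naive = weighted_error(naive)
err_protected = weighted_error(protected)
print(f"naive 4-bit error bound:     {err_naive:.4f}")
print(f"protected 4-bit error bound: {err_protected:.4f}")
```

Because the protected weights sit exactly where the activations are largest, the error bound drops even though 99% of the weights are still stored at 4 bits.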
Choosing a Format
| Factor | GGUF | GPTQ | AWQ |
|---|---|---|---|
| Hardware | CPU, GPU, Apple Silicon | GPU (NVIDIA) | GPU (NVIDIA) |
| Precision options | Q2 through Q8 | Primarily 4-bit | Primarily 4-bit |
| Best for | Local/desktop use | GPU production serving | GPU production serving |
| Tool support | Ollama, llama.cpp | vLLM, AutoGPTQ | vLLM, AutoAWQ |
| CPU-GPU split | Yes | No | No |
For running models on your own machine, GGUF with Q4_K_M is the standard starting point. For serving models on GPU infrastructure, GPTQ or AWQ through vLLM is the production choice.
Quality Impact at Different Bit Widths
Not all quantization levels are equal. The quality degradation is non-linear.
8-bit (Q8): Nearly imperceptible quality loss. Benchmarks typically show less than 1% degradation on standard tasks. If you have the memory, 8-bit is the safe choice, giving you half the memory of FP16 with virtually no quality penalty.
6-bit (Q6_K): Very close to 8-bit quality. A good middle ground for models that almost fit at 8-bit but need a little more compression. Not commonly discussed, but GGUF’s K-quant format makes it a practical option.
4-bit (Q4_K_M): The sweet spot for most users. About 5-10% quality degradation compared to full precision. A 7B model fits in 4GB. A 13B model fits in 8GB. Quality is good enough for summarization, Q&A, code generation, and most conversational tasks. Complex reasoning and nuanced writing show more noticeable degradation.
2-bit (Q2_K): Significant quality drop, roughly 15-20% degradation. Only practical when you need to run a model that absolutely won’t fit at higher precision. A 70B model at 2-bit (about 18GB) becomes runnable on a 24GB GPU, which is otherwise impossible without multi-GPU setups. The quality of a 2-bit 70B model is often comparable to a 4-bit 13B model.
The general rule: quantize to the lowest precision your hardware requires, not lower. If your GPU has 24GB and the 8-bit version fits, use 8-bit. Only go to 4-bit if 8-bit doesn’t fit.
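The rule can be sketched as a small selection function: prefer the highest bit width whose weights fit in VRAM, with some headroom reserved for the KV cache and activations (the 20% figure here is an assumption, not a fixed requirement):

```python
# Pick the highest-precision quantization that fits in available VRAM,
# reserving headroom for KV cache and activations (20% is an assumption).

def pick_bits(n_params: float, vram_gb: float, headroom: float = 0.2):
    for bits in (16, 8, 6, 4, 2):        # try higher precision first
        needed_gb = n_params * bits / 8 / 1e9
        if needed_gb <= vram_gb * (1 - headroom):
            return bits
    return None                          # model won't fit at any level

print(pick_bits(7e9, 24))    # 7B on a 24GB GPU: 16-bit fits, use it
print(pick_bits(70e9, 24))   # 70B on a 24GB GPU: only 2-bit fits
```

The two printed cases match the guidance above: a 7B model on a 24GB GPU needs no quantization at all, while a 70B model on the same card is only runnable at 2-bit.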
Practical Workflow
For most developers, the workflow is simple.
Using Ollama: Run ollama run llama3.2 and Ollama downloads a pre-quantized GGUF model (4-bit by default). To use a different quantization, pick a tag that names it explicitly, such as ollama run llama3.2:3b-instruct-q8_0 for the 8-bit 3B variant. Ollama handles everything else.
Using Hugging Face models: Search for your model on Hugging Face and look for quantized versions. Model authors and community contributors publish GGUF, GPTQ, and AWQ variants of popular models. Download the variant that matches your hardware.
Self-quantizing: If you need a specific quantization of a model that doesn’t have one published, tools like llama.cpp’s convert script (for GGUF), AutoGPTQ, and AutoAWQ let you quantize from the full-precision weights. This takes time and a machine with enough memory to load the full model, but it’s straightforward.
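A typical llama.cpp workflow looks roughly like the following. Script and binary names have changed across llama.cpp versions, so treat these as an assumed sketch and check your checkout; the model directory and file names are placeholders.

```shell
# 1. Convert full-precision Hugging Face weights to a 16-bit GGUF file
#    (requires enough RAM/disk to hold the full model).
python convert_hf_to_gguf.py ./my-model-dir --outtype f16 --outfile model-f16.gguf

# 2. Quantize the 16-bit GGUF file down to 4-bit (Q4_K_M).
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```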
When Quantization Matters Most
Quantization is most impactful in three scenarios.
Running models locally. Consumer hardware has limited memory. Without quantization, even a 7B model requires 14GB, more than most laptops have available for a single application. At 4-bit, that same model runs comfortably on a machine with 8GB of RAM.
Reducing inference cost. Smaller models fit on smaller (cheaper) GPUs. A 4-bit 70B model that runs on a single A100 (80GB) would need two A100s at 16-bit. Halving the GPU count roughly halves the cost.
Improving inference speed. Fewer bits per weight means less data to move from memory to the compute units. Since the decode phase of inference is memory-bound, reducing memory bandwidth requirements directly increases tokens per second. A 4-bit model typically generates tokens 2-3x faster than the same model at 16-bit.
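The speedup claim follows from back-of-envelope arithmetic: during memory-bound decoding, each generated token reads every weight once, so tokens per second is roughly memory bandwidth divided by model size. A sketch (the bandwidth figure is an assumption, and this ignores KV-cache traffic and dequantization overhead, which is why real-world speedups land around 2-3x rather than the theoretical 4x):

```python
# Theoretical decode speed for a memory-bound model:
# tokens/sec ~= memory bandwidth / bytes read per token (the model size).

def decode_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

bw = 1000.0                                   # ~1 TB/s, high-end GPU class
print(decode_tokens_per_sec(14.0, bw))        # 7B at 16-bit
print(decode_tokens_per_sec(3.5, bw))         # 7B at 4-bit (theoretical 4x)
```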
Quantization doesn’t help with training (which needs higher precision for stable gradient updates) and doesn’t change the model’s architecture, knowledge, or capabilities. It only affects how precisely the existing weights are stored.
The Bigger Picture
Quantization is one piece of the inference optimization stack. Combined with efficient serving, batching, and hardware-aware deployment, it enables AI systems that would otherwise be impractical. The 7B model running on your laptop at 30 tokens per second? That’s quantization making the math work. The startup serving a 70B model on a single GPU? Same thing.
For a deeper treatment of quantization tradeoffs, model selection, and building production inference pipelines, see Get Insanely Good at AI at getaibook.com/book.