What Is Quantization in AI?
Quantization shrinks AI models by reducing numerical precision. Here's how it works, what formats exist, and how to choose the right tradeoff between size, speed, and quality.
AI models are built from billions of numbers called parameters. Each parameter is stored using a certain amount of memory, and when you multiply billions of parameters by even a few bytes each, the total gets large fast. A 70-billion-parameter model stored at standard precision needs about 140GB of memory, far more than the 24GB most GPUs have.
Quantization solves this by storing each parameter with fewer bits. Instead of using 16 bits per number, you can use 8, 4, or even 2. The model gets smaller, loads faster, and runs on hardware that couldn’t otherwise handle it.
The tradeoff is precision. With fewer bits, each number is slightly less accurate, which means the model’s outputs are slightly less precise too. But the loss is smaller than you’d expect. For most practical tasks, a quantized model performs nearly as well as the original while using a fraction of the memory.
How Model Weights Are Stored
Every parameter in a neural network is a number representing the strength of a connection between neurons. To store billions of these numbers, the model uses a specific numerical format, and that format determines how much memory the model needs.
FP32 (32-bit floating point) uses 32 bits per number. High precision, but a 7B parameter model needs 28GB just for the weights. This was the standard for training, but it’s wasteful for inference.
FP16 / BF16 (16-bit) cuts storage in half. A 7B model drops to about 14GB. BF16 (bfloat16) keeps the same range as FP32 but with less precision, which works well for neural networks because they tolerate small rounding errors. Most models are distributed in 16-bit format.
INT8 (8-bit integer) maps floating-point weights to 256 discrete values. A 7B model fits in about 7GB. Quality loss is minimal for most tasks.
INT4 (4-bit integer) maps weights to just 16 discrete values. A 7B model fits in about 4GB. Quality loss becomes noticeable on tasks requiring nuance, but remains acceptable for many practical applications.
The pattern: each halving of bit-width roughly halves memory and improves inference speed, while incrementally degrading output quality.
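The memory figures above follow from one formula: parameters times bits per parameter, divided by eight to get bytes. A minimal sketch (weights only; real model files add overhead for metadata, embeddings, and per-group quantization scales):

```python
# Estimate weight-only memory for a model at different bit widths.
# These are back-of-envelope numbers, not exact file sizes.

def weight_memory_gb(n_params: float, bits: int) -> float:
    """Bytes for weights alone: parameters * bits / 8, converted to GB."""
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Running this reproduces the progression above: 28 GB at FP32, 14 GB at 16-bit, 7 GB at INT8, and 3.5 GB at INT4.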
Post-Training Quantization vs. Quantization-Aware Training
There are two fundamentally different approaches to quantization.
Post-training quantization (PTQ) takes a fully trained model and converts its weights to lower precision after the fact. No retraining. You start with a 16-bit model and produce a 4-bit version. This is fast, cheap, and by far the most common approach. Every quantized model you download from Hugging Face or run through Ollama uses some form of PTQ.
The downside of PTQ is that the model never learned to cope with reduced precision. Some weights are more sensitive to rounding than others. Naive rounding can amplify errors through the network.
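The core of naive PTQ is a simple round-trip: scale the floating-point weights into an integer range, round, and scale back. A toy sketch of symmetric 8-bit quantization with one scale per tensor (real schemes such as GPTQ, AWQ, and GGUF's K-quants use per-group scales and error compensation; this only shows the rounding error PTQ has to manage):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # One scale per tensor: map the largest-magnitude weight to +/-127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # toy weight tensor
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"mean absolute rounding error: {err:.6f}")
```

Each weight lands within half a quantization step of its original value; it is these small per-weight errors that can compound through the network's layers.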
Quantization-aware training (QAT) simulates quantization during training or fine-tuning, so the model learns to produce good outputs even with reduced-precision weights. PyTorch's published benchmarks show QAT recovering up to 96% of the accuracy lost to quantization, compared to PTQ. The result is a model that performs better at low bit-widths.
The downside of QAT is cost. It requires retraining (or fine-tuning), which needs compute, data, and engineering effort. For most users, PTQ is the practical choice because someone else has already done the quantization and published the result.
Quantization Formats
Three formats dominate the ecosystem, each with different tradeoffs.
GGUF
GGUF (GPT-Generated Unified Format) is the most widely used format for local inference. It was created for llama.cpp and supports mixed-precision quantization with many granularity options.
The naming convention tells you the precision: Q8_0 is 8-bit, Q6_K is 6-bit with K-quant optimization, Q5_K_M is 5-bit medium, Q4_K_M is 4-bit medium, Q2_K is 2-bit. K-quant variants use different precision for different layers based on their sensitivity, which preserves quality better than uniform quantization.
GGUF runs everywhere. It works on CPUs, Apple Silicon, NVIDIA GPUs, and AMD GPUs through llama.cpp and Ollama. When you run ollama run llama3.2, you’re running a GGUF model. The format supports CPU-GPU split inference, where part of the model runs on the GPU and the rest on the CPU, which lets you run models that don’t fully fit in VRAM.
GPTQ
GPTQ (Generative Pre-trained Transformer Quantization) uses a calibration-based approach. It quantizes weights one at a time, using a small calibration dataset to measure and compensate for the error each quantized weight introduces. The remaining un-quantized weights are adjusted to offset the error, preserving overall model behavior.
GPTQ produces 4-bit models that retain roughly 95-96% of full-precision quality. It’s popular for GPU inference and integrates well with vLLM for production serving. The calibration step means GPTQ quantization takes longer than simpler methods, but it’s a one-time cost.
AWQ
AWQ (Activation-Aware Weight Quantization) takes a different approach to deciding which weights matter. Instead of looking at weight magnitudes, it analyzes activation patterns to determine how much each weight actually affects the model's output during real inference. The key insight: protecting just 1% of the most important weights (identified by activation magnitude) while aggressively quantizing the remaining 99% reduces quantization error significantly.
AWQ achieves similar or slightly better quality than GPTQ at 4-bit precision, with faster quantization. It’s supported by vLLM and other serving frameworks.
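The activation-aware insight can be illustrated with a toy experiment. This is a sketch of the idea only, not the AWQ algorithm (real AWQ rescales salient channels rather than storing them in mixed precision): weights that multiply large activations dominate the output error, so keeping the top ~1% of them at full precision while 4-bit quantizing the rest cuts the error bound substantially.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096)      # toy weight vector
x = rng.lognormal(0, 1.5, size=4096)    # activations with a few large outliers

def quant4(v):
    # Symmetric 4-bit round-trip: map the largest weight to +/-7.
    scale = np.abs(v).max() / 7.0
    return np.clip(np.round(v / scale), -7, 7) * scale

def weighted_error(w_hat):
    # Sum of |x_i * error_i|: an upper bound on the output perturbation.
    return np.sum(np.abs(x * (w_hat - w)))

naive = quant4(w)

salience = np.abs(x)                    # activation-aware importance
keep = np.argsort(salience)[-41:]       # top ~1% of 4096 weights
protected = quant4(w)
protected[keep] = w[keep]               # those stay full precision

err_naive = weighted_error(naive)
err_protected = weighted_error(protected)
print(f"naive 4-bit error bound:     {err_naive:.4f}")
print(f"protected 4-bit error bound: {err_protected:.4f}")
```

Because the protected weights sit exactly where the activations are largest, the error bound drops even though 99% of the weights are still stored at 4 bits.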
Choosing a Format
| Factor | GGUF | GPTQ | AWQ |
|---|---|---|---|
| Hardware | CPU, GPU, Apple Silicon | GPU (NVIDIA) | GPU (NVIDIA) |
| Precision options | Q2 through Q8 | Primarily 4-bit | Primarily 4-bit |
| Best for | Local/desktop use | GPU production serving | GPU production serving |
| Tool support | Ollama, llama.cpp | vLLM, AutoGPTQ | vLLM, AutoAWQ |
| CPU-GPU split | Yes | No | No |
For running models on your own machine, GGUF with Q4_K_M is the standard starting point. For serving models on GPU infrastructure, GPTQ or AWQ through vLLM is the production choice.
Quality Impact at Different Bit Widths
Not all quantization levels are equal. The quality degradation is non-linear.
8-bit (Q8): Nearly imperceptible quality loss. Benchmarks typically show less than 1% degradation on standard tasks. If you have the memory, 8-bit is the safe choice, giving you half the memory of FP16 with virtually no quality penalty.
6-bit (Q6_K): Very close to 8-bit quality. A good middle ground for models that almost fit at 8-bit but need a little more compression. Not commonly discussed, but GGUF’s K-quant format makes it a practical option.
4-bit (Q4_K_M): The sweet spot for most users. About 5-10% quality degradation compared to full precision. A 7B model fits in 4GB. A 13B model fits in 8GB. Quality is good enough for summarization, Q&A, code generation, and most conversational tasks. Complex reasoning and nuanced writing show more noticeable degradation.
2-bit (Q2_K): Significant quality drop, roughly 15-20% degradation. Only practical when you need to run a model that absolutely won’t fit at higher precision. A 70B model at 2-bit (about 18GB) becomes runnable on a 24GB GPU, which is otherwise impossible without multi-GPU setups. The quality of a 2-bit 70B model is often comparable to a 4-bit 13B model.
The general rule: quantize to the lowest precision your hardware requires, not lower. If your GPU has 24GB and the 8-bit version fits, use 8-bit. Only go to 4-bit if 8-bit doesn’t fit.
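The rule can be sketched as a small selection function: prefer the highest bit width whose weights fit in VRAM, with some headroom reserved for the KV cache and activations (the 20% figure here is an assumption, not a fixed requirement):

```python
# Pick the highest-precision quantization that fits in available VRAM,
# reserving headroom for KV cache and activations (20% is an assumption).

def pick_bits(n_params: float, vram_gb: float, headroom: float = 0.2):
    for bits in (16, 8, 6, 4, 2):        # try higher precision first
        needed_gb = n_params * bits / 8 / 1e9
        if needed_gb <= vram_gb * (1 - headroom):
            return bits
    return None                          # model won't fit at any level

print(pick_bits(7e9, 24))    # 7B on a 24GB GPU: 16-bit fits, use it
print(pick_bits(70e9, 24))   # 70B on a 24GB GPU: only 2-bit fits
```

The two printed cases match the guidance above: a 7B model on a 24GB GPU needs no quantization at all, while a 70B model on the same card is only runnable at 2-bit.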
Practical Workflow
For most developers, the workflow is simple.
Using Ollama: Run ollama run llama3.2 and Ollama downloads a pre-quantized GGUF model (4-bit by default). To use a different quantization, pick a tag that names it explicitly, such as ollama run llama3.2:3b-instruct-q8_0 for the 8-bit 3B variant. Ollama handles everything else.
Using Hugging Face models: Search for your model on Hugging Face and look for quantized versions. Model authors and community contributors publish GGUF, GPTQ, and AWQ variants of popular models. Download the variant that matches your hardware.
Self-quantizing: If you need a specific quantization of a model that doesn’t have one published, tools like llama.cpp’s convert script (for GGUF), AutoGPTQ, and AutoAWQ let you quantize from the full-precision weights. This takes time and a machine with enough memory to load the full model, but it’s straightforward.
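A typical llama.cpp workflow looks roughly like the following. Script and binary names have changed across llama.cpp versions, so treat these as an assumed sketch and check your checkout; the model directory and file names are placeholders.

```shell
# 1. Convert full-precision Hugging Face weights to a 16-bit GGUF file
#    (requires enough RAM/disk to hold the full model).
python convert_hf_to_gguf.py ./my-model-dir --outtype f16 --outfile model-f16.gguf

# 2. Quantize the 16-bit GGUF file down to 4-bit (Q4_K_M).
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```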
When Quantization Matters Most
Quantization is most impactful in three scenarios.
Running models locally. Consumer hardware has limited memory. Without quantization, even a 7B model requires 14GB, more than most laptops have available for a single application. At 4-bit, that same model runs comfortably on a machine with 8GB of RAM.
Reducing inference cost. Smaller models fit on smaller (cheaper) GPUs. A 4-bit 70B model that runs on a single A100 (80GB) would need two A100s at 16-bit. Halving the GPU count roughly halves the cost.
Improving inference speed. Fewer bits per weight means less data to move from memory to the compute units. Since the decode phase of inference is memory-bound, reducing memory bandwidth requirements directly increases tokens per second. A 4-bit model typically generates tokens 2-3x faster than the same model at 16-bit.
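The speedup claim follows from back-of-envelope arithmetic: during memory-bound decoding, each generated token reads every weight once, so tokens per second is roughly memory bandwidth divided by model size. A sketch (the bandwidth figure is an assumption, and this ignores KV-cache traffic and dequantization overhead, which is why real-world speedups land around 2-3x rather than the theoretical 4x):

```python
# Theoretical decode speed for a memory-bound model:
# tokens/sec ~= memory bandwidth / bytes read per token (the model size).

def decode_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

bw = 1000.0                                   # ~1 TB/s, high-end GPU class
print(decode_tokens_per_sec(14.0, bw))        # 7B at 16-bit
print(decode_tokens_per_sec(3.5, bw))         # 7B at 4-bit (theoretical 4x)
```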
Quantization doesn’t help with training (which needs higher precision for stable gradient updates) and doesn’t change the model’s architecture, knowledge, or capabilities. It only affects how precisely the existing weights are stored.
The Bigger Picture
Quantization is one piece of the inference optimization stack. Combined with efficient serving, batching, and hardware-aware deployment, it enables AI systems that would otherwise be impractical. The 7B model running on your laptop at 30 tokens per second? That’s quantization making the math work. The startup serving a 70B model on a single GPU? Same thing.
For a deeper treatment of quantization tradeoffs, model selection, and building production inference pipelines, see Get Insanely Good at AI at getaibook.com/book.