
What Is AI Inference and How Does It Work?

Inference is where AI models do their actual work. Here's what happens during inference, why it's the bottleneck, and what determines speed and cost.

Training teaches a model. Inference is where it does work. Every time you send a prompt to ChatGPT, call an LLM API, or run a model on your laptop, you’re running inference. It’s the phase that determines how fast you get a response, how much it costs, and whether you can serve a thousand users at once.

Training vs. Inference

Training adjusts a model’s parameters (weights) to learn patterns from data. It’s expensive, slow, and happens once (or occasionally, for fine-tuning). A frontier model might take months and tens of millions of dollars to train. But once training is done, the weights are fixed.

Inference uses those fixed weights to generate output from new input. No learning happens. The model applies what it already knows. Every API call, every chat message, every code suggestion is inference. The economics of AI are shifting decisively toward inference. By some estimates, for every dollar spent training a frontier model, organizations spend 15-20x more on inference over the model’s production lifetime.

If training is writing the textbook, inference is taking the exam. The preparation is over. Now you perform.

The Two Phases of LLM Inference

For transformer-based language models, inference breaks into two distinct phases.

Prefill processes the entire input prompt in parallel. All input tokens pass through the transformer layers simultaneously. The model builds a KV cache (key-value cache), storing intermediate attention results so they don’t need to be recomputed later. This phase is compute-bound: it does a lot of math on the GPU, processing all tokens at once. Prefill determines your time to first token (TTFT), which is how long you wait before the model starts responding.

Decode generates output tokens one at a time. Each new token depends on all previous tokens (both the original prompt and every token generated so far). The model reads from the KV cache to avoid reprocessing the full sequence, then produces one token, appends it, and repeats. This phase is memory-bound: the bottleneck is reading the growing KV cache from GPU memory, not raw computation. Decode determines your tokens-per-second throughput, which is how fast the response streams to you.

The distinction matters for optimization. Speeding up prefill and speeding up decode require different strategies because they hit different hardware limits.
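The two phases can be sketched as a toy generation loop. This is a minimal sketch, not a real engine: `model`, `prefill`, `decode_step`, and `eos_token` are hypothetical interface names, but the control flow mirrors what production engines do.

```python
def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: process the whole prompt in one parallel pass,
    # populating the KV cache and producing the first output token.
    # This pass is compute-bound and determines TTFT.
    kv_cache, next_token = model.prefill(prompt_tokens)

    output = [next_token]
    # Decode: one token per step, each step reading the growing
    # KV cache. This loop is memory-bound and determines TPS.
    for _ in range(max_new_tokens - 1):
        kv_cache, next_token = model.decode_step(next_token, kv_cache)
        output.append(next_token)
        if next_token == model.eos_token:
            break
    return output
```

Note the asymmetry: prefill is one big parallel pass, while decode is a serial loop whose every iteration touches the entire cache. That is why the two phases hit different hardware limits.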

The KV Cache

The KV cache is the single most important data structure in LLM inference. During the attention mechanism, each token produces a key and a value vector. These vectors are needed every time a new token is generated, because each token attends to all previous tokens. Without caching, the model would recompute attention for the entire sequence at every step. For a 1,000-token response, that means reprocessing all previous tokens 1,000 times.

The KV cache stores these vectors so they’re computed once and reused. The tradeoff is memory. For a large model with a long context, the KV cache can consume gigabytes of GPU memory. A 70B parameter model processing a 32K-token sequence might need 10-20GB just for the KV cache, on top of the memory for the model weights themselves.

This is why context window length affects performance. Longer contexts mean larger KV caches, which means more memory and slower decode steps as the model reads through more cached data.
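The memory footprint follows directly from the model's shape: two vectors (key and value) per layer per token. A back-of-envelope estimate, using assumed Llama-2-70B-like dimensions (80 layers, 8 KV heads with grouped-query attention, head dimension 128, fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Estimate KV cache size: 2 tensors (K and V) per layer,
    one (n_kv_heads x head_dim) entry per token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch)

# Assumed 70B-class shape: 80 layers, 8 KV heads (GQA),
# head_dim 128, fp16 (2 bytes), 32K-token context.
size = kv_cache_bytes(80, 8, 128, 32_768)
print(f"{size / 2**30:.1f} GiB")  # → 10.0 GiB
```

That single sequence consumes about 10 GiB before any batching, which is why long contexts and large batches compete for the same GPU memory.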

Key Performance Metrics

Time to first token (TTFT): How long after sending a prompt before the first token appears. Determined by prefill speed. Users notice this as the initial delay before a response starts streaming. For interactive applications, 200-500ms is good. Over 2 seconds feels slow.

Tokens per second (TPS): How fast tokens generate during the decode phase. This determines how quickly a response completes. Typical ranges: 5-15 TPS on CPU, 30-80 TPS on a single consumer GPU, 100+ TPS on datacenter GPUs.

Throughput: Total tokens generated per second across all concurrent requests. This is the metric that matters for serving many users. A system generating 50 TPS for one user might generate 500 TPS total when batching 20 concurrent requests.

Latency: End-to-end time from request to complete response. Roughly TTFT plus the number of output tokens divided by TPS.
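These metrics combine into a simple latency estimate. The numbers below are illustrative, not measurements:

```python
def end_to_end_latency(ttft_s, output_tokens, tps):
    # Total latency ≈ time to first token + time to stream
    # the remaining tokens at the decode rate.
    return ttft_s + (output_tokens - 1) / tps

# Example: 300 ms TTFT, 400-token response at 50 tokens/sec.
print(f"{end_to_end_latency(0.3, 400, 50):.2f} s")  # → 8.28 s
```

For long responses the decode term dominates, which is why TPS matters more than TTFT for batch workloads, and vice versa for short interactive replies.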

Optimization Techniques

Continuous Batching

Naive batching groups requests together and processes them as a unit, but all requests in the batch must wait for the longest one to finish. Continuous batching dynamically adds and removes requests mid-inference. When a short response finishes, its slot immediately goes to a new request. This eliminates idle GPU cycles and can increase throughput 2-4x compared to static batching.
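The effect can be seen in a toy scheduler. Each request is modeled as just a count of remaining decode steps; this sketch ignores prefill and memory limits, but shows how freed slots are backfilled immediately:

```python
from collections import deque

def continuous_batch_steps(requests, max_slots):
    """Toy continuous-batching scheduler: `requests` is a list of
    decode-step counts. Finished requests free their slot for
    queued ones at the next step."""
    queue = deque(requests)
    slots = []   # requests currently decoding
    steps = 0
    while queue or slots:
        # Backfill any free slots from the queue before stepping.
        while queue and len(slots) < max_slots:
            slots.append(queue.popleft())
        # One decode step for every active request.
        slots = [r - 1 for r in slots]
        steps += 1
        # Drop finished requests; their slots are reusable.
        slots = [r for r in slots if r > 0]
    return steps

# Four requests of lengths 2, 8, 2, 2 on 2 slots finish in 8 steps.
# Static batches of 2 would pad to the longest request: 8 + 2 = 10.
print(continuous_batch_steps([2, 8, 2, 2], max_slots=2))  # → 8
```

The gap widens as response lengths vary more, which is exactly the situation in real chat traffic.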

PagedAttention

Traditional KV cache allocation pre-reserves contiguous GPU memory for the maximum possible sequence length, even if most of that memory goes unused. PagedAttention (introduced by vLLM) borrows the concept of virtual memory from operating systems: it divides the KV cache into fixed-size blocks that can be stored in non-contiguous memory and mapped via block tables. This reduces memory waste from 60-80% down to under 5%, allowing more concurrent requests on the same hardware.
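The idea can be illustrated with a toy block table. This is a sketch of the bookkeeping only, not vLLM's implementation; the block size of 16 tokens is an assumption for the example:

```python
BLOCK_SIZE = 16  # tokens per KV cache block (assumed for this sketch)

class BlockTable:
    """Toy paged KV cache allocator: logical blocks of a sequence
    map to arbitrary physical blocks, allocated only on demand."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.table = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        blocks = self.table.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:       # current block is full
            blocks.append(self.free.pop())   # grab any free block
        return blocks[-1]  # physical block holding this token's KV entry
```

Because blocks are allocated as tokens arrive, a 20-token sequence holds 2 blocks instead of a contiguous reservation sized for the maximum context, so the freed memory can serve other concurrent requests.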

Speculative Decoding

A small, fast “draft” model generates several tokens quickly. The larger target model then verifies those tokens in parallel (verification is cheaper than generation because it can process all draft tokens at once, like prefill). If the draft tokens are correct, you’ve generated multiple tokens in the time it would take to generate one. If some are wrong, the target model corrects from the first mismatch. The speedup depends on how well the draft model predicts what the target model would say, typically 2-3x for well-matched pairs.
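The accept/reject logic for one round can be sketched as follows. Here `verify` stands in for the target model's single parallel pass, which returns what the target would have emitted at each draft position (hypothetical interface, for illustration):

```python
def speculative_step(draft_tokens, verify):
    """One round of speculative decoding: accept the longest prefix
    where draft and target agree, then take the target's correction
    for the first mismatch."""
    target_tokens = verify(draft_tokens)  # one parallel, prefill-like pass
    accepted = []
    for drafted, correct in zip(draft_tokens, target_tokens):
        if drafted == correct:
            accepted.append(drafted)
        else:
            accepted.append(correct)  # target overrides the mismatch
            break
    return accepted

# If the draft guesses 4 tokens and the target agrees with the first
# 2, the round still yields 3 tokens for a single target-model pass.
print(speculative_step([1, 2, 3, 4], lambda ts: [1, 2, 9, 9]))  # → [1, 2, 9]
```

Every round produces at least one correct token (the target's own output at the first mismatch), so the scheme never generates worse text than the target model alone, only faster.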

Quantization

Reducing the precision of model weights from 16-bit to 8-bit or 4-bit makes inference faster and uses less memory. A 7B model at 16-bit needs about 14GB for weights; at 4-bit, about 4GB. The quality loss at 4-bit is modest for most tasks (benchmark scores typically drop on the order of 5-10%). Quantization is the single biggest lever for running models on consumer hardware.
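The memory arithmetic is straightforward: weight memory scales linearly with bits per parameter.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    # Weights only; KV cache and activation buffers come on top.
    return params_billions * bits_per_weight / 8

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# → 14.0, 7.0, and 3.5 GB
```

The 4-bit figure (3.5 GB of weights, roughly 4 GB with runtime overhead) is what puts 7B models within reach of ordinary consumer GPUs and laptops.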

Inference Serving Tools

Ollama is the simplest path for local inference. It manages model downloads, handles quantization, and exposes a local API. One command gets you a running model. Best for development, experimentation, and single-user use.

vLLM is a production inference engine with continuous batching and PagedAttention built in. It achieves up to 24x higher throughput than HuggingFace Transformers for serving. If you’re serving a model to multiple users, vLLM is the standard choice.

TensorRT-LLM is NVIDIA’s optimized inference library. It applies GPU-specific optimizations (kernel fusion, quantization-aware compilation) for maximum performance on NVIDIA hardware. Highest performance ceiling, but vendor-locked.

llama.cpp runs models on CPUs and Apple Silicon with minimal dependencies. It powers Ollama under the hood and is the go-to for running models on machines without NVIDIA GPUs.

Cost Drivers

Inference cost comes down to three factors: hardware utilization, model size, and request volume.

Hardware utilization is why batching and memory optimization matter. A GPU sitting idle between requests costs the same as one at full utilization. Better batching means more tokens per dollar.

Model size directly affects cost. A 70B model needs roughly 10x the compute of a 7B model per token. If a 7B model handles your task adequately, running the 70B version wastes money. Choosing the right model size for the task is the most impactful cost decision.

Request volume determines whether self-hosting pays off. At low volume, API pricing is cheaper because you don’t pay for idle hardware. At high volume, the per-token cost of self-hosted inference drops below API pricing because your hardware runs at high utilization.
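The break-even logic is easy to make concrete. The prices below are hypothetical placeholders, not quotes; the point is the shape of the curves, not the numbers:

```python
def monthly_cost_api(tokens_per_month, usd_per_million_tokens):
    # API cost is purely linear in usage.
    return tokens_per_month / 1e6 * usd_per_million_tokens

def monthly_cost_self_hosted(gpu_usd_per_hour, hours=730):
    # Self-hosting is a fixed cost: the GPU bills whether busy or idle.
    return gpu_usd_per_hour * hours

# Hypothetical rates: $0.50 per 1M tokens via API vs. a $2/hour GPU
# that can sustain the workload.
for tokens in (100e6, 1_000e6, 10_000e6):
    api = monthly_cost_api(tokens, 0.50)
    hosted = monthly_cost_self_hosted(2.0)
    print(f"{tokens / 1e6:>6.0f}M tokens/month: "
          f"API ${api:,.0f} vs self-hosted ${hosted:,.0f}")
```

With these assumed rates, the API wins easily at 100M tokens a month and loses badly at 10B; the crossover sits wherever the linear API cost climbs past the fixed hardware cost, provided the hardware can actually sustain that volume.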

Self-Hosted vs. API Inference

API inference (OpenAI, Anthropic, Google) gives you access to frontier models with zero infrastructure. You pay per token. Scaling is handled for you. The trade-offs: your data goes to a third party, you depend on their uptime, and costs are linear with usage.

Self-hosted inference gives you full control. Your data stays on your infrastructure. Per-token cost decreases as utilization increases. The trade-offs: you manage the hardware, handle scaling, and the quality ceiling is lower than frontier APIs (since you’re running open-weight models).

The practical split for most teams: use APIs for tasks requiring maximum quality (complex reasoning, nuanced writing) and self-hosted models for high-volume, privacy-sensitive, or latency-critical tasks where a smaller model performs adequately.

Why Inference Matters for AI Engineers

Understanding inference mechanics changes how you build AI systems. When you know that decode is memory-bound, you understand why shorter prompts produce faster responses, not just because there’s less to process, but because the KV cache is smaller. When you know that batching exists, you understand why API latency spikes during peak hours. When you know that quantization trades precision for speed, you can make informed decisions about which tradeoff fits your application.

The model is only half the system. The other half is how efficiently you run it. Two teams using the same model can have 10x different costs and latencies depending on their inference setup. For a deeper look at building efficient AI systems, the book covers inference pipelines, optimization strategies, and the full stack from model selection to production deployment.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
