How to Reduce LLM API Costs in Production
LLM API costs add up fast in production. Here are the practical strategies that work: prompt caching, model routing, batching, output limits, and cost-per-task tracking.
LLM API costs are straightforward to understand and easy to lose control of. You pay per token, for both input and output. A single GPT-5.4 request might cost fractions of a cent. A million requests per day at enterprise scale can cost thousands of dollars, and most of that spend is avoidable with the right architecture.
The goal isn’t to minimize cost at the expense of quality. It’s to stop paying for work the model doesn’t need to do.
Understand Where the Money Goes
Before optimizing, measure. Every LLM request has two cost components:
- Input tokens: your system prompt, conversation history, retrieved context, and the user’s message
- Output tokens: the model’s response
Output tokens are typically 3-5x more expensive than input tokens. A verbose response costs more than a concise one, even if the input is identical. This asymmetry is the first lever you should pull.
Track cost per request, cost per user, and cost per task. Aggregate numbers hide the outliers that are often responsible for most of your spend. A single RAG pipeline that stuffs 10,000 tokens of context into every request might account for 40% of your bill.
Prompt Caching
Both OpenAI and Anthropic offer prompt caching, which reduces the cost of repeated prompt prefixes. If your system prompt and few-shot examples are the same across requests (which they usually are), the provider caches the processed prefix and charges a reduced rate for subsequent requests that reuse it.
With OpenAI, caching happens automatically for prompts with shared prefixes longer than 1,024 tokens. Cached input tokens cost 50% less. With Anthropic, you can explicitly mark cache breakpoints in your prompt. Cache reads cost 90% less than standard input tokens, though the initial cache write costs 1.25x the base rate for a 5-minute TTL (or 2x for a 1-hour TTL). The savings compound quickly after the first request.
The impact is substantial for applications with long system prompts, few-shot examples, or shared context blocks. If your system prompt is 2,000 tokens and you make 100,000 requests per day, prompt caching alone can cut your input cost by 50-90% on that prefix.
Structure your prompts so the static portion comes first and the variable portion (user message, retrieved context) comes last. This maximizes the cacheable prefix.
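As a concrete sketch, here is one way to structure a request so the static block is marked as a cache breakpoint, following the shape of Anthropic's Messages API. The model name, system prompt text, and token limit are illustrative assumptions, not recommendations.

```python
# Sketch: put the static prefix first and mark it cacheable; the variable
# user message comes last, outside the cached prefix.

STATIC_SYSTEM = "You are a support assistant for Acme. Follow these rules: ..."

def build_request(user_message: str) -> dict:
    """Build request kwargs with the static system block marked as a
    cache breakpoint (5-minute TTL by default)."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": STATIC_SYSTEM,
                # Everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content stays outside the cacheable prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("How do I reset my password?")
```

You would then pass these kwargs to the SDK call, e.g. `client.messages.create(**req)`. The first request pays the cache-write premium; subsequent requests within the TTL read the prefix at the discounted rate.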
Model Routing
Not every request needs your most capable (and most expensive) model. A simple classification question doesn’t need GPT-5.4. A straightforward extraction task doesn’t need Claude Opus.
Model routing directs each request to the cheapest model that can handle it:
| Task complexity | Model tier | Typical cost reduction |
|---|---|---|
| Simple classification, extraction, formatting | Small model (GPT-5.4 Mini, Claude Haiku) | 10-20x cheaper |
| Standard generation, summarization, Q&A | Mid-tier model (GPT-5.4, Claude Sonnet) | Baseline |
| Complex reasoning, multi-step planning, code generation | Frontier model (GPT-5.4 Thinking, Claude Opus) | Most expensive |
The routing decision can be rule-based (route by task type) or model-based (use a cheap classifier to estimate difficulty, then route accordingly). Start with rules. A task taxonomy that maps request types to model tiers captures 80% of the savings with minimal complexity.
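A rule-based router can be as simple as a lookup table. The task taxonomy and model names below are illustrative assumptions; substitute your own tiers and request types.

```python
# Minimal rule-based router: map each task type to the cheapest
# model tier that can handle it, with a mid-tier fallback.

ROUTES = {
    # Small-model tier: classification, extraction, formatting
    "classify": "gpt-5.4-mini",
    "extract": "gpt-5.4-mini",
    "format": "gpt-5.4-mini",
    # Mid-tier: standard generation, summarization, Q&A
    "summarize": "gpt-5.4",
    "qa": "gpt-5.4",
    # Frontier tier: complex reasoning, planning, code generation
    "plan": "gpt-5.4-thinking",
    "codegen": "gpt-5.4-thinking",
}

DEFAULT_MODEL = "gpt-5.4"  # unmapped task types fall back to mid-tier

def route(task_type: str) -> str:
    """Return the model to use for a given task type."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

Defaulting unknown task types to the mid-tier (rather than the cheapest model) trades a little cost for safety: an unrecognized task is more likely to need capability than not.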
For a deeper comparison of model capabilities and pricing, see GPT vs Claude vs Gemini.
Control Output Length
Output tokens cost more than input tokens, and models tend to be verbose by default. Two simple controls reduce output cost:
max_tokens: Set an explicit ceiling. If you need a one-sentence summary, cap output at 100 tokens. Without a limit, the model might generate 500 tokens of elaboration you’ll throw away.
Prompt instructions: Tell the model to be concise. “Respond in one sentence” or “Keep your response under 50 words” works surprisingly well. Combine this with max_tokens as a hard backstop.
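The two controls combine naturally in one request, sketched below against a chat-completions-style payload. The model name and the 100-token cap are placeholder assumptions.

```python
# Sketch: pair a conciseness instruction (soft limit) with
# max_tokens (hard backstop) in the same request.

def summarize_request(text: str) -> dict:
    """Build a one-sentence summary request with a hard output cap."""
    return {
        "model": "gpt-5.4-mini",
        "max_tokens": 100,  # hard ceiling: output is truncated here
        "messages": [
            {"role": "system", "content": "Respond in one sentence."},
            {"role": "user", "content": f"Summarize: {text}"},
        ],
    }
```

The instruction does most of the work; `max_tokens` only catches the cases where the model ignores it, so set the cap with some headroom above the length you asked for.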
For structured output use cases where you’re extracting data into a fixed schema, the output size is naturally bounded by the schema. These tend to be the most cost-efficient LLM tasks.
Batching
If your workload includes bulk processing (embedding documents, classifying a backlog of tickets, generating summaries for a dataset), use batch APIs. OpenAI’s Batch API processes requests asynchronously at 50% of the standard cost, with results delivered within 24 hours.
Batching is ideal for:
- Nightly data processing jobs
- Backfill operations
- Evaluation runs
- Any workflow where latency doesn’t matter
Don’t batch user-facing requests. The latency tradeoff only works for background workloads.
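The Batch API takes a JSONL file where each line is one request. Here is a sketch of preparing that file for a ticket-classification backfill; the ticket fields, model, and prompt are illustrative assumptions.

```python
import json

# Sketch: build JSONL lines for an OpenAI-style batch file.
# Each line carries a custom_id so results can be matched back to inputs.

def batch_lines(tickets: list[dict]) -> list[str]:
    """One JSON line per ticket, in the Batch API request format."""
    lines = []
    for t in tickets:
        lines.append(json.dumps({
            "custom_id": f"ticket-{t['id']}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.4-mini",
                "max_tokens": 20,
                "messages": [
                    {"role": "user",
                     "content": f"Classify this support ticket: {t['text']}"}
                ],
            },
        }))
    return lines
```

You would write these lines to a `.jsonl` file, upload it with `purpose="batch"`, and create the batch job with a 24-hour completion window; results come back keyed by `custom_id`.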
Reduce Context Size
Every token of context you send costs money. RAG applications that retrieve five chunks of 500 tokens each add 2,500 tokens of context per request. If only one of those chunks is relevant, you’re paying for 2,000 tokens of noise.
Improve retrieval precision first. Better embeddings, better chunking, and re-ranking retrieved results before sending them to the model all reduce wasted context.
Compress conversation history. Instead of sending the full conversation, summarize earlier turns. A 20-turn conversation can be compressed to a 200-token summary of key points plus the last 2-3 turns, cutting context by 80%.
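The compression step above can be sketched as a function that keeps the last few turns verbatim and folds older turns into a summary. The `summarize` stub below just truncates; in production it would be a call to a cheap model.

```python
# Sketch: compress conversation history by summarizing older turns.
# `summarize` is a stub standing in for a small-model call.

def summarize(turns: list[dict]) -> str:
    """Placeholder summarizer: joins and truncates older turns."""
    text = " ".join(t["content"] for t in turns)
    return text[:200]

def compress_history(history: list[dict], keep_last: int = 3) -> list[dict]:
    """Keep the last `keep_last` turns; replace the rest with a summary."""
    if len(history) <= keep_last:
        return history
    older, recent = history[:-keep_last], history[-keep_last:]
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize(older)}",
    }
    return [summary] + recent
```

A 20-turn history collapses to four messages: one summary plus the three most recent turns.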
Trim your system prompt. Many system prompts accumulate instructions over time and become bloated. Review yours quarterly. Every removed sentence saves tokens across every request.
Semantic Caching
If users ask similar questions repeatedly, you can cache responses and serve them without making an API call at all. Semantic caching uses embeddings to match incoming queries against previously answered ones. If a new query is semantically similar enough to a cached query (above a configurable threshold), return the cached response.
This works well for support bots, FAQ systems, and any application with repetitive query patterns. It doesn’t work for conversations that depend heavily on unique context or real-time data.
The cache hit rate determines the value. A 30% hit rate on a high-volume application means 30% fewer API calls. Measure it before committing to the infrastructure.
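A minimal semantic cache looks like this. The `embed` callable is injected (in production it would call an embedding model), and the 0.9 threshold is an assumption you would tune against your own traffic.

```python
import math

# Sketch of a semantic cache: match incoming queries against cached
# ones by embedding similarity, and serve cached responses on a hit.

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed            # callable: str -> vector
        self.threshold = threshold
        self.entries = []             # list of (embedding, response)

    def get(self, query: str):
        """Return a cached response if a similar-enough query exists."""
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]            # cache hit: no API call made
        return None                   # cache miss: caller hits the API

    def put(self, query: str, response: str):
        self.entries.append((self.embed(query), response))
```

A linear scan over entries is fine at small scale; past a few thousand cached queries you would swap in a vector index.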
Track Cost Per Task
The most important optimization is visibility. If you don’t know what each task costs, you can’t make informed tradeoffs.
Log every request with:
- Model used
- Input tokens and output tokens
- Cost (calculated from token counts and pricing)
- Task type or feature area
- User or tenant ID
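The per-request log entry above can be sketched as follows. The prices are placeholders, not real rates; pull the current numbers from your provider's pricing page.

```python
# Sketch: compute per-request cost from token counts and a pricing table.
# Prices are in dollars per million tokens and are placeholder values.

PRICE_PER_MTOK = {
    "gpt-5.4-mini": {"input": 0.15, "output": 0.60},
    "gpt-5.4": {"input": 2.50, "output": 10.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request from token counts."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def log_request(model, input_tokens, output_tokens, task_type, user_id) -> dict:
    """One log record per request, ready to ship to your analytics store."""
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": request_cost(model, input_tokens, output_tokens),
        "task_type": task_type,
        "user_id": user_id,
    }
```

With `task_type` and `user_id` on every record, the weekly aggregation and per-task budget alerts become simple group-by queries.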
Aggregate by task type weekly. You’ll find that a small number of task types account for most of your spend, and those are where optimization efforts should focus.
Set budget alerts per task type. If a prompt change accidentally doubles your context size, you want to know within hours, not at the end of the month when the invoice arrives. For broader monitoring patterns, LLM observability covers the full picture.