
How to Reduce LLM API Costs in Production

LLM API costs add up fast in production. Here are the practical strategies that work: prompt caching, model routing, batching, output limits, and cost-per-task tracking.

LLM API costs are straightforward to understand and easy to lose control of. You pay per token, on both input and output. A single GPT-5.4 request might cost fractions of a cent. A million requests per day at enterprise scale can cost thousands of dollars, and most of that spend is avoidable with the right architecture.

The goal isn’t to minimize cost at the expense of quality. It’s to stop paying for work the model doesn’t need to do.

Understand Where the Money Goes

Before optimizing, measure. Every LLM request has two cost components:

  • Input tokens: your system prompt, conversation history, retrieved context, and the user’s message
  • Output tokens: the model’s response

Output tokens are typically 3-5x more expensive than input tokens. A verbose response costs more than a concise one, even if the input is identical. This asymmetry is the first lever you should pull.

Track cost per request, cost per user, and cost per task. Aggregate numbers hide the outliers that are often responsible for most of your spend. A single RAG pipeline that stuffs 10,000 tokens of context into every request might account for 40% of your bill.

Prompt Caching

Both OpenAI and Anthropic offer prompt caching, which reduces the cost of repeated prompt prefixes. If your system prompt and few-shot examples are the same across requests (which they usually are), the provider caches the processed prefix and charges a reduced rate for subsequent requests that reuse it.

With OpenAI, caching happens automatically for prompts with shared prefixes longer than 1,024 tokens. Cached input tokens cost 50% less. With Anthropic, you can explicitly mark cache breakpoints in your prompt. Cache reads cost 90% less than standard input tokens, though the initial cache write costs 1.25x the base rate for a 5-minute TTL (or 2x for a 1-hour TTL). The savings compound quickly after the first request.

The impact is substantial for applications with long system prompts, few-shot examples, or shared context blocks. If your system prompt is 2,000 tokens and you make 100,000 requests per day, prompt caching alone can cut your input cost by 50-90% on that prefix.

Structure your prompts so the static portion comes first and the variable portion (user message, retrieved context) comes last. This maximizes the cacheable prefix.
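A minimal sketch of that ordering, assuming a generic chat-messages request format (the system prompt and few-shot examples here are illustrative placeholders):

```python
# Static, cacheable content (system prompt, few-shot examples) goes first;
# per-request content goes last so it doesn't invalidate the cached prefix.
SYSTEM_PROMPT = "You are a support assistant. Answer concisely."
FEW_SHOT = [
    {"role": "user", "content": "Example question?"},
    {"role": "assistant", "content": "Example answer."},
]

def build_messages(user_message: str, retrieved_context: str = "") -> list[dict]:
    """Assemble messages with the static prefix first, variable suffix last."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT]
    suffix = user_message if not retrieved_context else (
        f"Context:\n{retrieved_context}\n\nQuestion: {user_message}"
    )
    messages.append({"role": "user", "content": suffix})
    return messages
```

Every request now shares an identical prefix up to the final user message, which is exactly the shape provider-side caching rewards.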

Model Routing

Not every request needs your most capable (and most expensive) model. A simple classification question doesn’t need GPT-5.4. A straightforward extraction task doesn’t need Claude Opus.

Model routing directs each request to the cheapest model that can handle it:

| Task complexity | Model tier | Relative cost |
| --- | --- | --- |
| Simple classification, extraction, formatting | Small model (GPT-5.4 Mini, Claude Haiku) | 10-20x cheaper |
| Standard generation, summarization, Q&A | Mid-tier model (GPT-5.4, Claude Sonnet) | Baseline |
| Complex reasoning, multi-step planning, code generation | Frontier model (GPT-5.4 Thinking, Claude Opus) | Most expensive |

The routing decision can be rule-based (route by task type) or model-based (use a cheap classifier to estimate difficulty, then route accordingly). Start with rules. A task taxonomy that maps request types to model tiers captures 80% of the savings with minimal complexity.
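The rule-based version can be as simple as a static taxonomy. A sketch, with illustrative task names and placeholder model identifiers:

```python
# Map task types to model tiers; tiers map to concrete models.
ROUTES = {
    "classification": "small",
    "extraction": "small",
    "formatting": "small",
    "summarization": "mid",
    "qa": "mid",
    "code_generation": "frontier",
    "planning": "frontier",
}

MODELS = {
    "small": "gpt-5.4-mini",      # placeholder model names
    "mid": "gpt-5.4",
    "frontier": "gpt-5.4-thinking",
}

def route(task_type: str) -> str:
    """Return the cheapest model that handles the task type.

    Unknown task types fall back to the mid tier, not the frontier tier,
    so new features don't silently run at top cost.
    """
    tier = ROUTES.get(task_type, "mid")
    return MODELS[tier]
```

The fallback choice matters: defaulting unknown tasks to the mid tier forces you to opt in to frontier-model spend explicitly.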

For a deeper comparison of model capabilities and pricing, see GPT vs Claude vs Gemini.

Control Output Length

Output tokens cost more than input tokens, and models tend to be verbose by default. Two simple controls reduce output cost:

max_tokens: Set an explicit ceiling. If you need a one-sentence summary, cap output at 100 tokens. Without a limit, the model might generate 500 tokens of elaboration you’ll throw away.

Prompt instructions: Tell the model to be concise. “Respond in one sentence” or “Keep your response under 50 words” works surprisingly well. Combine this with max_tokens as a hard backstop.

For structured output use cases where you’re extracting data into a fixed schema, the output size is naturally bounded by the schema. These tend to be the most cost-efficient LLM tasks.
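Combining both controls looks like this in practice, sketched as a request-parameters builder (model name and token ceiling are illustrative):

```python
def summarize_request(text: str) -> dict:
    """Build request parameters for a one-sentence summary.

    The prompt instruction keeps typical outputs short; max_tokens is the
    hard backstop if the model elaborates anyway.
    """
    return {
        "model": "gpt-5.4-mini",   # a small model is enough for summaries
        "max_tokens": 100,         # hard ceiling on output spend
        "messages": [
            {"role": "system",
             "content": "Summarize the user's text in one sentence."},
            {"role": "user", "content": text},
        ],
    }
```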

Batching

If your workload includes bulk processing (embedding documents, classifying a backlog of tickets, generating summaries for a dataset), use batch APIs. OpenAI’s Batch API processes requests asynchronously at 50% of the standard cost, with results delivered within 24 hours.

Batching is ideal for:

  • Nightly data processing jobs
  • Backfill operations
  • Evaluation runs
  • Any workflow where latency doesn’t matter

Don’t batch user-facing requests. The latency tradeoff only works for background workloads.
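OpenAI's Batch API takes a JSONL file with one request per line. A sketch of building that file for a ticket-classification backlog (the ticket schema and model name are illustrative):

```python
import json

def build_batch_file(tickets: list[dict], path: str) -> None:
    """Write one batch request per ticket to a JSONL input file."""
    with open(path, "w") as f:
        for ticket in tickets:
            request = {
                "custom_id": f"ticket-{ticket['id']}",  # used to match results
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-5.4-mini",
                    "max_tokens": 20,
                    "messages": [
                        {"role": "system",
                         "content": "Classify this ticket: billing, bug, or other."},
                        {"role": "user", "content": ticket["text"]},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")
```

You then upload the file, create a batch job referencing it, and collect results keyed by `custom_id` when the job completes.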

Reduce Context Size

Every token of context you send costs money. RAG applications that retrieve five chunks of 500 tokens each add 2,500 tokens of context per request. If only one of those chunks is relevant, you’re paying for 2,000 tokens of noise.

Improve retrieval precision first. Better embeddings, better chunking, and re-ranking retrieved results before sending them to the model all reduce wasted context.

Compress conversation history. Instead of sending the full conversation, summarize earlier turns. A 20-turn conversation can be compressed to a 200-token summary of key points plus the last 2-3 turns, cutting context by 80%.
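A sketch of that compression, where `summarize` stands in for a cheap-model call (here it just truncates, so the example stays self-contained):

```python
def summarize(turns: list[dict]) -> str:
    """Placeholder summarizer; in production, call a small model instead."""
    joined = " ".join(t["content"] for t in turns)
    return joined[:200]

def compress_history(history: list[dict], keep_last: int = 3) -> list[dict]:
    """Replace older turns with a short summary; keep recent turns verbatim."""
    if len(history) <= keep_last:
        return history
    summary = summarize(history[:-keep_last])
    return [
        {"role": "system",
         "content": f"Summary of earlier conversation: {summary}"},
        *history[-keep_last:],
    ]
```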

Trim your system prompt. Many system prompts accumulate instructions over time and become bloated. Review yours quarterly. Every removed sentence saves tokens across every request.

Semantic Caching

If users ask similar questions repeatedly, you can cache responses and serve them without making an API call at all. Semantic caching uses embeddings to match incoming queries against previously answered ones. If a new query is semantically similar enough to a cached query (above a configurable threshold), return the cached response.

This works well for support bots, FAQ systems, and any application with repetitive query patterns. It doesn’t work for conversations that depend heavily on unique context or real-time data.

The cache hit rate determines the value. A 30% hit rate on a high-volume application means 30% fewer API calls. Measure it before committing to the infrastructure.

Track Cost Per Task

The most important optimization is visibility. If you don’t know what each task costs, you can’t make informed tradeoffs.

Log every request with:

  • Model used
  • Input tokens and output tokens
  • Cost (calculated from token counts and pricing)
  • Task type or feature area
  • User or tenant ID
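A sketch of that logging, with placeholder prices (substitute your provider's current per-million-token rates):

```python
PRICES = {  # USD per 1M tokens: (input, output) -- illustrative values only
    "gpt-5.4-mini": (0.15, 0.60),
    "gpt-5.4": (2.50, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute request cost in USD from token counts and the pricing table."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def log_request(model, input_tokens, output_tokens, task_type, user_id) -> dict:
    """Build the structured record a real logger would persist."""
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": request_cost(model, input_tokens, output_tokens),
        "task_type": task_type,
        "user_id": user_id,
    }
```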

Aggregate by task type weekly. You’ll find that a small number of task types account for most of your spend, and those are where optimization efforts should focus.

Set budget alerts per task type. If a prompt change accidentally doubles your context size, you want to know within hours, not at the end of the month when the invoice arrives. For broader monitoring patterns, LLM observability covers the full picture.

Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
