
Claude Code Retrospective Details 5x Drop in Session Costs

Anthropic's new technical retrospective reveals that prompt caching and prefix compaction act as strict architectural constraints for complex agentic workflows.

On April 30, Anthropic published a technical retrospective on Claude Code detailing the architecture of long-context agentic workflows. The report states that prompt caching operates as a rigid system constraint for production deployment. Without prefix sharing, a heavy coding session of 50 to 200 turns costs up to $100 in API credits. At a 90% cache hit rate, the same session costs between $10 and $19.
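As a sanity check on that ratio, here is a rough cost model in Python. The base token price, per-turn context growth, and turn count are illustrative assumptions, not figures from the retrospective; the point is simply that a 90% hit rate at a ~10x cache-read discount cuts input spend by roughly 5x.

```python
# Back-of-the-envelope session cost model. Prices and token counts are
# illustrative assumptions, not numbers from Anthropic's report.
INPUT_PRICE = 3.00 / 1_000_000          # $/token, hypothetical base input rate
CACHE_READ_PRICE = INPUT_PRICE * 0.10   # cached reads at ~10% of base input

def session_cost(turns: int, tokens_per_turn: int, cache_hit_rate: float) -> float:
    """Total input cost for a session where each turn re-reads the full context."""
    total = 0.0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn
        cached = int(context * cache_hit_rate)
        total += cached * CACHE_READ_PRICE + (context - cached) * INPUT_PRICE
    return total

print(f"no cache: ${session_cost(150, 2_000, 0.0):.2f}")   # ~$67.95
print(f"90% hits: ${session_cost(150, 2_000, 0.9):.2f}")   # ~$12.91, ~5.3x cheaper
```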

Compaction and Prefix Rules

To guarantee high cache hit rates, Anthropic enforces a strict prompt ordering rule. Static content must appear first, with dynamic content pushed to the end. System prompts, tool definitions like bash and file edit, and project documentation in CLAUDE.md must sit at the top of the context window.
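In API terms, that ordering maps onto prompt caching breakpoints. Below is a minimal sketch using Anthropic's Python SDK; the system text, tool schema, and model name are placeholders rather than Claude Code's actual internals.

```python
import anthropic

client = anthropic.Anthropic()

# Static content first: tools and the system prompt form the prefix, and
# the cache_control breakpoint on the last static block caches everything
# up to that point for subsequent turns.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=[
        {
            "name": "bash",
            "description": "Run a shell command.",
            "input_schema": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        }
    ],
    system=[
        {
            "type": "text",
            "text": "You are a coding agent. <project docs from CLAUDE.md here>",
            "cache_control": {"type": "ephemeral"},  # end of the static prefix
        }
    ],
    messages=[
        # Dynamic content goes last, after the cached prefix.
        {"role": "user", "content": "Fix the failing test in utils.py"}
    ],
)
```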

The engineering team treats drops in cache hit rate as service emergencies. Injecting a timestamp or adding a new tool mid-session instantly invalidates the prefix, raising per-turn costs by 5x to 10x.
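A schematic before-and-after makes the failure mode concrete; this is an illustration of the rule, not Anthropic's code:

```python
import time

# Anti-pattern: a dynamic value baked into the static prefix changes the
# prefix on every request, so nothing upstream of it can ever cache-hit.
system_bad = f"You are a coding agent. Current time: {time.ctime()}"

# Cache-friendly: the prefix stays byte-identical across turns, and any
# dynamic state rides in the final, uncached message instead.
system_good = "You are a coding agent."
latest_turn = {"role": "user", "content": f"(time: {time.ctime()}) Continue the task."}
```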

Managing a 1M+ token context window requires summarizing history without breaking the existing cache state. Claude Code appends a compaction instruction as a new message at the end of the existing sequence. This prevents a full cache miss and allows the summary generation itself to leverage the previously cached context.
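A hedged sketch of that pattern, assuming a plain messages-style API: the summarization request is appended rather than prepended, so generating the summary itself hits the existing cache. The prompt wording and the rebuild step are assumptions, not Claude Code's internals.

```python
COMPACT_PROMPT = "Summarize this session so far: open tasks, key decisions, file paths."

def compact(client, model, history):
    # Append the instruction as a new final message; everything before it
    # is an unchanged prefix, so summary generation reuses the cache.
    resp = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=history + [{"role": "user", "content": COMPACT_PROMPT}],
    )
    # Only after the summary exists is the context rebuilt around it.
    return [{"role": "user", "content": "Summary of earlier work:\n" + resp.content[0].text}]
```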

Economics and Performance Limits

Cached read tokens cost roughly 10% of standard input processing, and latency for cached requests drops by up to 85%: processing a 100K-token codebase falls from 11.5 seconds to 2.4 seconds when the prefix is preserved. Users can audit the hit rate directly in the terminal via the /cost command. For anyone trying to reduce LLM API costs across multi-agent systems, structuring prompts for maximum prefix reuse is the baseline requirement.
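Plugging the article's 100K-token example into per-request arithmetic, with a hypothetical $3-per-million base input rate:

```python
BASE = 3.00 / 1_000_000   # $/input token; hypothetical base rate
TOKENS = 100_000          # the article's 100K-token codebase example

print(f"cold prefix:   ${TOKENS * BASE:.3f}/request, ~11.5 s")        # full-price read
print(f"cached prefix: ${TOKENS * BASE * 0.10:.3f}/request, ~2.4 s")  # ~10% of input cost
```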

Cache state persistence relies on strict time-to-live (TTL) limits. Pro and pay-as-you-go developers operate with a 5-minute TTL, while Max plan users receive a 1-hour TTL. Developers pausing work must send heartbeat messages to reset the timer and prevent a full-context reload.
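A minimal keepalive sketch under those constraints; the ping content, safety margin, and threading model are assumptions, not a documented Claude Code mechanism.

```python
import threading

TTL_SECONDS = 5 * 60   # Pro / pay-as-you-go TTL from the article
MARGIN = 30            # assumed safety margin before expiry

def keep_cache_warm(client, model, history, stop: threading.Event):
    """Send a trivial turn before the TTL lapses so the prefix stays cached."""
    while not stop.wait(TTL_SECONDS - MARGIN):
        client.messages.create(
            model=model,
            max_tokens=1,
            messages=history + [{"role": "user", "content": "ping"}],
        )
```

Run it on a background thread while the user is idle, and set the event the moment real work resumes.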

Ecosystem Updates

The caching architecture helps stabilize the tool following the April 16 rollout of Claude Opus 4.7, which raised Claude Code's default reasoning effort to "xhigh" for software engineering tasks. Concurrently, Anthropic resolved a bug that aggressively cleared thinking history and reverted a system instruction that had inadvertently limited response length.

Version 2.1.122 transitioned the client to a native binary on April 29, dropping the bundled JavaScript execution model. The update also added support for Amazon Bedrock service tiers.

If you rely on context engineering for stateful applications, isolate your static system instructions from dynamic user inputs. Treat any mid-session parameter change as a full cache miss, and structure your loops to preserve prefix continuity throughout the execution.
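One way to honor both rules is an append-only loop: the static blocks are fixed at session start, and history is only ever extended, never rewritten. A sketch, with all names hypothetical:

```python
def run_session(client, model, system_blocks, tools, inputs):
    history = []
    for user_input in inputs:
        history.append({"role": "user", "content": user_input})
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_blocks,  # frozen at session start
            tools=tools,           # never extended mid-session
            messages=history,      # append-only: maximal shared prefix
        )
        history.append({"role": "assistant", "content": resp.content})
    return history
```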
