Prompt Caching: The Cost Math Most Teams Get Wrong

Prompt caching is not a 90% discount. It's a 90% discount on the static parts only. Here's how to actually compute your cache savings.

You enabled prompt caching. The dashboard shows "75% cache hit rate." You expected your bill to drop 75%. It dropped 12%.

This is normal. Prompt caching does not work the way most teams think. Here's what's actually happening, and how to design for real savings.

What prompt caching actually charges

Anthropic's pricing for cached vs. uncached input:

  • Cache write: 1.25x the base input cost (you pay extra to put it in the cache)
  • Cache read: 0.1x the base input cost (90% discount)
  • Output: unchanged (caching is input-only)

So a cached prompt isn't free. The first call to populate the cache costs 1.25x. Subsequent reads cost 0.1x. The cache lives 5 minutes by default, and each hit resets that timer. A longer 1-hour TTL is available, but its writes are billed at a higher rate (2x base input).

Where the math goes wrong

Most teams compute savings as:

"We have 75% cache hit rate, so we save 75% × 90% = 67.5%."

This treats every token as cacheable. Not every token is. Your prompt has two parts:

  1. The cacheable prefix — system prompt, tool definitions, retrieved context, examples
  2. The variable suffix — user message, conversation history

Only the prefix is cached. The suffix is always full price.

Real math:

total_cost = (prefix_tokens × cache_read_rate × hit_rate)
           + (prefix_tokens × cache_write_rate × (1 - hit_rate))
           + (suffix_tokens × base_input_rate)
           + (output_tokens × output_rate)

If your prefix is 1k tokens and your suffix is 5k tokens (typical for a chat with history), the suffix dominates. Caching saves nothing on it.
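To make that concrete, here is a minimal sketch of the formula as a Python function. The `Rates` class and both function names are made up for illustration, and the prices are the assumed Sonnet-class figures used in this post, not something to hard-code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Prices in dollars per million tokens (assumed figures, adjust to your model)."""
    base_input: float = 3.00
    output: float = 15.00
    cache_write_multiplier: float = 1.25
    cache_read_multiplier: float = 0.10


def expected_request_cost(prefix_tokens, suffix_tokens, output_tokens,
                          hit_rate, rates=Rates()):
    """Expected dollar cost of one request, term by term as in the formula above."""
    per_token = rates.base_input / 1_000_000
    read = prefix_tokens * per_token * rates.cache_read_multiplier * hit_rate
    write = prefix_tokens * per_token * rates.cache_write_multiplier * (1 - hit_rate)
    suffix = suffix_tokens * per_token
    output = output_tokens * rates.output / 1_000_000
    return read + write + suffix + output


def uncached_request_cost(prefix_tokens, suffix_tokens, output_tokens, rates=Rates()):
    """Dollar cost of the same request with caching switched off."""
    per_token = rates.base_input / 1_000_000
    return ((prefix_tokens + suffix_tokens) * per_token
            + output_tokens * rates.output / 1_000_000)


# 1k-token prefix, 5k-token suffix, 75% hit rate.
# Output tokens are set to 0 to isolate the input side of the bill.
cached = expected_request_cost(1_000, 5_000, 0, hit_rate=0.75)
uncached = uncached_request_cost(1_000, 5_000, 0)
print(f"input savings: {1 - cached / uncached:.1%}")  # input savings: 10.2%
```

A 75% hit rate on a prompt shaped like this buys roughly a 10% cut in input cost, which is much closer to the disappointing dashboard number than to 67.5%.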

Real example: an agent loop

A coding agent has:

  • System prompt: 800 tokens (cacheable)
  • Tool definitions: 2000 tokens (cacheable)
  • Retrieved file context: 8000 tokens (cacheable per turn — varies but stable for a few turns)
  • Conversation history: grows from 0 to 50k tokens
  • User message: 200 tokens

For Claude Sonnet 4.6 (~$3/MTok input, ~$15/MTok output):

Without caching, 10-turn conversation:

  • Per turn input: 800 + 2000 + 8000 + (history grows) + 200
  • Total input tokens across 10 turns: ~300k
  • Cost: $0.90 input + output

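With caching (cache prefix = system + tools + context = 10800 tokens, assuming the retrieved context stays unchanged for all 10 turns and every turn after the first hits the cache):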

  • Cache write on turn 1: 10800 × $3.75/MTok = $0.04
  • Cache read on turns 2-10: 9 × 10800 × $0.30/MTok = $0.029
  • Conversation history and user messages (uncached, the remaining ~192k of the ~300k tokens): ~$0.58
  • Total: ~$0.65 input, saves about 28%

Not 90%. Not 75%. A bit over a quarter. Still very worth it. But not what the marketing said.
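If you want to run this kind of estimate on your own workload, a turn-by-turn sketch like the one below may help. Everything in it is an assumption to adjust: the function name, the prices, the rule that the prefix is written once and read on every later turn, and especially the history schedule, which here grows linearly to 50k. Because of that schedule, its totals will not match the rounded figures above exactly.

```python
BASE_INPUT = 3.00 / 1_000_000      # dollars per input token (assumed price)
CACHE_WRITE = BASE_INPUT * 1.25
CACHE_READ = BASE_INPUT * 0.10


def conversation_input_cost(prefix_tokens, user_tokens, history_by_turn, cached):
    """Total input cost over a conversation; output tokens are ignored.

    Assumes the prefix is written on turn 1 and read on every later turn
    (no TTL expiry, no prefix changes), and that history is never cached.
    """
    total = 0.0
    for turn, history_tokens in enumerate(history_by_turn):
        if not cached:
            prefix_cost = prefix_tokens * BASE_INPUT
        elif turn == 0:
            prefix_cost = prefix_tokens * CACHE_WRITE
        else:
            prefix_cost = prefix_tokens * CACHE_READ
        total += prefix_cost + (history_tokens + user_tokens) * BASE_INPUT
    return total


# Hypothetical schedule: history grows linearly from 0 to 50k over 10 turns.
history = [round(50_000 * turn / 9) for turn in range(10)]

without = conversation_input_cost(10_800, 200, history, cached=False)
with_cache = conversation_input_cost(10_800, 200, history, cached=True)
print(f"without: ${without:.2f}  with: ${with_cache:.2f}  "
      f"saved: {1 - with_cache / without:.0%}")
```

Swapping in a logged list of per-turn history sizes from a real session is the quickest way to see what caching is worth for your own agent.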

How to maximize caching ROI

1. Cache aggressively at the front. Put everything stable into the cacheable prefix. System prompt, tools, examples, retrieved docs that don't change in this session.

2. Order matters. Caching is prefix-based. The cache hit only works if everything up to the cache breakpoint is byte-identical. One whitespace change invalidates it.

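# Anthropic's Messages API takes the system prompt as a top-level `system`
# parameter (a list of content blocks), not as a message with role "system".

# WRONG - dynamic content inside the cached prefix
system = [
    {"type": "text", "text": f"Today is {date}. {SYSTEM}",
     "cache_control": {"type": "ephemeral"}},
]
messages = [{"role": "user", "content": f"Help with: {question}"}]

# RIGHT - static prefix, dynamic at the end
system = [
    {"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}},
]
messages = [
    {"role": "user", "content": f"Help with: {question}\n\n(Today: {date})"},
]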

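3. Use multiple cache breakpoints. Anthropic supports up to 4 breakpoints per request. The cached prefix is assembled in the order tools, system, messages, so place breakpoints in that order: one after tools, one after system, one after retrieved docs. Even partial cache hits save money.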
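As a rough sketch of what three breakpoints can look like, the helper below builds the request arguments as plain dictionaries. `build_request` and its four parameters are hypothetical names for this example. The `cache_control` placement follows Anthropic's documented block format, but treat the overall layout as one reasonable arrangement rather than the only one.

```python
def build_request(tool_defs, system_text, retrieved_docs, question):
    """Assemble Messages API keyword arguments with three cache breakpoints.

    All four parameters are hypothetical inputs supplied by the caller.
    """
    ephemeral = {"type": "ephemeral"}

    # Breakpoint 1: mark the last tool definition, which covers every tool before it.
    tools = [dict(tool) for tool in tool_defs]  # shallow copies; caller's dicts stay untouched
    if tools:
        tools[-1]["cache_control"] = ephemeral

    # Breakpoint 2: the system prompt, sent as a list of content blocks.
    system = [{"type": "text", "text": system_text, "cache_control": ephemeral}]

    # Breakpoint 3: retrieved docs lead the user turn; the question follows uncached.
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": retrieved_docs, "cache_control": ephemeral},
            {"type": "text", "text": question},
        ],
    }]
    return {"tools": tools, "system": system, "messages": messages}


request = build_request(
    tool_defs=[{"name": "read_file", "description": "Read a file from the repo.",
                "input_schema": {"type": "object", "properties": {}}}],
    system_text="You are a coding agent.",
    retrieved_docs="<contents of the files retrieved for this session>",
    question="Why does the build fail?",
)
```

With the `anthropic` Python SDK, a dictionary like this could be passed as `client.messages.create(model=MODEL, max_tokens=1024, **request)`, where `MODEL` is whichever model ID you use. If only the retrieved docs change between turns, the tools and system breakpoints can still hit.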

4. Don't cache what you won't reuse. A 50k-token document that you only use once isn't worth caching: you'll pay 1.25x for the write and never read it back.
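The break-even point is easy to check with the multipliers from earlier. This small sketch (the function name is made up, and it assumes every reuse lands inside the TTL) expresses cost in units of one uncached send of the same block.

```python
def relative_cost(uses, cached):
    """Input cost of sending the same block `uses` times, in units of one uncached send."""
    if not cached:
        return float(uses)
    # One write at 1.25x, then every later use is a read at 0.1x.
    return 1.25 + 0.10 * (uses - 1)


for uses in (1, 2, 5):
    print(uses, relative_cost(uses, cached=False), round(relative_cost(uses, cached=True), 2))
# 1 1.0 1.25
# 2 2.0 1.35
# 5 5.0 1.65
```

A single use loses 25%. From the second use onward caching is ahead, so the real question is whether a second request will arrive before the cache expires.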

5. Watch the 5-minute TTL. If your traffic is bursty, the cache expires between bursts. Either keep traffic warm or pay for extended TTL.
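To get a feel for how bursty traffic erodes the hit rate, you can replay request timestamps against the TTL. The sketch below is a simplification: `cache_hits` is a hypothetical helper, it tracks a single prefix, and it assumes the timer restarts on every request (a hit refreshes the entry, a miss rewrites it).

```python
def cache_hits(request_times, ttl_seconds=300.0):
    """Return one bool per request: True if it would find a warm cache.

    Assumes timestamps are in seconds, sorted ascending, and that every
    request restarts the TTL for the single prefix being tracked.
    """
    hits = []
    last_request = None
    for now in request_times:
        hits.append(last_request is not None and now - last_request <= ttl_seconds)
        last_request = now
    return hits


# Hypothetical traffic: two bursts of three requests, ten minutes apart.
times = [0, 20, 40, 600, 620, 640]
hits = cache_hits(times)
print(f"hit rate: {sum(hits) / len(hits):.0%}")  # hit rate: 67%
```

Each burst pays for a fresh write. Feeding in real request logs shows whether a keep-warm ping or a longer TTL would pay for itself.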

When caching actually delivers 90%

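Single-turn batch jobs over the same context. Example: classifying 10k documents using the same 10k-token system prompt, with requests sent close enough together that the cache never expires. Counting only that shared prompt:

  • First request: $0.04 cache write
  • Next 9999 requests: $0.003 cache read each
  • Total: ~$30 instead of ~$300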

This is the use case that gets the marketing numbers.

The takeaway

Prompt caching is essential. But it's not the 90% discount it sounds like. Estimate your actual input savings (a slight overestimate, since it ignores the 1.25x write premium on misses):

savings_ratio ≈ (prefix_tokens / total_input_tokens) × 0.9 × cache_hit_rate

For most agent loops, that's 30-60%. Architect your prompts to push that as high as you can. And don't tell the CFO you'll save 90% — you won't.
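For a quick sanity check before that conversation, the sketch below compares the rule of thumb with a version that also charges cache misses at the write rate. Both function names and the workload numbers are illustrative assumptions.

```python
def approx_savings(prefix_tokens, total_input_tokens, hit_rate):
    """The rule-of-thumb ratio: prefix share x 0.9 x hit rate."""
    return prefix_tokens / total_input_tokens * 0.9 * hit_rate


def exact_savings(prefix_tokens, total_input_tokens, hit_rate):
    """Same ratio, but charging misses the 1.25x cache-write rate."""
    cached_multiplier = 0.10 * hit_rate + 1.25 * (1 - hit_rate)
    return prefix_tokens / total_input_tokens * (1 - cached_multiplier)


# Hypothetical workload: 10k-token prefix out of 30k input tokens per request.
for hit_rate in (0.5, 0.75, 0.9):
    print(hit_rate,
          round(approx_savings(10_000, 30_000, hit_rate), 3),
          round(exact_savings(10_000, 30_000, hit_rate), 3))
# 0.5 0.15 0.108
# 0.75 0.225 0.204
# 0.9 0.27 0.262
```

The two estimates converge as the hit rate rises. At low hit rates the write premium takes a noticeable bite, which is one more reason to keep the cache warm.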
