I Tested LLM Prompt Caching With Anthropic and OpenAI

I spent $1.00 and a Saturday morning testing prompt caching with Anthropic and OpenAI. Here’s what actually happened when I measured cache hits, token counts, and real costs.

The Setup

Prompt caching lets you reuse parts of your LLM prompts across multiple API calls. Instead of the model reprocessing the same 10,000-token document every time, it caches the internal computation and retrieves it on subsequent calls. The promise: 50-90% cost savings.

Anthropic and OpenAI handle this very differently:

  • Anthropic Claude: Explicit cache control markers (you decide what to cache)
  • OpenAI GPT: Completely automatic (zero configuration)

I wanted to know which one actually delivered.

How Caching Actually Works

Before diving into the tests, here’s what’s happening under the hood.

When a transformer processes text, it computes three things for each token: queries (Q), keys (K), and values (V). The attention mechanism uses these to figure out which tokens to pay attention to.

The expensive part: for every new token, the model has to compute attention against all previous tokens. For a 10,000-token prompt, that’s 10,000 key-value computations. Generate 100 tokens? You’re recomputing those same 10,000 K-V pairs 100 times.

This is wasteful. The keys and values for your prompt don’t change between calls.

KV caching stores those key-value pairs so they don’t need to be recomputed. Instead of:

Token 1: Compute K-V for token 1
Token 2: Compute K-V for tokens 1, 2
Token 3: Compute K-V for tokens 1, 2, 3
...

You do:

Token 1: Compute K-V for token 1, cache it
Token 2: Retrieve cached K-V for token 1, compute K-V for token 2, cache it
Token 3: Retrieve cached K-V for tokens 1-2, compute K-V for token 3, cache it
...

The math: Without caching, generating N tokens requires O(N²) key-value computations. With caching, it’s O(N). For long sequences, this is huge.
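
A few lines of arithmetic make the gap concrete. This is pure bookkeeping, not a real model; the 10,000 and 100 are the numbers from the example above:

PROMPT_TOKENS = 10_000
NEW_TOKENS = 100

# Without a KV cache: every generation step recomputes K-V for the entire prefix so far.
no_cache = sum(PROMPT_TOKENS + i for i in range(NEW_TOKENS))

# With a KV cache: each token's K-V pair is computed exactly once.
with_cache = PROMPT_TOKENS + NEW_TOKENS

print(f"no cache:   {no_cache:,} K-V computations")    # 1,004,950
print(f"with cache: {with_cache:,} K-V computations")  # 10,100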

Prompt caching takes this further. Instead of caching just within one request, it caches the K-V pairs across API calls. Your 10,000-token document? Compute its K-V pairs once, store them server-side, reuse them for every subsequent request.

The cost savings come from skipping this computation. LLM pricing is usually per-token processed. If the model doesn’t need to reprocess those tokens (just retrieves their cached K-V pairs), you pay less.

The catch: the cache is in memory. Storage isn’t free. That’s why there are TTLs (time-to-live) and minimum token thresholds.

Anthropic Claude: The Control Freak

Anthropic gives you the most control. You explicitly mark what to cache using cache_control parameters:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,  # required by the Messages API
    system=[{
        "type": "text",
        "text": large_document,  # 6,000+ tokens, loaded elsewhere
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "Summarize this"}]
)
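
To check what actually happened on each call, I read the cache counters off the usage object on the response (field names as in Anthropic's prompt caching docs):

usage = response.usage
print("written to cache:", usage.cache_creation_input_tokens)
print("read from cache: ", usage.cache_read_input_tokens)
print("regular input:   ", usage.input_tokens)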

First request (cache creation):

  • Cache created: 6,313 tokens
  • Cost: $0.023674 (includes 25% write surcharge)
  • Latency: 4.71 seconds

Second request (cache hit):

  • Cache retrieved: 6,313 tokens
  • Cost: $0.001894 (90% discount!)
  • Latency: 5.48 seconds

Wait, the cached version was slower? Network variance. Don’t count on caching for speed improvements. The real win is cost.

The 90% savings is real though. Cache reads cost $0.30 per million tokens vs $3.00 for regular input. There’s a 25% surcharge on cache writes ($3.75/M), but the read discount is so much larger that a single cache reuse already puts you ahead (the sketch below runs the numbers).
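
Here's the break-even math with those per-million-token rates (a quick sketch: output tokens are ignored, and the 100,000-token prefix is just an example size):

MTOK = 1_000_000
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # $ per million tokens

prefix = 100_000  # tokens in the cached prefix

def cached_cost(reuses):
    # One cache write, then one cache read per reuse.
    return (prefix * CACHE_WRITE + reuses * prefix * CACHE_READ) / MTOK

def uncached_cost(reuses):
    # Reprocess the full prefix every time.
    return (1 + reuses) * prefix * INPUT / MTOK

for reuses in (0, 1, 2, 5):
    print(reuses, round(cached_cost(reuses), 3), round(uncached_cost(reuses), 3))
# 0 reuses: $0.375 vs $0.300 (caching loses); 1 reuse: $0.405 vs $0.600 (caching wins)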

The Invalidation Test

I modified one character in the cached content. That invalidated the cache instantly: a new one was created, and I was charged the write surcharge again. Token-level exact matching is strict.

Multiple Cache Breakpoints

Anthropic supports up to 4 cache boundaries per request. I tested two:

system=[
    {"type": "text", "text": "System instructions",
     "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "Large document",
     "cache_control": {"type": "ephemeral"}}
]

It worked perfectly. You can cache your system prompt separately from your document context, which is great for RAG applications where the system stays constant but documents change.
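
As a sketch of that RAG pattern (ask_about and load_system_prompt are made-up helpers, client is the one set up earlier, and both blocks need to clear the minimum token threshold to actually cache):

STABLE_SYSTEM = load_system_prompt()  # identical on every call, so its cache keeps hitting

def ask_about(document_text: str, question: str):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {"type": "text", "text": STABLE_SYSTEM,
             "cache_control": {"type": "ephemeral"}},  # breakpoint 1: survives document swaps
            {"type": "text", "text": document_text,
             "cache_control": {"type": "ephemeral"}},  # breakpoint 2: per-document cache
        ],
        messages=[{"role": "user", "content": question}],
    )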

OpenAI GPT: The Invisible Helper

OpenAI’s approach is radically different: no configuration at all.

Send a prompt over 1,024 tokens, and caching just happens automatically. The API reports cache hits via usage.prompt_tokens_details.cached_tokens.
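
Here's roughly how I read that counter with the Python SDK (long_system_prompt is a stand-in for whatever 1,024+ token prefix you're reusing):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix, 1,024+ tokens
        {"role": "user", "content": "Summarize the document above."},
    ],
)

print("cached tokens:", response.usage.prompt_tokens_details.cached_tokens)  # 0 on a miss, nonzero on a hit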

Test with ~4,500 tokens:

First request:

  • No explicit cache indication
  • Response time: 2.17 seconds

Second request (30 seconds later):

  • Still no cached tokens reported
  • Response time: 1.85 seconds (15% faster)

The speed improvement suggests caching worked, but the metrics didn’t confirm it. This happened several times in my tests.

Test with different prompt, same prefix:

I sent a new query with the same 1,900-token system prompt:

  • Cached: 1,920 tokens reported
  • 50% discount confirmed
  • Cache persisted after 30+ seconds

The caching worked, but the observability was inconsistent. Sometimes I’d get explicit cache metrics, sometimes I wouldn’t, even when latency improved.

The Prefix Flexibility Surprise

OpenAI’s docs say caching requires an exact prefix match. In practice, that turned out to be more forgiving than I expected.

I appended content to a previously cached prompt, expecting a cache miss. Instead: 1,920 tokens cached. The system matches on the longest shared prefix, so the whole prompt doesn’t have to be identical, only the leading section does.

This is actually useful, but it wasn’t obvious (to me at least) from the documentation.

The Cost Comparison

Let’s say you process 100,000 tokens of context with 1,000 tokens of unique query, repeated 10 times:

Anthropic Claude Sonnet 4.5:

  • First request: $0.39 (cache write + query)
  • Next 9 requests: $0.04 each
  • Total: $0.75
  • Savings vs no caching: 75.9%

OpenAI GPT-4o:

  • First request: $0.26 (no write surcharge!)
  • Next 9 requests: $0.13 each
  • Total: $1.45
  • Savings vs no caching: 53.4%

Anthropic is cheaper per cached call but carries the write surcharge. OpenAI costs more per call but has simpler pricing.
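
If you want to plug in your own volumes, here's the rough calculator I'd use (the list prices are hard-coded assumptions and output tokens are ignored, so it lands a bit under my measured totals above):

MTOK = 1_000_000

def anthropic_cost(context, query, calls,
                   input_rate=3.00, write_rate=3.75, read_rate=0.30):
    # First call writes the cache; the rest read from it. Query tokens are never cached.
    first = (context * write_rate + query * input_rate) / MTOK
    rest = (calls - 1) * (context * read_rate + query * input_rate) / MTOK
    return first + rest

def openai_cost(context, query, calls,
                input_rate=2.50, cached_rate=1.25):
    # No write surcharge; the cached prefix is billed at the 50% rate on later calls.
    first = (context + query) * input_rate / MTOK
    rest = (calls - 1) * (context * cached_rate + query * input_rate) / MTOK
    return first + rest

print(f"Anthropic: ${anthropic_cost(100_000, 1_000, 10):.2f}")  # ~$0.68
print(f"OpenAI:    ${openai_cost(100_000, 1_000, 10):.2f}")     # ~$1.40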

What Actually Matters

After running these tests, here’s what I learned:

1. Cache Hit Rate Is Everything

The theoretical savings don’t matter if you’re not getting cache hits. Design your prompts with stable prefixes:

Good:

system_prompt = "You are an expert..."  # Static
document = load_document()              # Static
user_query = get_user_input()          # Dynamic

Bad:

prompt = f"Today is {datetime.now()}..."  # Changes every call

2. Minimum Token Thresholds Are Real

Both providers require minimums:

  • Anthropic: 1,024 tokens (Claude Sonnet 4.5)
  • OpenAI: 1,024 tokens

Small prompts don’t cache. Period.

3. Don’t Expect Speed Improvements

My Anthropic cache hit was actually slower than the cache miss (5.48s vs 4.71s). Network variance is real. The primary benefit is cost reduction, not latency.

4. TTL Is Shorter Than You Think

Default cache durations:

  • Anthropic: 5 minutes
  • OpenAI: 5-10 minutes

If your use case has queries more than 10 minutes apart, you might not benefit from caching at all.

Which One Should You Use?

Choose Anthropic if:

  • You want maximum cost savings (90% on cache reads)
  • You have complex prompts with multiple stable sections
  • You’re willing to add explicit cache markers
  • You’ll reuse caches 2+ times (to overcome write surcharge)

Choose OpenAI if:

  • You want zero configuration
  • You’re prototyping or building MVPs
  • 50% savings is good enough
  • You value simplicity over control

A Note on Google Gemini

Google also offers caching with two modes (explicit and implicit), longer TTLs (up to 60 minutes), and multi-modal support (images, PDFs, video). They claim very low costs but I haven’t tested it yet. If you’re considering Gemini, run your own experiments! The documentation looks promising but always verify these things.

The Bottom Line

Prompt caching works and saves real money.

Both providers require careful prompt design to maximize cache hits. And don’t expect speed improvements; cost reduction is the real win.

For prototyping, I’d use OpenAI’s automatic caching. For high-volume production with stable prompts, Anthropic’s granular control and 90% savings are worth the added complexity.

Just don’t forget those minimum token thresholds. And monitor your cache hit rates in production: theoretical savings mean nothing if you’re not actually getting cache reuse.
