Summary
Audit and optimize context assembly order to maximize LLM provider cache hits, potentially achieving up to 4x cost reduction through prompt caching.
Background: State of the Art
From Philipp Schmid's 5 Practical Tips for Context Engineering:
"Context Ordering Matters: Try to use 'append-only' context, adding new information to the end. This maximizes cache hits reducing cost (4x) and latency."
LLM providers (Anthropic, OpenAI) implement prompt caching where repeated prefixes are cached. If your context window looks like:
[System Prompt] + [Project Context] + [Task History] + [Current Request]
And only [Current Request] changes between calls, the prefix can be cached. But if you reorder or modify earlier sections, the cache is invalidated.
Key principle: Static content first, dynamic content last.
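The layout above can be sketched as a request builder. This is a minimal sketch, assuming Anthropic's Messages API shape with a `cache_control` breakpoint on the last static block; the prompt text and segment contents are illustrative placeholders, not CodeFRAME's actual prompts.

```python
def build_request(task_history: list[dict], current_request: str) -> dict:
    """Assemble a prompt with a stable, cacheable prefix and a dynamic tail."""
    static_system = [
        {
            "type": "text",
            "text": "You are CodeFRAME's coding agent.",  # stable system prompt
        },
        {
            "type": "text",
            "text": "Project context: ...",  # stable project context
            # Cache breakpoint: everything up to and including this block
            # can be served from the provider's prompt cache on later calls.
            "cache_control": {"type": "ephemeral"},
        },
    ]
    # Dynamic content is only ever appended after the cached prefix,
    # never inserted before it, so the prefix stays byte-identical.
    messages = task_history + [{"role": "user", "content": current_request}]
    return {"system": static_system, "messages": messages}

req_a = build_request([], "Fix the failing test")
req_b = build_request([], "Refactor the parser")
# The static prefix is identical across calls; only the tail differs.
assert req_a["system"] == req_b["system"]
```

Because the cached prefix must match exactly, even a one-character change to the system prompt or project context invalidates it.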
Current State in CodeFRAME
The tiered memory system assembles context, but the ordering of that assembly is unclear:
- Is system prompt consistently first?
- Does tier promotion/demotion cause reordering?
- Are tool definitions stable in position?
- Is task-specific context appended at the end?
With Claude API's prompt caching (available since late 2024), improper ordering directly impacts costs.
Investigation Tasks
- Map current context assembly order
  - Document the exact sequence: system prompt → X → Y → Z → user message
  - Identify which components are static vs. dynamic per call
  - Check whether tier changes cause mid-context insertions
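One way to map the assembly order is to fingerprint each named context segment per call, as a minimal sketch; the segment names here are illustrative and would map to CodeFRAME's real tiers.

```python
import hashlib

def fingerprint_segments(segments: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Hash each named context segment so consecutive calls can be compared.

    `segments` is an ordered list of (name, text) pairs, e.g.
    [("system", ...), ("project_context", ...), ("user", ...)].
    Identical hashes across calls mark the stable, cacheable region.
    """
    return [(name, hashlib.sha256(text.encode()).hexdigest()[:12])
            for name, text in segments]

call_1 = fingerprint_segments([("system", "prompt v1"), ("user", "task A")])
call_2 = fingerprint_segments([("system", "prompt v1"), ("user", "task B")])
# Stable segments hash identically; only the trailing segment differs.
assert call_1[0] == call_2[0] and call_1[1] != call_2[1]
```

Logging these fingerprints over a session makes mid-context insertions visible as a hash change anywhere before the tail.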
- Identify cache-breaking patterns
  - Log context assembly across multiple agent calls in a session
  - Diff consecutive prompts to find what changes, and where
  - Quantify how much of the prefix is stable vs. changing
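Quantifying the stable prefix can be as simple as measuring the longest common prefix of consecutive serialized prompts; a sketch (prompt strings here are dummies):

```python
def stable_prefix_ratio(prev: str, curr: str) -> float:
    """Fraction of the current prompt that matches the previous prompt's prefix."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n / len(curr) if curr else 1.0

prev_prompt = "SYSTEM...CONTEXT...task A"
curr_prompt = "SYSTEM...CONTEXT...task B"
# Most of the prompt is a stable, cacheable prefix; only the task differs.
assert stable_prefix_ratio(prev_prompt, curr_prompt) > 0.9
```

A ratio far below 1.0 on calls that share the same system prompt is a strong signal of a cache-breaking reorder or mid-context edit.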
- Implement append-only assembly
  - Restructure the context builder to enforce: static_prefix + append_only_dynamic
  - Move all changing content to the end of the context
  - Ensure tool definitions don't shift position
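The enforcement above could look like the following sketch: a builder that freezes the static prefix and rejects mutations to it. Class and method names are hypothetical, not CodeFRAME's actual API.

```python
class AppendOnlyContextBuilder:
    """Context builder that enforces static-prefix-then-append-only."""

    def __init__(self, static_prefix: list[str]):
        # System prompt, tool definitions, project context, etc.
        self._segments = list(static_prefix)
        self._frozen_len = len(self._segments)

    def append(self, segment: str) -> None:
        """Dynamic content may only be added at the end."""
        self._segments.append(segment)

    def replace(self, index: int, segment: str) -> None:
        """Editing the frozen prefix would invalidate the provider cache."""
        if index < self._frozen_len:
            raise ValueError("cannot modify the frozen static prefix")
        self._segments[index] = segment

    def render(self) -> str:
        return "\n".join(self._segments)

builder = AppendOnlyContextBuilder(["SYSTEM PROMPT", "TOOL DEFS"])
builder.append("task: fix bug")
try:
    builder.replace(0, "edited system prompt")
except ValueError:
    pass  # mutation of the cached prefix is rejected
```

Raising on prefix mutation turns a silent cost regression into an immediate test failure.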
- Measure cache hit rates
  - Enable cache metrics from the Anthropic API (if using Claude)
  - Compare before/after cost and latency
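For measurement, Anthropic's Messages API usage object reports `cache_read_input_tokens` (prefix served from cache) and `cache_creation_input_tokens` (prefix written to cache) alongside plain `input_tokens`. A sketch of deriving a per-call hit rate from that payload (the example values are illustrative):

```python
def cache_hit_rate(usage: dict) -> float:
    """Share of input tokens served from cache for one API call."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (usage["input_tokens"]
             + read
             + usage.get("cache_creation_input_tokens", 0))
    return read / total if total else 0.0

# Payload shaped like the API's usage object (values illustrative):
usage = {"input_tokens": 500,
         "cache_read_input_tokens": 9500,
         "cache_creation_input_tokens": 0}
assert cache_hit_rate(usage) == 0.95
```

Aggregating this per session, before and after the refactor, gives the improvement number the success criteria ask for.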
Success Criteria
- Documented context assembly sequence
- Identified cache-breaking patterns in current implementation
- Refactored to append-only pattern (if beneficial)
- Measured improvement in cache hit rate and cost reduction
Cost Impact Estimate
If CodeFRAME averages 50 LLM calls per task with 10K tokens of static context:
- Without caching: 50 × 10K = 500K input tokens billed at the base rate
- With caching: the 10K-token prefix is written once, then the remaining 49 calls read it at the discounted cache-read rate — potentially a 75%+ reduction in input-token cost
- At Claude Sonnet rates, that compounds into significant savings at scale
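The estimate above can be made concrete with a back-of-envelope model. The rates below are assumptions based on Anthropic's published Claude 3.5 Sonnet pricing (base input $3/MTok, cache write $3.75/MTok, cache read $0.30/MTok); verify against current pricing before relying on the numbers.

```python
# Cost model for the 50-call / 10K-static-token scenario.
CALLS, STATIC_TOKENS = 50, 10_000
BASE, WRITE, READ = 3.00, 3.75, 0.30  # $ per million input tokens (assumed)

without_cache = CALLS * STATIC_TOKENS * BASE / 1e6
with_cache = (STATIC_TOKENS * WRITE              # first call writes the cache
              + (CALLS - 1) * STATIC_TOKENS * READ) / 1e6
savings = 1 - with_cache / without_cache
print(f"${without_cache:.2f} -> ${with_cache:.2f} ({savings:.0%} saved)")
```

Under these assumed rates the static-context portion drops from $1.50 to roughly $0.18 per task, consistent with (and somewhat better than) the 75% figure above; real savings depend on cache TTL and how often the prefix actually stays stable.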
References
- Anthropic Prompt Caching Documentation
- Context Engineering Tips - Philipp Schmid
- CodeFRAME cost tracking already exists - leverage for measurement