[Phase 4] Optimize Context Assembly Order for LLM Cache Hits #63

@frankbria

Description

Summary

Audit and optimize context assembly order to maximize LLM provider cache hits, potentially achieving up to 4x cost reduction through prompt caching.

Background: State of the Art

From Philipp Schmid's 5 Practical Tips for Context Engineering:

"Context Ordering Matters: Try to use 'append-only' context, adding new information to the end. This maximizes cache hits reducing cost (4x) and latency."

LLM providers (Anthropic, OpenAI) implement prompt caching where repeated prefixes are cached. If your context window looks like:

[System Prompt] + [Project Context] + [Task History] + [Current Request]

And only [Current Request] changes between calls, the prefix can be cached. But if you reorder or modify earlier sections, the cache is invalidated.

Key principle: Static content first, dynamic content last.
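A minimal sketch of the principle (the `build_prompt` function and its section inputs are hypothetical illustrations, not CodeFRAME's actual builder):

```python
# Hypothetical sketch: keep stable sections in a fixed order so every call
# shares the longest possible byte-identical prefix with the previous one.

def build_prompt(system_prompt: str, project_context: str,
                 task_history: list[str], current_request: str) -> str:
    """Assemble context append-only: static prefix first, volatile tail last."""
    static_prefix = system_prompt + "\n\n" + project_context  # never reordered
    history = "\n".join(task_history)                         # grows by appending only
    return static_prefix + "\n\n" + history + "\n\n" + current_request

# Two consecutive calls in a session share the entire static prefix,
# plus all history from the earlier call:
a = build_prompt("SYS", "PROJECT", ["step 1"], "do X")
b = build_prompt("SYS", "PROJECT", ["step 1", "step 2"], "do Y")
assert b.startswith("SYS\n\nPROJECT\n\nstep 1")  # cacheable prefix preserved
```

Anything that edits or reorders content before the tail (for example, rewriting `project_context` mid-session) resets the shared prefix to the point of the change.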

Current State in CodeFRAME

The tiered memory system assembles context, but the ordering of that assembly is unclear:

  • Is system prompt consistently first?
  • Does tier promotion/demotion cause reordering?
  • Are tool definitions stable in position?
  • Is task-specific context appended at the end?

With Claude API's prompt caching (available since late 2024), improper ordering directly impacts costs.

Investigation Tasks

  1. Map current context assembly order

    • Document the exact sequence: system prompt → X → Y → Z → user message
    • Identify what components are static vs. dynamic per-call
    • Check if tier changes cause mid-context insertions
  2. Identify cache-breaking patterns

    • Log context assembly across multiple agent calls in a session
    • Diff consecutive prompts to find what's changing and where
    • Quantify how much prefix is stable vs. changing
  3. Implement append-only assembly

    • Restructure context builder to enforce: static_prefix + append_only_dynamic
    • Move all changing content to end of context
    • Ensure tool definitions don't shift position
  4. Measure cache hit rates

    • Enable cache metrics from Anthropic API (if using Claude)
    • Compare before/after cost and latency
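Task 2's "quantify how much prefix is stable" can be sketched as a longest-common-prefix measurement over consecutive assembled prompts (a hedged illustration; the sample prompt strings are invented):

```python
def stable_prefix_fraction(prev_prompt: str, next_prompt: str) -> float:
    """Fraction of the new prompt that is a byte-identical prefix of the old one.

    A low value between consecutive calls in the same session signals a
    cache-breaking reorder or mid-context insertion."""
    n = 0
    for a, b in zip(prev_prompt, next_prompt):
        if a != b:
            break
        n += 1
    return n / len(next_prompt) if next_prompt else 1.0

# Example: a mid-context edit destroys most of the shared prefix,
# while an append-only change keeps it.
good = stable_prefix_fraction("SYS CTX hist1 reqA", "SYS CTX hist1 hist2 reqB")
bad = stable_prefix_fraction("SYS CTX hist1 reqA", "SYS NEWCTX hist1 reqB")
assert good > bad
```

Logging this fraction across a session, alongside a plain text diff of consecutive prompts, pinpoints exactly which component is invalidating the cache.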

Success Criteria

  • Documented context assembly sequence
  • Identified cache-breaking patterns in current implementation
  • Refactored to append-only pattern (if beneficial)
  • Measured improvement in cache hit rate and cost reduction
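For the cache-hit-rate criterion: when caching is enabled, each Anthropic Messages API response reports `cache_read_input_tokens` and `cache_creation_input_tokens` in its `usage` block. A sketch of aggregating those into a session-level hit rate (the session numbers below are illustrative only):

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Share of input tokens served from cache across a session.

    Each dict mirrors the Anthropic `usage` object: input_tokens,
    cache_read_input_tokens, cache_creation_input_tokens."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    total = sum(u.get("input_tokens", 0)
                + u.get("cache_read_input_tokens", 0)
                + u.get("cache_creation_input_tokens", 0) for u in usages)
    return read / total if total else 0.0

# Illustrative session: first call writes the cache, later calls read it.
session = [
    {"input_tokens": 500, "cache_creation_input_tokens": 10_000},
    {"input_tokens": 600, "cache_read_input_tokens": 10_000},
    {"input_tokens": 700, "cache_read_input_tokens": 10_000},
]
rate = cache_hit_rate(session)  # 20_000 / 31_800
```

Tracking this before and after the append-only refactor gives a direct success metric.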

Cost Impact Estimate

If CodeFRAME averages 50 LLM calls per task with 10K tokens of static context:

  • Without caching: 50 × 10K = 500K input tokens billed
  • With caching: the 10K static prefix is billed in full once (as a cache write), and the remaining 49 calls read it at the discounted cached rate, for potentially a 75%+ reduction
  • At Claude Sonnet rates: significant $ savings at scale
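Working the estimate with Anthropic's published multipliers (cache writes billed at 1.25x the base input price, cache reads at 0.1x), in arbitrary price units for illustration:

```python
BASE = 1.0        # base input-token price (arbitrary unit per token)
CALLS = 50        # LLM calls per task, from the estimate above
STATIC = 10_000   # stable-prefix tokens per call

# Without caching, the full static prefix is billed on every call.
without = CALLS * STATIC * BASE  # 500K token-units billed

# With caching: one cache write at 1.25x, then 49 reads at 0.1x.
with_cache = STATIC * 1.25 * BASE + (CALLS - 1) * STATIC * 0.10 * BASE

reduction = 1 - with_cache / without  # ~88% on the static-prefix portion
```

Under these assumptions the saving on static-prefix tokens is roughly 88%, consistent with the issue's more conservative 75% figure once non-cacheable dynamic tokens are counted.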

Metadata


Labels

  • Future (Deferred: beyond v1/v2 scope, consider for future versions)
  • context-engineering (Context window management and optimization)
  • enhancement (New feature or request)
  • monitoring
  • priority:high
