
Wire Anthropic prompt caching for system + task+plan messages #433

@lmorchard

Current state

Pilo's main action-loop LLM call (webAgent.ts:874-989) invokes streamText from the Vercel AI SDK with no provider-specific cache markers:

const streamResult = streamText({
  ...this.providerConfig,
  messages: this.messages,
  tools: webActionTools,
  toolChoice: "required",
  maxOutputTokens: DEFAULT_GENERATION_MAX_TOKENS,
  abortSignal: this.abortSignal,
});

The messages array contains, in this order:

  1. The system prompt (built by buildActionLoopSystemPrompt) — ~3000-4000 tokens including tool examples and best practices
  2. The task+plan user message (built by buildTaskAndPlanPrompt) — ~500-1500 tokens
  3. Per-step snapshot user messages, assistant turns, tool results, error feedback, validation feedback (the conversation)

For a 50-iteration task on Claude with no caching, the system prompt + task+plan messages (positions 1 and 2) are billed at full input rate 50 times.

Anthropic supports prompt caching via cache_control: { type: "ephemeral" } markers on individual content parts. The Vercel AI SDK surfaces this through providerOptions.anthropic on individual messages. Cached tokens are billed at ~10% of normal input cost on hit (default 5-minute TTL).
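
For reference, the SDK marker maps to a cache_control field on a content block in the raw Anthropic Messages API. A minimal sketch using @anthropic-ai/sdk directly (illustrative only, not Pilo code; the model ID and prompt variables are placeholders):

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const systemPrompt = "..."; // ~4000-token action-loop system prompt
const taskAndPlan = "...";  // task + plan user message

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    // Everything up to and including a cache_control block becomes the cached prefix.
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ],
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: taskAndPlan, cache_control: { type: "ephemeral" } },
      ],
    },
  ],
});

console.log(response.usage); // includes cache_creation_input_tokens / cache_read_input_tokens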

The gap

For Claude-based runs, Pilo currently pays full input cost on tokens that are stable across the entire run. On a long task with a 4000-token system prompt and 50 iterations, that's 200,000 tokens billed that could mostly be cache hits.

OpenAI's prompt caching is automatic (no markers needed) and already applies. Gemini's caching is structurally different and not addressed here. The win is specifically for Anthropic — and via OpenRouter routing to Anthropic models.

Proposed scope

A. Detect Anthropic-routed models

In provider.ts, add a helper to determine whether the active provider is using an Anthropic model (direct or via OpenRouter):

function isAnthropicModel(providerConfig: ProviderConfig): boolean {
  const modelId = providerConfig.model?.modelId ?? "";
  // Direct Anthropic provider
  if (providerConfig.providerOptions?.anthropic) return true;
  // OpenRouter routing to Anthropic
  if (/^anthropic\//.test(modelId)) return true;
  // Heuristic: model name contains "claude"
  if (/claude/i.test(modelId)) return true;
  return false;
}
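
A few illustrative inputs for the helper (model IDs are examples, and the ProviderConfig literal is narrowed to just the field the helper reads):

const cases = [
  { modelId: "anthropic/claude-sonnet-4", expected: true }, // OpenRouter-style prefix
  { modelId: "claude-3-5-haiku-latest", expected: true },   // "claude" name heuristic
  { modelId: "gpt-4o", expected: false },
  { modelId: "gemini-2.0-flash", expected: false },
];
for (const { modelId, expected } of cases) {
  const config = { model: { modelId } } as unknown as ProviderConfig;
  console.assert(isAnthropicModel(config) === expected, modelId);
}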

B. Mark cacheable messages

In initializeSystemPromptAndTask (webAgent.ts:1641-1672), when Anthropic-routed, mark the system message and the task+plan user message as cacheable:

const cacheableMeta = isAnthropicModel(this.providerConfig)
  ? { providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } }
  : {};

this.messages = [
  { role: "system", content: systemPrompt, ...cacheableMeta },
  { role: "user", content: taskAndPlan, ...cacheableMeta },
];

Verify the exact key name (providerOptions vs experimental_providerMetadata) against the installed @ai-sdk/anthropic version in package.json.
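
If the pinned version predates the rename, the older key takes the same nested shape; a sketch of the legacy form (pick whichever one the installed SDK actually reads rather than shipping both):

const cacheableMetaLegacy = isAnthropicModel(this.providerConfig)
  ? {
      experimental_providerMetadata: {
        anthropic: { cacheControl: { type: "ephemeral" } },
      },
    }
  : {};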

C. Optionally mark the latest snapshot as cacheable too

A more aggressive optimization: move the cache breakpoint to the latest snapshot on each iteration, so every snapshot stays cacheable until the next one replaces it. That makes the entire conversation prefix up to the most recent assistant turn a cache hit. Tradeoff: cache writes are billed at a premium over normal input (and well above the read rate), and snapshots churn (one new every iteration), so the newly cached tail is rewritten every step. Benchmark before enabling.
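
A hypothetical sketch of that variant, assuming snapshot user messages can be identified (the AgentMessage type and isSnapshot flag here are illustrative, not existing Pilo code):

type AgentMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: unknown;
  isSnapshot?: boolean;
  providerOptions?: Record<string, unknown>;
};

// Move the cache breakpoint to the newest snapshot so the whole conversation
// prefix before it can be read from cache on the next iteration.
function moveCacheBreakpointToLatestSnapshot(messages: AgentMessage[]): void {
  const snapshots = messages.filter((m) => m.role === "user" && m.isSnapshot);
  for (const snapshot of snapshots) {
    delete snapshot.providerOptions; // simplification: clears any other provider options too
  }
  const latest = snapshots.at(-1);
  if (latest) {
    latest.providerOptions = {
      anthropic: { cacheControl: { type: "ephemeral" } },
    };
  }
}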

D. Surface cache metrics

streamText returns usage and providerMetadata. For Anthropic, the cache token counts (cacheReadInputTokens and cacheCreationInputTokens) are reported on usage or under providerMetadata.anthropic, depending on the SDK version. Surface these in the AI_GENERATION event:

this.eventEmitter.emit(WebAgentEventType.AI_GENERATION, {
  // ... existing fields ...
  cacheReadTokens: usage?.cacheReadInputTokens ?? 0,
  cacheWriteTokens: usage?.cacheCreationInputTokens ?? 0,
});

The eval-judge consumer (and any cost-tracking layer) can then compute per-task savings.
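
For the savings math, a sketch assuming Anthropic's published multipliers of roughly 1.25x the base input rate for 5-minute cache writes and 0.1x for cache reads (names and shapes here are illustrative):

interface CacheUsage {
  inputTokens: number;      // uncached input tokens, billed at the full rate
  cacheReadTokens: number;  // from cacheReadInputTokens
  cacheWriteTokens: number; // from cacheCreationInputTokens
}

function estimateInputCostUSD(u: CacheUsage, inputRatePerMTok: number): number {
  const perToken = inputRatePerMTok / 1_000_000;
  return (
    u.inputTokens * perToken +
    u.cacheWriteTokens * perToken * 1.25 +
    u.cacheReadTokens * perToken * 0.1
  );
}

// Savings vs. a no-caching baseline where every token is billed at the full rate.
function estimateSavingsUSD(u: CacheUsage, inputRatePerMTok: number): number {
  const totalTokens = u.inputTokens + u.cacheWriteTokens + u.cacheReadTokens;
  const baseline = totalTokens * (inputRatePerMTok / 1_000_000);
  return baseline - estimateInputCostUSD(u, inputRatePerMTok);
}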

Implementation notes

  • The exact AI SDK syntax for cache markers varies between SDK versions. Verify against the version pinned in packages/core/package.json before writing the code.
  • Cache markers on the system message and the task+plan message should form a single contiguous cacheable prefix. But the SDK may treat them as two separate cache entries (one per message). Test by checking cacheReadInputTokens on the second iteration of a fresh task.
  • 5-minute TTL means: tasks that pause for >5 minutes between steps lose the cache. For typical browser-automation tasks (each step is 5-30 seconds), this isn't an issue.
  • The cache is per-account and per-content-prefix. The system prompt's currentDate field changes daily, so the cache invalidates each midnight. Acceptable for now; if it becomes a real cost concern, the date can be moved out of the system prompt into a small user message placed after the cached prefix (see the sketch after this list).
  • Don't add caching for non-Anthropic providers. OpenAI does its own automatic caching; Gemini doesn't support this style; Ollama/LM Studio don't either.
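
A sketch of the date extraction mentioned above, if it ever becomes worth doing: build the system prompt without currentDate and append the date as a small uncached user message after the cached task+plan message, so the cached prefix stays byte-identical across days (message wording illustrative):

const cacheableMeta = isAnthropicModel(this.providerConfig)
  ? { providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } }
  : {};

this.messages = [
  // systemPrompt built without currentDate so the cached prefix never changes
  { role: "system", content: systemPrompt, ...cacheableMeta },
  { role: "user", content: taskAndPlan, ...cacheableMeta },
  // The date sits after the cache breakpoint, so changing it daily costs only a few tokens
  { role: "user", content: `Current date: ${new Date().toISOString().slice(0, 10)}` },
];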

Acceptance criteria

  • For Anthropic-routed providers, system + task+plan messages carry cacheControl: { type: "ephemeral" }.
  • For non-Anthropic providers, no cache markers are added (verify by message inspection in tests).
  • The AI_GENERATION event includes cacheReadTokens / cacheWriteTokens (zero when no caching applies).
  • A manual smoke run on a 5-step Claude task shows cacheReadInputTokens > 0 on step 2+.
  • Tests in packages/core/test/ cover: cache-marker presence for Anthropic, absence for others, and the AI_GENERATION event field shape (a rough test shape is sketched below).
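
A rough shape for the marker tests (vitest assumed; buildInitialMessages is a hypothetical helper that runs initializeSystemPromptAndTask against a stubbed provider config and returns the resulting messages):

import { describe, expect, it } from "vitest";
import { buildInitialMessages } from "./helpers/buildInitialMessages";

describe("Anthropic prompt caching markers", () => {
  it("marks system and task+plan messages for Anthropic-routed models", () => {
    const messages = buildInitialMessages({ modelId: "anthropic/claude-sonnet-4" });
    for (const message of messages.slice(0, 2)) {
      expect(message.providerOptions?.anthropic?.cacheControl).toEqual({
        type: "ephemeral",
      });
    }
  });

  it("adds no cache markers for non-Anthropic providers", () => {
    const messages = buildInitialMessages({ modelId: "gpt-4o" });
    for (const message of messages) {
      expect(message.providerOptions?.anthropic).toBeUndefined();
    }
  });
});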

Effort estimate

1-2 days including verification against the SDK and the smoke test.

Related issues

Pairs with the per-model prompt variants issue — flash variants will have a shorter cacheable prefix, but the cache markers go on whichever variant is selected.

Files likely affected

  • packages/core/src/provider.ts (provider detection helper)
  • packages/core/src/webAgent.ts (initializeSystemPromptAndTask, AI_GENERATION event)
  • packages/core/src/events.ts (event field additions)
  • packages/core/test/webAgent.test.ts
