Current state
Pilo's conversation history grows monotonically across iterations. The current trimming strategy (truncateOldExternalContent, webAgent.ts:743-774) clips the body of old <EXTERNAL-CONTENT> blocks before each new snapshot push. A separate Tier-2 issue ("aggressive history clipping for long runs") proposes extending this to clip old assistant tool-call messages, feedback messages, and tool results.
For typical 10-30 iteration tasks, that aggressive clipping is enough. But:
- Tasks that hit maxIterations: 50 accumulate a lot of clipped-but-still-present content.
- The structure of the conversation (50 assistant turns, 50 tool results, all the intermediate noise) costs tokens even when content is clipped.
- The model's attention spread across 50+ messages is suboptimal — recent context gets less proportional weight.
A more aggressive approach: periodically summarize old history into a single compact block, then drop the underlying messages entirely.
The gap
This issue is the bigger-effort successor to the Tier-2 clipping issue. Land that one first; this one only becomes valuable if real workloads regularly exceed what clipping alone handles.
Reasons to do it:
- Long-running research tasks (e.g., a benchmark task with maxIterations=100).
- Tasks where each iteration adds substantial content (extract calls returning large markdown, search results, etc.).
- Pushing the long-tail context-window failures from "crashes opaquely after N iterations" to "soft-summarizes and continues."
Reasons to delay:
- Most current tasks don't hit this threshold.
- Adds complexity and an additional LLM dependency (the summarizer).
- Summary quality is unpredictable; a bad summary can make subsequent steps worse than the original messages would have.
Proposed scope
A. Compaction trigger
Two conditions; compaction fires when either is met:
- Step cadence: every K iterations (default K=25).
- Size threshold: estimated history size > T characters (default T=40000; characters are a cheap proxy for token count, matching triggerCharThreshold in section C).
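The either-condition trigger can be sketched as a small pure predicate. `shouldCompact` and `TriggerConfig` are illustrative names, not existing code; the field names come from the CompactionConfig in section C:

```typescript
// Minimal sketch of the compaction trigger: fire on a fixed iteration
// cadence OR when the estimated history size exceeds the char budget.
interface TriggerConfig {
  triggerEveryNSteps: number;   // default 25
  triggerCharThreshold: number; // default 40000
}

function shouldCompact(
  iteration: number,
  historyChars: number, // cheap char-count over the serialized messages
  cfg: TriggerConfig,
): boolean {
  const onCadence = iteration > 0 && iteration % cfg.triggerEveryNSteps === 0;
  const overBudget = historyChars > cfg.triggerCharThreshold;
  return onCadence || overBudget;
}
```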
B. Compaction strategy
Call a separate LLM (default: the same provider, possibly with a cheaper model class) with a tight summarization prompt:
You are summarizing an agent run for prompt compaction. The agent is a web-browsing
assistant working through a multi-step task. Summarize the conversation so the agent
can continue with much less history while keeping all task-relevant context.
Capture:
- Task requirements (re-state precisely)
- Key facts and data the agent has gathered (with verbatim values)
- Decisions the agent has made and why
- Partial progress (what's been completed)
- Errors encountered and how they were handled
- The current strategic situation
Preserve:
- Important entities, values, URLs, file paths, identifiers verbatim
- Anything the agent referred back to in recent steps
Critical rules:
- Only mark a step as completed if you see explicit success confirmation. If a step
was started but not explicitly confirmed complete, mark it as "IN-PROGRESS".
- Never infer completion from context.
- Return plain text only. Do not include tool calls, JSON, or markdown headers.
The summary replaces the messages between messages[2] (after system + task+plan) and the start of the last N=6 kept iterations (keepLastNIterations). The new shape:
[0] system
[1] user (task + plan)
[2] user (<compacted-history>{summary}</compacted-history>)
[3+] last N iterations of full messages (intact)
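The rewrite above is essentially a splice. A minimal sketch, where `Msg` is a simplified stand-in for the real message type and `rewriteHistory` / `firstKeptIndex` are hypothetical names:

```typescript
// Keep [system, task+plan], insert one compacted-history user message,
// keep everything from the last N iterations intact.
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

function rewriteHistory(
  messages: Msg[],
  summary: string,
  firstKeptIndex: number, // index of the first message of the last N iterations
): Msg[] {
  const head = messages.slice(0, 2); // [0] system, [1] task + plan
  const compacted: Msg = {
    role: "user",
    content: `<compacted-history>${summary}</compacted-history>`,
  };
  const tail = messages.slice(firstKeptIndex); // last N iterations, intact
  return [...head, compacted, ...tail];
}
```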
C. Configuration
interface CompactionConfig {
  enabled: boolean;                // default false (opt-in)
  triggerEveryNSteps: number;      // default 25
  triggerCharThreshold: number;    // default 40000
  keepLastNIterations: number;     // default 6
  summarizerModel?: LanguageModel; // default: use main provider
  summarizerMaxTokens?: number;    // default 2000
}

interface WebAgentOptions {
  // ...
  compaction?: Partial<CompactionConfig>;
}
D. Failure modes
- Summarizer LLM call fails: log a warning, skip compaction this iteration, continue with full history.
- Summarizer returns empty / suspicious output: same.
- Summarizer times out: configurable timeout (default 30s); skip on timeout.
These all degrade gracefully — the run continues without the compaction.
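The degrade-gracefully contract can be concentrated in one wrapper: any failure (error, timeout, empty output) yields null and the caller simply skips compaction this iteration. `trySummarize` is a hypothetical name, and the empty-output check is a stand-in for whatever "suspicious output" heuristics land in the real implementation:

```typescript
// Race the summarizer against a timeout; normalize every failure mode
// to null so the caller has exactly one skip path.
async function trySummarize(
  summarize: () => Promise<string>,
  timeoutMs = 30_000, // configurable; default 30s per section D
): Promise<string | null> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const summary = await Promise.race([
      summarize(),
      new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error("summarizer timeout")), timeoutMs);
      }),
    ]);
    // Empty output counts as a failure too.
    return summary.trim().length > 0 ? summary : null;
  } catch {
    return null; // real implementation would log a warning here
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```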
E. Audit trail
Emit a HISTORY_COMPACTED event:
{
  iterationId: string;
  iteration: number;
  beforeMessageCount: number;
  afterMessageCount: number;
  estimatedTokensSaved: number;
  summaryLength: number;
}
Save the pre-compaction history snapshot to telemetry / debug logs so post-hoc debugging is possible (a bad summary that derailed the run is recoverable for investigation).
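A hypothetical builder for that payload, assuming a chars/4 token estimate (the helper name and the before/after char inputs are illustrative, not existing code):

```typescript
interface CompactionEvent {
  iterationId: string;
  iteration: number;
  beforeMessageCount: number;
  afterMessageCount: number;
  estimatedTokensSaved: number;
  summaryLength: number;
}

// Derive the event fields from the before/after history; tokens are
// estimated as chars/4, clamped so a no-op never reports negative savings.
function buildCompactionEvent(
  iterationId: string,
  iteration: number,
  beforeChars: number,
  afterChars: number,
  beforeCount: number,
  afterCount: number,
  summary: string,
): CompactionEvent {
  return {
    iterationId,
    iteration,
    beforeMessageCount: beforeCount,
    afterMessageCount: afterCount,
    estimatedTokensSaved: Math.max(0, Math.round((beforeChars - afterChars) / 4)),
    summaryLength: summary.length,
  };
}
```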
F. Compaction wrapper for prompt injection
The summarizer's output gets wrapped in <compacted-history>...</compacted-history> and treated as external content (the existing <EXTERNAL-CONTENT> pattern). It's content the model needs to read but not act on directly.
Implementation notes
- The summarizer call is expensive (a full LLM round-trip with potentially 30k tokens of input). It's worth it when it lets the main loop continue rather than crash, but the cost matters. Budget: aim for one compaction per task in the worst case.
- "Keep last N iterations" is intentional — recent context is where the model's working memory lives. Summarizing too aggressively (keeping only 1-2 iterations) breaks the agent's ability to recover from immediately prior errors.
- The summarizer's prompt explicitly enforces "only confirmed complete" to prevent hallucinated progress. This is essential — without it, the summary may say "completed step 4" when step 4 actually errored and was retried.
- Compaction must preserve AI SDK tool-call/tool-result pairing for the messages it keeps. Don't break this invariant.
- Test scenarios:
- 30-iteration task with compaction triggering at iteration 25.
- Summarizer LLM fails → run continues with full history.
- Summarizer hallucinates progress → validator catches the resulting bad done().
- Multiple compactions in one run (>50 iterations).
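The tool-call/tool-result pairing note above can be made concrete with a small cut-point guard. A sketch under simplified assumptions (`Msg` stands in for the AI SDK message type, and `safeCutIndex` is an invented name): if the proposed cut would make a tool-result message the first kept message, walk back so the assistant turn that issued the call is kept with it.

```typescript
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
}

// Never separate a tool-result message from the assistant tool-call
// message that precedes it: move the cut earlier until the first kept
// message is not a dangling tool result.
function safeCutIndex(messages: Msg[], proposed: number): number {
  let cut = proposed;
  while (cut > 0 && messages[cut]?.role === "tool") {
    cut--;
  }
  return cut;
}
```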
Acceptance criteria
- Compaction triggers correctly on both step and char thresholds.
- Disabled by default; opt-in via compaction.enabled.
- Summary replaces middle history, keeps system + task+plan + last K iterations.
- Failure modes degrade gracefully without aborting the task.
- HISTORY_COMPACTED event fires with telemetry.
- The summarizer prompt enforces "only-confirmed-complete" anti-hallucination.
- Tests cover: trigger thresholds, success path, summarizer-fails fallback, tool-call pairing preserved.
Effort estimate
4-6 days. The summarizer prompt design and tuning is the time-consuming part — getting it to produce useful summaries that don't drift requires evals.
Related issues
Strict superset of the Tier-2 history-clipping issue. Land that first. This one only matters if real workloads regularly hit the limits clipping alone leaves.
Files likely affected
packages/core/src/webAgent.ts (compaction trigger, message rewrite)
packages/core/src/compactionManager.ts (new file)
packages/core/src/prompts.ts (summarizer prompt)
packages/core/src/events.ts (HISTORY_COMPACTED)
packages/core/src/types/ (WebAgentOptions extensions)
packages/core/src/config/defaults.ts
packages/core/test/