LLM-based history compaction for long tasks #441

@lmorchard

Description

Current state

Pilo's conversation history grows monotonically across iterations. The current trimming strategy (truncateOldExternalContent, webAgent.ts:743-774) clips the body of old <EXTERNAL-CONTENT> blocks before each new snapshot push. A separate Tier-2 issue ("aggressive history clipping for long runs") proposes extending this to clip old assistant tool-call messages, feedback messages, and tool results.

For typical 10-30 iteration tasks, that aggressive clipping is enough. But:

  • Tasks that hit maxIterations: 50 accumulate a lot of clipped-but-still-present content.
  • The structure of the conversation (50 assistant turns, 50 tool results, all the intermediate noise) costs tokens even when content is clipped.
  • The model's attention spread across 50+ messages is suboptimal — recent context gets less proportional weight.

A more aggressive approach: periodically summarize old history into a single compact block, then drop the underlying messages entirely.

The gap

This issue is the bigger-effort successor to the Tier-2 clipping issue. Land that one first; this one only becomes valuable if real workloads regularly exceed what clipping alone handles.

Reasons to do it:

  • Long-running research tasks (e.g., a benchmark task with maxIterations=100).
  • Tasks where each iteration adds substantial content (extract calls returning large markdown, search results, etc.).
  • It moves the long-tail context-window failure mode from "crashes opaquely after N iterations" to "soft-summarizes and continues."

Reasons to delay:

  • Most current tasks don't hit this threshold.
  • Adds complexity and an additional LLM dependency (the summarizer).
  • Summary quality is unpredictable; a bad summary can make subsequent steps worse than the original messages would have.

Proposed scope

A. Compaction trigger

Two conditions; either one fires compaction:

  • Step cadence: every K iterations (default K=25).
  • Size threshold: estimated history size exceeds T characters (default T=40000), a cheap proxy for token count.
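A minimal sketch of the trigger check, assuming the two conditions above. The function name `shouldCompact` and the tracking of a `lastCompactionIteration` are illustrative assumptions, not the real webAgent.ts API:

```typescript
// Hypothetical sketch: either condition fires compaction.
interface TriggerConfig {
  triggerEveryNSteps: number;    // default 25
  triggerCharThreshold: number;  // default 40000
}

function shouldCompact(
  iteration: number,
  historyChars: number,
  cfg: TriggerConfig,
  lastCompactionIteration: number, // 0 if no compaction has happened yet
): boolean {
  const stepsSinceLast = iteration - lastCompactionIteration;
  return (
    stepsSinceLast >= cfg.triggerEveryNSteps ||
    historyChars > cfg.triggerCharThreshold
  );
}
```

Tracking steps since the *last* compaction (rather than `iteration % K`) keeps the cadence sensible when the size threshold fires early.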

B. Compaction strategy

Call a separate LLM (default: the same provider, possibly with a cheaper model class) with a tight summarization prompt:

You are summarizing an agent run for prompt compaction. The agent is a web-browsing
assistant working through a multi-step task. Summarize the conversation so the agent
can continue with much less history while keeping all task-relevant context.

Capture:
- Task requirements (re-state precisely)
- Key facts and data the agent has gathered (with verbatim values)
- Decisions the agent has made and why
- Partial progress (what's been completed)
- Errors encountered and how they were handled
- The current strategic situation

Preserve:
- Important entities, values, URLs, file paths, identifiers verbatim
- Anything the agent referred back to in recent steps

Critical rules:
- Only mark a step as completed if you see explicit success confirmation. If a step
  was started but not explicitly confirmed complete, mark it as "IN-PROGRESS".
- Never infer completion from context.
- Return plain text only. Do not include tool calls, JSON, or markdown headers.
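A hedged sketch of wiring the prompt above into a summarizer call. The `Summarizer` function type is an abstract stand-in for whatever provider call the codebase actually uses, and `serializeHistory`-style role-tagged flattening is an assumption:

```typescript
// Illustrative only: Summarizer stands in for the real LLM provider call.
type Summarizer = (system: string, prompt: string) => Promise<string>;

// The system prompt would be the full summarization prompt shown above.
const SUMMARIZER_SYSTEM =
  "You are summarizing an agent run for prompt compaction. ...";

async function summarizeHistory(
  summarize: Summarizer,
  middleMessages: { role: string; content: string }[],
): Promise<string> {
  // Flatten the to-be-dropped middle of the history into a role-tagged
  // transcript for the summarizer's user prompt.
  const serialized = middleMessages
    .map((m) => `[${m.role}] ${m.content}`)
    .join("\n");
  return summarize(SUMMARIZER_SYSTEM, serialized);
}
```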

The summary replaces everything between messages[2] (after system and task+plan) and the last keepLastNIterations iterations (default 6). The new shape:

[0]   system
[1]   user (task + plan)
[2]   user (<compacted-history>{summary}</compacted-history>)
[3..] last keepLastNIterations iterations of full messages (intact)
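The rewrite step can be sketched as below. The `Msg` shape is a simplified assumption; real code would keep the last N *iterations* (whole tool-call/result groups), not a raw message count:

```typescript
// Sketch of the post-compaction history rewrite (message shape simplified).
interface Msg {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

function rewriteHistory(messages: Msg[], summary: string, keepLast: number): Msg[] {
  // [0] system and [1] task+plan stay intact.
  const prefix = messages.slice(0, 2);
  // Keep the tail untouched; never let it reach back into the prefix.
  const tail = messages.slice(Math.max(2, messages.length - keepLast));
  const compacted: Msg = {
    role: "user",
    content: `<compacted-history>${summary}</compacted-history>`,
  };
  return [...prefix, compacted, ...tail];
}
```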

C. Configuration

interface CompactionConfig {
  enabled: boolean;                  // default false (opt-in)
  triggerEveryNSteps: number;        // default 25
  triggerCharThreshold: number;      // default 40000
  keepLastNIterations: number;       // default 6
  summarizerModel?: LanguageModel;   // default: use main provider
  summarizerMaxTokens?: number;      // default 2000
  summarizerTimeoutMs?: number;      // default 30000 (see failure modes)
}

interface WebAgentOptions {
  // ...
  compaction?: Partial<CompactionConfig>;
}
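Resolving the opt-in `Partial<CompactionConfig>` against defaults could look like the sketch below; `DEFAULT_COMPACTION` and `resolveCompaction` are hypothetical names, and the interface is restated in simplified form (numeric fields only) to keep the block self-contained:

```typescript
// Simplified restatement of the config for illustration.
interface CompactionConfig {
  enabled: boolean;
  triggerEveryNSteps: number;
  triggerCharThreshold: number;
  keepLastNIterations: number;
  summarizerMaxTokens: number;
}

const DEFAULT_COMPACTION: CompactionConfig = {
  enabled: false, // opt-in
  triggerEveryNSteps: 25,
  triggerCharThreshold: 40000,
  keepLastNIterations: 6,
  summarizerMaxTokens: 2000,
};

function resolveCompaction(opts?: Partial<CompactionConfig>): CompactionConfig {
  // Caller-supplied fields win; everything else falls back to defaults.
  return { ...DEFAULT_COMPACTION, ...opts };
}
```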

D. Failure modes

  • Summarizer LLM call fails: log a warning, skip compaction this iteration, continue with full history.
  • Summarizer returns empty / suspicious output: same.
  • Summarizer times out: configurable timeout (default 30s); skip on timeout.

These all degrade gracefully — the run continues without the compaction.
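The graceful-degradation path can be sketched as a single fallback wrapper, assuming all three failure modes (error, empty output, timeout) collapse to "return null, skip compaction". The helper names are illustrative:

```typescript
// Hypothetical sketch: race the summarizer against a timeout.
async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("summarizer timeout")), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Any failure (throw, timeout, empty output) yields null; the caller then
// skips compaction this iteration and continues with full history.
async function trySummarize(
  summarize: () => Promise<string>,
  timeoutMs = 30_000,
): Promise<string | null> {
  try {
    const summary = await withTimeout(summarize(), timeoutMs);
    return summary.trim().length > 0 ? summary : null;
  } catch {
    return null;
  }
}
```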

E. Audit trail

Emit a HISTORY_COMPACTED event:

interface HistoryCompactedEvent {
  iterationId: string;
  iteration: number;
  beforeMessageCount: number;
  afterMessageCount: number;
  estimatedTokensSaved: number;
  summaryLength: number;
}

Save the pre-compaction history snapshot to telemetry / debug logs so post-hoc debugging is possible (a bad summary that derailed the run is recoverable for investigation).

F. Compaction wrapper for prompt injection

The summarizer's output gets wrapped in <compacted-history>...</compacted-history> and treated as external content (the existing <EXTERNAL-CONTENT> pattern). It's content the model needs to read but not act on directly.

Implementation notes

  • The summarizer call is expensive (a full LLM round-trip with potentially 30k tokens of input). It's worth it when it lets the main loop continue rather than crash, but the cost matters. Budget: aim for one compaction per task in the worst case.
  • "Keep last N iterations" is intentional — recent context is where the model's working memory lives. Summarizing too aggressively (keeping only 1-2 iterations) breaks the agent's ability to recover from immediately prior errors.
  • The summarizer's prompt explicitly enforces "only confirmed complete" to prevent hallucinated progress. This is essential — without it, the summary may say "completed step 4" when step 4 actually errored and was retried.
  • Compaction must preserve AI SDK tool-call/tool-result pairing for the messages it keeps. Don't break this invariant.
  • Test scenarios:
    • 30-iteration task with compaction triggering at iteration 25.
    • Summarizer LLM fails → run continues with full history.
    • Summarizer hallucinates progress → validator catches the resulting bad done().
    • Multiple compactions in one run (>50 iterations).
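The pairing invariant in the notes above can be checked with a sketch like this. The `ToolMsg` shape is a deliberate simplification of the AI SDK message types, not the real thing:

```typescript
// Simplified stand-in for AI SDK messages: an assistant message carries the
// tool-call ids it emitted; a tool message carries the ids it answers.
interface ToolMsg {
  role: "assistant" | "tool";
  toolCallIds: string[];
}

// A kept tail is valid only if every tool result is preceded by the
// assistant tool call it answers; a result whose call was compacted away
// breaks the invariant.
function pairingIntact(kept: ToolMsg[]): boolean {
  const openCalls = new Set<string>();
  for (const msg of kept) {
    if (msg.role === "assistant") {
      msg.toolCallIds.forEach((id) => openCalls.add(id));
    } else if (!msg.toolCallIds.every((id) => openCalls.has(id))) {
      return false;
    }
  }
  return true;
}
```

In practice this argues for cutting the history at iteration boundaries (whole assistant-turn/tool-result groups) rather than at arbitrary message indices.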

Acceptance criteria

  • Compaction triggers correctly on both step and char thresholds.
  • Disabled-by-default; opt-in via compaction.enabled.
  • Summary replaces middle history, keeps system + task+plan + last K iterations.
  • Failure modes degrade gracefully without aborting the task.
  • HISTORY_COMPACTED event fires with telemetry.
  • The summarizer prompt enforces "only-confirmed-complete" anti-hallucination.
  • Tests cover: trigger thresholds, success path, summarizer-fails fallback, tool-call pairing preserved.

Effort estimate

4-6 days. The summarizer prompt design and tuning is the time-consuming part — getting it to produce useful summaries that don't drift requires evals.

Related issues

Strict superset of the Tier-2 history-clipping issue. Land that first. This one only matters if real workloads regularly hit the limits clipping alone leaves.

Files likely affected

  • packages/core/src/webAgent.ts (compaction trigger, message rewrite)
  • packages/core/src/compactionManager.ts (new file)
  • packages/core/src/prompts.ts (summarizer prompt)
  • packages/core/src/events.ts (HISTORY_COMPACTED)
  • packages/core/src/types/ (WebAgentOptions extensions)
  • packages/core/src/config/defaults.ts
  • packages/core/test/

Metadata

Labels: enhancement (New feature or request)