Symptom

After #395 shipped the 30-char short-message defense in depth (BM25 dampening + tier-selector active-thread override + AgentLoopRunner memory-summary guard + behavioural directive), the shorter failure case from #383 is gone. But the longer variant from the same incident report is not addressed and remains reproducible.
From #383, Incident 1 (2026-05-10T02:53):

user: Hopefully we can go this coming winter. My health seems better now (67 chars, 12 words)
assistant: Noted. I've got that on the travel ledger: hoping for Cathedral City this coming winter, with health in a better place now.
The user message DOES introduce a new fact ("hoping to go this winter, health better now") that is on-topic for the very recent thread, but the reply still summarises the just-stored memory entry as a non-sequitur closing rather than continuing the conversation — and gets routed through Low tier since the prompt is conversational, not complex.
Why the existing defenses don't catch this

AgentContextBuilder short-message gate (Skip BM25 topic injection on short user messages (#383) #384): 67 chars > 30, so the BM25 dampening doesn't engage.

Tier-selector active-thread override (Short-message wrong-context defense in depth (#383) #395): gated on promptText.Length ≤ 30, so this prompt routes Low as normal.

AgentLoopRunner memory-summary guard (Short-message wrong-context defense in depth (#383) #395): gated on originalUserRequest.Length ≤ 30, so the guard doesn't fire even though the response does match MemorySummaryReplyRegex.

The reply was generated by gpt-5.4-mini at Low tier.
Diagnosis

The 30-char threshold cleanly catches one-liner follow-ups ("ok", "I'll find out soon"). It does NOT catch fact-introducing short paragraphs because they're above the lexical-noise band. The failure mode in this incident isn't "low-signal message + noisy BM25 hits" — it's "user introduces a small fact AND the model fixates on storing-and-summarising it instead of continuing the discussion."
A length-only heuristic can't tell the difference between:
"Hopefully we can go this coming winter. My health seems better now" → fact-introducing follow-up that SHOULD save AND continue the conversation
"What is the capital of France?" → similar length, no thread context to continue
What differs is recent-window topical overlap: the user's prompt overlaps highly with the last few assistant/user turns. The BM25 / embedding query that injects the noisy entries should be biased toward the recent window, not the raw incoming message in isolation.
Proposed approaches
Roughly cheapest → deepest, similar shape to the original #383 proposals:
Per-turn BM25 query enrichment. Instead of querying long-term memory with just the incoming user message, query with the incoming message concatenated with the last N=2-3 user/assistant turns. The current top-of-thread is in the query, so BM25 stays anchored on what's actually being discussed.
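A minimal sketch of the enrichment, in Python for illustration (the function name, the `(role, text)` turn shape, and N=3 are assumptions, not existing code — the real change would live in AgentContextBuilder):

```python
def build_memory_query(incoming: str,
                       history: list[tuple[str, str]],
                       n_turns: int = 3) -> str:
    """Concatenate the incoming message with the last N turns so the
    BM25 query stays anchored on the active thread, not the raw message."""
    recent = history[-n_turns:]  # each turn is (role, text)
    window = " ".join(text for _, text in recent)
    return f"{window} {incoming}".strip()
```

With an empty history this degrades gracefully to the current behaviour (query = raw message), so it can ship without a feature flag.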
Recent-window embedding overlap gate. Compute the cosine similarity between the user message embedding and an embedding of the recent-history window. When overlap is high (≥ some threshold), suppress the BM25-delta injections entirely — the model already has the topic in conversation history, no need to inject memory.
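A sketch of the gate, assuming embeddings arrive as plain float vectors; the 0.75 threshold is a placeholder to be tuned from telemetry:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; returns 0.0 for zero-norm inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def should_inject_memories(msg_emb: list[float],
                           window_emb: list[float],
                           threshold: float = 0.75) -> bool:
    """Suppress BM25-delta injection when the message already overlaps the
    recent-history window: the topic is in conversation context already."""
    return cosine(msg_emb, window_emb) < threshold
```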
Tier-selector window-overlap signal. Like the active-thread heuristic in Short-message wrong-context defense in depth (#383) #395, but instead of priorTurns.Count >= 3 && recent, compute embedding overlap between the new message and the recent window. Promote Low → Balanced when overlap is high (high overlap means the message IS continuing an established thread, regardless of length).
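The promotion rule itself is trivial once an overlap score exists; a sketch (tier names from this issue, the 0.6 threshold hypothetical):

```python
def select_tier(base_tier: str, window_overlap: float,
                promote_threshold: float = 0.6) -> str:
    """Promote Low -> Balanced when the new message overlaps the recent
    window strongly enough to count as thread continuation."""
    if base_tier == "Low" and window_overlap >= promote_threshold:
        return "Balanced"
    return base_tier
```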
Expand the memory-summary guard's user-message length gate. Replace the hard 30-char cap with a more sophisticated condition — e.g. "user message length ≤ 80 chars AND `SaveMemory` content has low novel-keyword overlap with the user message AND response matches `MemorySummaryReplyRegex`." Higher false-positive risk but catches the longer variant.
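A sketch of the composite condition, in Python for illustration. The regex literal below is a hypothetical stand-in for `AgentLoopRunner.MemorySummaryReplyRegex`, and the 0.3 novelty cutoff is a placeholder; the structure is what matters:

```python
import re

# Hypothetical stand-in for AgentLoopRunner.MemorySummaryReplyRegex.
MEMORY_SUMMARY_REPLY = re.compile(
    r"^(noted|got that|i've got that)\b.*\b(ledger|memory)\b", re.IGNORECASE)

def novel_fraction(saved: str, user_msg: str) -> float:
    """Fraction of SaveMemory keywords NOT present in the user message.
    Low novelty means the save merely restates the user's own words."""
    saved_words = {w.lower() for w in re.findall(r"\w{4,}", saved)}
    msg_words = {w.lower() for w in re.findall(r"\w{4,}", user_msg)}
    if not saved_words:
        return 0.0
    return len(saved_words - msg_words) / len(saved_words)

def guard_should_fire(user_msg: str, saved: str, reply: str,
                      max_len: int = 80, max_novelty: float = 0.3) -> bool:
    """All three conditions must hold: longer length gate, low-novelty
    save, and a reply that reads as a memory-state summary."""
    return (len(user_msg) <= max_len
            and novel_fraction(saved, user_msg) <= max_novelty
            and bool(MEMORY_SUMMARY_REPLY.search(reply)))
```

On the Cathedral City reproducer this fires: the message is 67 chars, the saved fact restates the user's wording, and the reply matches the summary pattern.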
Behavioural directive refinement. The current common-directive section is gated implicitly on the model deciding what counts as a "short follow-up." Tighten the language to call out the specific failure: "Even when the user introduces a new fact, your reply must address what they said in the context of the active thread, not just narrate that you stored the fact."
Goals
The Cathedral City reproducer (67-char fact-introducing follow-up on a topical thread) produces a thread-continuing reply, not a memory-summary closing.
Defenses combine; this should NOT replace the 30-char path, just extend coverage above it.
Telemetry on whether the recent-window overlap signal is reliable enough to gate routing or guard decisions on it.
Non-goals

Building a per-turn embedding-similarity service if one doesn't already exist — start with cosine over the existing shared query embedding.
A general "don't fixate on stored memory" rule. The fix should be targeted at the specific signature: model invoked SaveMemory AND the reply is a memory-state summary rather than a topical continuation.
Evidence

`AgentLoopRunner.MemorySummaryReplyRegex` is already calibrated to match this phrasing; the guard just doesn't engage because the length gate excludes it. Reusing the regex (with a different length gate) is one viable path.