Watching an Assistant Forget: A TKOS Log-Replay Case Study
A preregistered offline audit of whether long-running assistant workflows preserve warranted operational state. Tracks 11,262 belief instances across 20,190 evaluation turns sampled from 164 sessions; the intervention catalog (v0.1 and v0.2) is the evaluation mechanism, with full TP/FP/FN/TN accounting.
Planned canonical URL:
https://topicspace.ai/research/case-studies/tkos-log-replay-v1Controlling question
Long-running assistants do not only need more memory. They need maintained operational state — the active reasoning trace carries assumptions that age, get refreshed, and can be contradicted as work proceeds.
The deeper question this case study tests is therefore not “can we suppress bad actions,” but which assumption inside the active reasoning trace is still warranted right now? TKOS reconstructs the warrant state across long sessions; the intervention catalog is how we measure whether the reconstruction is faithful enough to back action-gating downstream.
The claim is not that TKOS improves Claude. This is an offline replay study, not a live intervention study. The distinction matters: TKOS tracks whether a belief still has authority; governance decides whether that authority is enough to act. The case study evaluates the first half; the second half is the evaluation mechanism.
What this case study tests
This case study tests the Belief Stack pattern on a typed operational substrate: engineering assistant logs.
The raw material is long-running workflow behavior: tool calls, tool errors, pipeline runs, validation steps, deployment readiness, report generation, and correction loops.
That makes it a useful substrate for TKOS because the assistant's work depends on operational beliefs that persist across turns: the pipeline is still running; a fix has been attempted; validation is pending; a deploy is waiting on approval; a report is ready; a prior failure has been resolved.
Those beliefs are not single-turn facts. They age. They can be refreshed. They can be contradicted. And if the assistant acts on them without checking whether they still hold, the workflow can drift.
What this does not claim
- Does not claim that TKOS improves Claude.
- Does not claim live runtime impact.
- Does not claim the v0.1 or v0.2 intervention rules are production-ready.
- Does not generalize beyond one user's Claude session logs.
It tests whether an offline replay harness can make state-level assistant failures measurable.
Phase 1: Building the replay substrate
Phase 1 converted raw Claude session logs into a replayable Belief Stack substrate.
The parser normalized 83,271 turns from 164 JSONL files across roughly 10.5 weeks of work. The corpus included 28,946 tool calls and 1,309 explicit tool errors.
Each turn was assigned to one of seven typed operational regions where possible: data fetch, pipeline run, failure diagnosis, validation, deploy readiness, report generation, and evidence sealing.
Turns outside that operational substrate were left UNCLASSIFIED. That high UNCLASSIFIED rate was intentional. The goal was not to force every conversational turn into a region. The goal was to identify the action-bearing parts of the workflow and leave ordinary conversation outside coverage.
Every classified turn emitted both a label and a warrant. That is the Belief Stack representation contract in miniature: no label without warrant.
Phase 2: From per-turn facts to cross-turn beliefs
Phase 2 introduced cross-turn state-level beliefs. Unlike per-turn tool facts, these beliefs decay. The replay layer tracked eight state beliefs:
pipeline_running, pipeline_failed, issue_under_diagnosis, fix_attempted, validation_pending, deploy_pending, report_ready, and user_approval_required.
Each belief had birth, refresh, retirement, contradiction, and decay rules. The purpose was not to infer vague intent. The purpose was to reconstruct operational state: what the assistant appeared to believe was true at a given point in the workflow, and whether that belief still had enough warrant to authorize action.
Phase 2 sampled 20,190 evaluation turns from the 83,271-turn universe, stratified by session with a 200-turn-per-session cap and a locked random seed (20260529). Across those 20,190 evaluation points, 11,262 belief instances were tracked across the session ledger — births, refreshes, contradictions, and stale-decay retirements all logged with timestamps.
Phase 2 v0.1: The first locked measurement
The v0.1 intervention catalog tested four patterns: repeated_failure_loop, stale_deploy_prior, stale_pipeline_prior, and contradicted_fix_prior.
The important result from v0.1 was not that the catalog worked. It mostly did not.
Three of four rules were structurally too narrow or ambiguous. repeated_failure_loop never fired. stale_deploy_prior never fired. contradicted_fix_prior had zero applicable turns. Only stale_pipeline_prior fired at scale.
That is still a useful result. The replay harness did what it was supposed to do: it made the failure of the rules visible instead of allowing the system to narrate success.
What v0.1 taught us
v0.1 exposed four concrete problems.
First, repeated-failure-loop detection was too strict. Exact or near-exact signature matching did not capture how repeated failures appear in real assistant logs.
Second, deploy gating had a threshold-semantics problem. The approval requirement rule was written in a way that made it nearly unable to fire.
Third, contradicted-fix detection was too narrow. It depended on Bash validation failures and missed broader post-fix tool errors.
Fourth, stale pipeline detection was measurable but miscalibrated. A 20-minute threshold produced many suppressions, but also a non-trivial false-positive rate.
The result was not a failed proof of concept. It was a successful falsification pass.
Phase 2 v0.2: Amended rules, same measurement discipline
v0.2 kept the same sample, same success criterion, same belief definitions, and same labeling protocol. Only the intervention rules changed.
The amendments renamed the intervention threshold, fixed deploy-gating semantics, loosened repeated-failure signature matching, broadened contradicted-fix detection with context overlap, and moved stale-pipeline threshold from 20 minutes to 30 minutes.
v0.2 was not treated as a replacement for v0.1. It was reported separately, with a head-to-head comparison. That separation matters. v0.1 remains the record of what the first locked rules did. v0.2 tests whether specific amendments improve measurement coverage without hiding the earlier failure.
What v0.2 showed
Head-to-head: v0.1 vs v0.2
Same 20,190-turn sample, same belief timelines, same 5-turn lookahead for labeling. The only thing that changed is the intervention rule operationalizations.
The table should be read as a calibration artifact, not a scoreboard: v0.2 improved some measurements, exposed new modeling failures, and left other rules unchanged.
Counts
| Rule | Applicable | SUPPRESS | TP | FP | FN | TN |
|---|---|---|---|---|---|---|
| repeated_failure_loop | 830 → 830 | 0 → 0 | 0 → 0 | 0 → 0 | 167 → 167 | 643 → 643 |
| stale_deploy_prior | 126 → 126 | 0 → 0 | 0 → 0 | 0 → 0 | 17 → 17 | 109 → 109 |
| stale_pipeline_prior | 3,146 → 3,146 | 558 → 242 | 41 → 21 | 517 → 221 | 190 → 210 | 2,379 → 2,675 |
| contradicted_fix_prior | 0 → 178 | 0 → 178 | 0 → 42 | 0 → 136 | 0 → 0 | 0 → 0 |
Each cell shows v0.1 → v0.2. The sample, beliefs, and labeling protocol are unchanged across versions; only the intervention rule operationalizations changed.
Rates
| Rule | Detection rate | False-positive rate | Precision (v0.2) | Main read |
|---|---|---|---|---|
| repeated_failure_loop | 0.000 → 0.000 | 0.000 → 0.000 | n/a | Loosened signature predicate still produced 0 suppressions. Either corpus has few same-signature 3-in-10 loops, or the definition still doesn't match real loop shape. Hand-review is the next step. |
| stale_deploy_prior | 0.000 → 0.000 | 0.000 → 0.000 | n/a | Inverting the threshold did not help. Structural finding:user_approval_required retires on the same signal that births deploy_pending; the two beliefs don't co-exist at the deploy moment. |
| stale_pipeline_prior | 0.177 → 0.091 | 0.179 → 0.076 | — | Threshold trade-off visible: 20→30 min more than halves FPR but also halves detection. Suggests a global timeout is the wrong tool — per-task duration priors needed. |
| contradicted_fix_prior | degenerate | degenerate | 0.236 | Detection/FPR degenerate (no ALLOW verdicts by construction); precision = 0.24. v0.2 made the rule measurable but still admits incidental same-file errors. Add temporal constraint in v0.3. |
The rates table is the more important of the two: contradicted_fix_prior has no ALLOW verdicts by construction (applicability implies SUPPRESS), so detection-rate and FPR denominators collapse. Precision is the meaningful metric for that rule; for the others, detection and FPR are the legible axes.
The prose below walks the table row by row.
Repeated-failure-loop
Repeated-failure-loop detection still did not fire. Even after loosening signature matching to a disjunction — same tool plus error-message Jaccard ≥ 0.5, or shared file path with shared command first-token, or shared exception class — the rule produced zero suppressions. That suggests either the corpus contains few three-in-ten-turn same-signature loops, or the current definition still does not match how loops actually appear in assistant workflows.
Stale deploy gating
Stale deploy gating also still did not fire. This time the result pointed away from the intervention rule and toward the belief lifecycle. By the time a deploy action appears, user_approval_required has usually already been retired by the same user signal that births deploy_pending. The two beliefs do not co-exist when the intervention needs them. That is a state-modeling issue, not just a rule issue.
Stale pipeline detection
Stale pipeline detection showed a clear threshold tradeoff. Moving from 20 minutes to 30 minutes reduced suppressions (558 → 242) and more than halved the false-positive rate (0.179 → 0.076), but it also halved detection (0.177 → 0.091). That suggests the problem is not solved by a single global timeout. Pipeline expectations likely need duration priors by task type.
Contradicted-fix detection
Contradicted-fix detection became measurable. v0.1 had zero applicable turns; v0.2 surfaced 178. But precision was low: 42 of those 178 contradictions held up in the five-turn lookahead window, and 136 did not. The context-overlap predicate caught real validation failures but also incidental same-file or same-command errors that did not actually invalidate the fix. The rule needs a stronger temporal constraint or a sharper distinction between validation-context failures and incidental overlap.
What the stack made visible
A normal log parser can tell you that a command failed. A normal dashboard can count tool errors. The Belief Stack view asks a different question:
That is the difference between observing events and maintaining operational state.
In this case study, the stack made four things visible:
- operational regions within long conversations
- state-level beliefs that persisted across turns
- intervention rules whose failures could be measured
- lifecycle/modeling errors that would have been invisible in a simple tool-error count
The most important finding is not that v0.2 “worked.”
The most important finding is that v0.2 turned broad v0.1 failures into narrower modeling questions.
Why this matters
Long-running assistants are increasingly used for engineering workflows, research workflows, analysis pipelines, and deployment-adjacent work.
In those settings, failures often do not come from a single bad answer.
They come from stale operational state: assuming a pipeline is still running when it has failed; assuming a fix worked before validation; retrying the same broken action; deploying from an outdated approval state; carrying forward an old plan after evidence changed.
These are not primarily memory failures. They are state-management failures.
TKOS is the hypothesis that these failures can be made explicit enough to audit, suppress, or eventually prevent.
Limits
- Offline replay study. Ground truth is retrospective.
- Single substrate. The corpus comes from one user's Claude sessions over roughly 10.5 weeks.
- Early intervention rules. The v0.1 and v0.2 catalogs are not production-ready.
- No live impact claim. The strongest claim available at this stage is that the replay method can reconstruct assistant state, apply preregistered intervention rules, and falsify those rules with explicit false-positive accounting.
Next validation
The next step is not a public runtime claim. It is v0.3.
v0.3 should begin with hand-review of repeated-failure-loop candidate windows, because further loosening without examples risks inventing a detector for a pattern that may not exist in this corpus.
It should also revise user_approval_required lifecycle behavior, likely by separating approval_pending from approval_observed.
For stale_pipeline_prior, v0.3 should test adaptive expected-duration priors instead of a global 20- or 30-minute threshold.
For contradicted_fix_prior, v0.3 should add a temporal window around the fix attempt and distinguish validation-context failures from incidental overlap.
If those changes improve detection while reducing false positives, the case study can move from “measurement harness validated” to “intervention catalog beginning to calibrate.” If they do not, that is also useful. It means the state representation or labeling protocol needs revision before any live-runtime claim is justified.
Related research
- Belief Stack v0.1 specification — the architectural pattern this case study instantiates.
- Reading the TopicSpace Belief Field: A v1 Case Study — sibling case study on a soft/latent substrate (financial narratives).
- The LLM That Forgot Time — the failure-mode anchor for runtime belief lifecycle.
- A pattern for problems where beliefs must evolve — the architectural framing this case study tests.
Citation
BibTeX:
APA:
Follow the research
Occasional updates on Belief Stack, TopicSpace case studies, and runtime belief-state evaluation.
I'll send notes when there's a new spec, case study, methodology update, or major finding — not a weekly newsletter for the sake of it.