topicspace.ai · research · case study

Belief Stack v0.3 — planning-side experiment

Maintained belief state outperforms raw context on a bounded planning task — at 14% of the input tokens and 3.2× lower latency.

locked · run-complete2026-06-03

Pre-registration: /research/pre-registrations/belief-stack-v0-3
Predecessors: TKOS Log-Replay v1 (substrate), the Belief Stack spec (the architectural pattern).

Headline

Maintained state is a planning primitive.

On the same 75-question substrate used in v0.1 and v0.2.2, an agent given only a maintained belief overlay (Arm B, mean 285 input tokens) outperformed an agent given the full raw session log (Arm A, mean 2,037 input tokens) on planning correctness by 8 percentage points.

Arm	Mean input tokens	% of A	Mean wall-sec	Planning correctness
A — raw log K=20, strong baseline prompt	2,037	100%	3.55s	90.7%
B — belief overlay only	285	14%	1.11s	98.7%
C — overlay + K=3 scratchpad	592	29%	1.29s	94.7%

The architecture is now an empirical claim. The model was not under-informed when given less raw context. It was overburdened by reconstruction when given more. Maintained state with provenance and lifecycle reduced the planning-error rate and the token cost and the latency.

What v0.3 tested

v0.1 and v0.2.2 demonstrated the inspection-side case: a maintained belief overlay reduces workflow-state errors when added to the raw recent log. Those were additive comparisons (System B had everything System A had, plus the overlay).

v0.3 tested the planning-side case under a substitutive comparison.

The three arms

Arm A — Raw Context Large (strong baseline). Full K=20 raw session log + a system prompt that explicitly instructs reconstruction of workflow state from the raw history before answering. Designed so the result cannot be dismissed as a poorly-prompted baseline.
Arm B — Belief Overlay Only. The compact §3.5a-deduped belief overlay (budget 500 tokens, the comfortable arm from v0.2.2). No raw log. No scratchpad. Substitutive.
Arm C — Overlay + Minimal Scratchpad. The same overlay plus the last K=3 turns of the session for execution-time detail.

Locked research question

Can an agent perform a planning task using a maintained belief state with substantially less raw context, at equal or better quality than an agent using larger raw context?

Bounded claim

Not: agents can plan from belief state alone.

But: maintained belief state can support planning with less raw context, while preserving provenance and lifecycle signals that generic summaries lack.

Fair-comparison constraint (load-bearing)

Arm A received the same underlying evidence that the belief overlay was derived from — the raw K=20 session log. The only difference between arms was whether that evidence arrived as raw history (Arm A) or as maintained state (Arms B/C). Without this constraint, the comparison would measure substrate-builder effort instead of the belief-state architecture.

Locked decisions

Substrate reused from v0.1 / v0.2.2 (same 75 questions, 164 sessions, 13,481 belief instances). Single-next-action evaluation (D5). Deterministic gate primary, LLM judge as the answer-side classifier under oracle-wins disagreement policy (D6). Token gate: ≥ 50% input-token reduction for the belief arms (D7). Quality gate: belief arm within 2 percentage points of Arm A on aggregate planning correctness, or beats Arm A outright (D8). Full table in the locked pre-registration.

Context and feasibility results

Token economy — the D7 gate

Arm	Mean	p50	p90	Max	% of Arm A	D7 (≤50%)
A	2,037	1,662	3,416	4,637	100%	baseline
B	285	139	242	331	14%	✓ pass by 36 pts
C	592	358	895	2,223	29%	✓ pass by 21 pts

The D7 token gate is not just passing; it is irrelevant. Arm B is 13× smaller than Arm A on average input tokens. The pre-registration concern that “materially fewer tokens” might be unfalsifiable is gone — the reductions are an order of magnitude, not a few percent.

Latency — the reconstruction tax, quantified

Arm	Mean wall-sec / call	Speedup vs A
A	3.55s	1.0×
B	1.11s	3.20×
C	1.29s	2.75×

Arm B answers in 31% of the time Arm A takes. This is the reconstruct-world-model-every-step tax measured directly. The model with maintained state does not need to do the reconstruction work — it has the state already, and the inference call is correspondingly shorter.

The mean output tokens also shrink — Arm A produces ~105 output tokens per answer (it narrates through the reconstruction the prompt asked for); Arm B produces ~50 (it answers directly from the belief overlay). The compression is bidirectional: shorter prompts, shorter answers, faster turn-around.

Generation feasibility

All 225 generations clean. No feasibility failures of any kind — 0 / 75 / 75 / 75 across A / B / C. No context_too_long, no rate-limit retries.

Deterministic results

Aggregate planning correctness (paired n=75)

Arm	Errors / applicable	Error rate	Correctness	Δ vs Arm A
A	7 / 75	9.3%	90.7%	—
B	1 / 75	1.3%	98.7%	+8.0 pts
C	4 / 75	5.3%	94.7%	+4.0 pts

Both belief arms clear the D8 quality gate (within 2 pts of Arm A or beat outright) by beating Arm A outright. Arm B’s margin is large: a single committed failure mode across 75 paired questions, versus seven for the raw-context baseline.

Per-metric breakdown (each metric applies to 15 questions)

Failure mode	A	B	C
stale_validation_assumption	7% (1/15)	0% (0/15)	0% (0/15)
repeated_failure_loop	7% (1/15)	0% (0/15)	0% (0/15)
premature_action	13% (2/15)	0% (0/15)	7% (1/15)
false_completion_claim	7% (1/15)	7% (1/15)	13% (2/15)
missing_pause	13% (2/15)	0% (0/15)	7% (1/15)

Arm B beats Arm A on every metric except false_completion_claim, where the two are tied at 7%. The four metrics B wins clean (stale_validation, repeated_failure_loop, premature_action, missing_pause) are exactly the failure modes that maintained-state-with- lifecycle is designed to surface. The architecture is doing what the spec says it does.

Grounding-bankruptcy candidates

Zero. Across all 75 paired questions, there is not a single case where Arm B fails and both Arm A and Arm C succeed. The pre-registered “B falls apart on execution detail; C recovers” hypothesis is rejected at this n. The belief overlay was sufficient — adding scratchpad did not unlock any case that overlay alone could not handle.

The unexpected result — Arm B beats Arm C

The pre-registration anticipated that Arm C (overlay + scratchpad) would be the production-realistic architecture and secondary hypothesis. The actual result inverts the prediction:

B: 98.7% correctness, 285 tokens, 1.11s
C: 94.7% correctness, 592 tokens, 1.29s
B beats C by 4 percentage points at half the tokens.

The scratchpad didn’t help. It very slightly hurt.

Where C loses ground

C commits 1 premature_action (vs B’s 0): the scratchpad’s recent-turn detail nudged the model toward action when the belief overlay alone correctly held it back.
C commits 1 extra false_completion_claim (2 vs B’s 1): the scratchpad’s tool-output detail nudged the model toward declaring done when the belief overlay alone correctly identified pending state.
C commits 1 missing_pause (vs B’s 0): same pattern — recent-turn detail competed with the maintained pause-signal in the overlay.

Interpretation

The K=3 scratchpad is small enough to fit the token budget but large enough to compete with the maintained state for the model’s attention. The scratchpad re-introduces a slice of the reconstruction tax — the model has to integrate two views of the workflow (overlay + recent turns) rather than acting cleanly from one (overlay alone).

This does not mean scratchpad is never useful. But for this dataset, this task, this model, the pure-overlay arm is strictly better than the hybrid. The “production-realistic” framing of Arm C in the pre-registration may itself need revision: on bounded planning decisions where the belief overlay carries the relevant state, scratchpad is overhead.

What this proves — and does not

Claims (load-bearing)

Maintained belief state outperforms raw context on a bounded planning task. On 75 paired single-next-action questions across the v0.1 substrate, Arm B (overlay-only, 14% of Arm A’s input tokens) reached 98.7% correctness vs Arm A’s 90.7%.
The reconstruct-world-model-every-step tax is real and measurable. Arm B answers in 31% of Arm A’s wall time and uses 14% of the input tokens. The difference is not optimization; it is the absence of reconstruction work.
Compression with provenance and lifecycle is not the same as compression. Arm B is not a summary of Arm A. It is the maintained-state form of the same underlying evidence — claim + warrant + lifecycle, ranked, deduped per §3.5a. The performance gap (+8 pts) is not explained by summary compression; it is the absence of stale-reconstruction noise.
For this dataset, belief state was sufficient. Adding execution-time scratchpad (Arm C) did not recover any case that the overlay alone missed. Grounding-bankruptcy candidates = 0.
The v0.2.2 result extends from inspection to planning. v0.2.2 showed smallest-overlay-wins on inspection-side workflow-state questions. v0.3 shows overlay-only-wins on planning-side single-next-action questions. The architectural claim is the same in both: maintained state is the load-bearing object; raw context is the reconstruction cost.

Does not claim

Not “agents can plan from belief state alone in general.” This is a 75-question, single-substrate, single-model, single-task-type result. Other planning surfaces (multi-step plans, novel domains, longer horizons, different models) are not yet measured.
Not “scratchpad is always unhelpful.” Arm C still beat Arm A by 4 points; the scratchpad isn’t actively harmful, just outperformed by the cleaner overlay-only arm on this surface.
Not “Belief Stack is production-ready.” v0.3 is run-complete and locked. The runtime sidecar that would maintain this state at inference time is sketched (TKOS-001 / TKOS-002) but not yet implemented past the read-path slice.
Not “this generalizes across model families.” Same gpt-4o family as v0.1 / v0.2.2 by design; cross-model validation is a separate experiment.

Design implications

The architectural claim is now evidence-backed

The phrase “maintained state is a planning primitive” is no longer rhetorical. v0.3 demonstrates it on a bounded surface with a clean comparison: when a model has maintained belief state available, it does not need to reconstruct that state from raw history — and it answers planning questions more correctly, faster, and on fewer tokens.

The two-surface framing strengthens, not narrows

v0.3 makes the AI side measurable. The substrate is unchanged. The same belief state that v0.3’s Arm B reads through overlay() is what a human reads through state() / timeline() / explain(). One substrate. Two peer query surfaces. v0.3 just supplies the empirical case for the agent side.

Scratchpad is not the obvious next step

The pre-registration framed Arm C as “the realistic production architecture.” v0.3’s result invites a reconsideration: for bounded planning on this substrate, the pure-overlay architecture is strictly better. Scratchpad may turn out to be necessary for execution (where the agent has to use specific filenames, line numbers, exact errors) — but not for planning (where the agent has to decide whether and what to do next). The state/scratchpad distinction sharpens: scratchpad is for action, not for decision.

The “context compression” framing is still misleading

Even with this strong result, the value proposition is not “we compress the context.” It is:

Maintained state replaces the work the model would otherwise do implicitly — reconstruct what’s true now, given the history — with an explicit, inspectable object the model can read from directly.

That difference shows up in three dials simultaneously: planning correctness ↑, tokens ↓, latency ↓. A naive summary compresses tokens but does not give the model anything to read from instead of reconstruct. Belief state does.

v0.4 directions

In rough priority order, each motivated by something v0.3 surfaced:

Multi-step planning. v0.3 tests single-next-action. The next experiment should test whether belief state supports a planning task that requires multiple decisions in sequence, with state evolution between them.
Cross-model validation. Same v0.3 experiment, different generator families (Claude, Gemini, smaller models). Tests whether the “model was overburdened by reconstruction” effect is gpt-4o-specific or a general property of how LLMs handle workflow state.
The scratchpad-vs-belief question, properly. v0.3’s surprising B > C result deserves its own experiment. Tasks where scratchpad-level detail is genuinely needed for correctness; test whether overlay + scratchpad recovers cases overlay alone misses.
Runtime sidecar build. The TKOS-002 read-path slice demonstrated the dual-consumer substrate at fixture level. v0.4 candidate: extend with a write-path rule engine that derives belief instances from a real event stream.
Planning-quality LLM judge for explanation. A secondary outcome — does the answer cite the right belief? Tests whether the model is grounded in the maintained state versus happening to produce the right answer for the wrong reasons.

Closing

The recent log shows what happened.

The belief state shows what the system was relying on.

v0.3 adds the planning-side line:

An agent given the belief state directly does not need to reconstruct what’s true — and answers more correctly, faster, and on a fraction of the tokens.

Every agentic system today pays the reconstruct-world-model-every-step tax. v0.3 is the first direct measurement of what that tax costs, and the first evidence that a maintained belief state can lift it.

The narrow claim survives at maximum strength: maintained belief state with provenance and lifecycle is a planning primitive distinct from context summarization, from agent memory, and from log-based observability. It is its own object.

Follow the research

Occasional updates on Belief Stack, TopicSpace case studies, and runtime belief-state evaluation.

I'll send notes when there's a new spec, case study, methodology update, or major finding — not a weekly newsletter for the sake of it.