topicspace.ai · research · case study

Belief Stack v0.4c1 — cross-model replication

Maintained state beats raw reconstruction on every one of four tested models across three providers. The compression-vs-substrate isolation narrows to model-dependent.

locked · run-complete · 2026-06-05 · v0.4c1.1

Pre-registration: /research/pre-registrations/belief-stack-v0-4c1
Predecessors: v0.3 (planning-side baseline), the Belief Stack spec. v0.4a (mechanism ablation) and v0.4a.2 (compression control) ran on a single model; v0.4c1 holds the substrate constant and varies the generator across four models from three providers.

Headline

99.0% correctness at 241 mean input tokens, vs 89.3% at 2,502 tokens for raw history.
Across four LLMs from three providers, sparse substrate-derived maintained-state projections used roughly one-tenth the input tokens of raw history while improving planning correctness. Compressed raw history (Arm A′) reached 93.3% at 358 tokens.

On the same 75 paired single-next-action planning questions used in v0.3 and v0.4a, four LLMs from three providers (gpt-4o-2024-08-06, claude-opus-4-7, gemini-2.5-pro, claude-haiku-4-5-20251001) each consumed all four arms. 1,200 cells total, zero failures. The cross-model averages:

Arm	Mean input tokens	% of A	Planning correctness
A — raw K=20 log + strong baseline	2,502	100%	89.3%
A′ — LLM-compressed raw log	358	14%	93.3%
B — LLM-summarized substrate	316	13%	97.3%
C — sparse `belief_type :: claim`	241	10%	99.0%

Every maintained-state arm improved over raw history directionally on every model. The separation from compressed raw history was model-dependent: it held clearly for Opus, partially for GPT-4o, and failed on Gemini Pro with thinking. On Haiku the comparison inverted at the baseline: the compressed arm (A′, 93.3%) fell below the raw log (A, 97.3%). The result supports maintained state over raw reconstruction, while narrowing the compression-control claim.

What v0.4c1 tested

v0.3 measured a maintained-state projection (Arm B) beating raw context (Arm A) by 8 percentage points on one model. v0.4a ran a five-arm mechanism ladder on the same model and found that the planning-side win did not come from rendering richer warrant or lifecycle metadata in the planner’s context: the most stripped-down projection (Arm C, bare belief_type :: claim) was Pareto-dominant. v0.4a.2 added a compression control (Arm A′) and found that LLM compression of raw context alone did not recover the lift on that one model.

The natural next reviewer question: is this a single-model artifact? v0.4c1 holds the substrate, the evaluation set, the prompts, and the scoring methodology constant, and varies only the generator across four models from three providers.

The four-arm design (no D, no E)

The mechanism ladder from v0.4a is not re-run cross-model. v0.4c1 runs the subset that tests the thesis:

Arm A — raw K=20 log + strong baseline reconstruction prompt
Arm A′ — LLM compression of the raw K=20 log at the same ~285-token budget as Arm B (compression control)
Arm B — LLM prose summary of the §3.5a-deduped substrate at the same budget
Arm C — bare belief_type :: claim per cluster — the Pareto reference from v0.4a

The four models

Model	Provider	Locked config
gpt-4o-2024-08-06	OpenAI	T=0, seed=20260601, full v0.4a parity
claude-opus-4-7	Anthropic	API rejects `temperature` parameter; default sampling; no seed
gemini-2.5-pro	Google	T=0, seed=20260601, `thinking_budget=2048` (required by API)
claude-haiku-4-5-20251001	Anthropic	T=0; no seed
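
The locked configurations in the table can be held as plain data, so that a provider constraint (such as Opus rejecting `temperature`) is honored by omission rather than by hand. The following is a minimal sketch, not the v0.4c1 harness: the `LOCKED` table and the `sampling_kwargs` helper are hypothetical names, and the dict keys are illustrative labels rather than any provider SDK's actual parameter names.

```python
# Hypothetical sketch of the locked per-model sampling configuration.
# Values are taken from the table above; the layout and helper are assumptions.
LOCKED = {
    "gpt-4o-2024-08-06": {"temperature": 0, "seed": 20260601},
    # API rejects `temperature`: default sampling, no seed, so nothing is sent.
    "claude-opus-4-7": {},
    "gemini-2.5-pro": {"temperature": 0, "seed": 20260601, "thinking_budget": 2048},
    # T=0, no seed.
    "claude-haiku-4-5-20251001": {"temperature": 0},
}


def sampling_kwargs(model: str) -> dict:
    """Return a copy of the sampling parameters locked for this model.

    A model outside the locked set raises KeyError, so an unlisted
    generator cannot enter a run silently.
    """
    return dict(LOCKED[model])


for name in LOCKED:
    print(name, sampling_kwargs(name))
```

Returning a copy keeps the locked table from being mutated by a caller that adds request-specific fields.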

Two provider-specific behaviors surfaced at build-time — Opus’s temperature rejection and Gemini’s thinking-mode requirement — and were honored by an explicit re-lock from v0.4c1 to v0.4c1.1 before any data flowed. Anti-curation discipline was preserved throughout: all 1,200 contexts generated before any answer-generation calls, all 1,200 answers generated before any judge calls, (qid, arm, model) tuples shuffled with a fixed seed.
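
The ordering discipline described above (every context before any answer, every answer before any judge call, cells shuffled with a fixed seed) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the three phase callables are stand-ins, and the shuffle seed shown is assumed (the text says only that a fixed seed was used).

```python
import itertools
import random

MODELS = [
    "gpt-4o-2024-08-06",
    "claude-opus-4-7",
    "gemini-2.5-pro",
    "claude-haiku-4-5-20251001",
]
ARMS = ["A", "A′", "B", "C"]
QIDS = range(75)
SHUFFLE_SEED = 20260601  # assumption: the actual shuffle seed is not stated


def shuffled_cells(seed: int = SHUFFLE_SEED) -> list:
    """Enumerate every (qid, arm, model) cell and shuffle deterministically."""
    cells = list(itertools.product(QIDS, ARMS, MODELS))
    random.Random(seed).shuffle(cells)
    return cells


def run_in_locked_order(cells, make_context, generate_answer, judge) -> dict:
    """Finish each phase for all cells before the next phase starts."""
    contexts = {cell: make_context(cell) for cell in cells}
    answers = {cell: generate_answer(contexts[cell]) for cell in cells}
    return {cell: judge(answers[cell]) for cell in cells}


cells = shuffled_cells()
print(len(cells))  # 75 questions x 4 arms x 4 models
```

Using a private `random.Random(seed)` instance rather than the module-level generator keeps the order reproducible even if other code draws random numbers in between.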

The judge stays constant

The LLM judge (gpt-5-mini-2025-08-07, reasoning effort medium, fixed seed) is held constant across all four phases of the program. Only the generator varies in v0.4c1. The deterministic oracle does the primary scoring under an oracle-wins-on-disagreement policy.
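
An oracle-wins-on-disagreement policy is simple to state in code. The sketch below assumes that both the oracle and the judge emit a boolean verdict for every cell; the `Verdict` type and `score` function are hypothetical names, not part of the published harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    correct: bool       # final score that feeds the correctness tables
    disagreement: bool  # True when the oracle and the judge differed


def score(oracle_correct: bool, judge_correct: bool) -> Verdict:
    """The oracle's verdict is always final; the judge only flags disagreement."""
    return Verdict(
        correct=oracle_correct,
        disagreement=oracle_correct != judge_correct,
    )


print(score(oracle_correct=True, judge_correct=False))
```

Recording the disagreement flag alongside the final score keeps the judge's signal available for audit without letting it change the headline numbers.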

Per-model results

Paired n=75 per (model, arm).

Model	A	A′	B	C
gpt-4o-2024-08-06	89.3%	93.3%	96.0%	100.0%
claude-opus-4-7	88.0%	92.0%	98.7%	98.7%
gemini-2.5-pro	82.7%	94.7%	94.7%	97.3%
claude-haiku-4-5-20251001	97.3%	93.3%	100.0%	100.0%

Every B−A and C−A delta is positive. Maintained state beats raw history on every model tested. The magnitudes vary from +2.7 pp (Haiku) to +14.7 pp (Gemini Arm C).
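
The deltas and the cross-model averages follow directly from the per-model table. The sketch below recovers per-arm correct counts from the published percentages at n=75 (an inference: the raw counts are not listed here) and recomputes the figures quoted in the text. All function names are illustrative.

```python
N = 75  # paired questions per (model, arm)

# Published per-model correctness, in percent.
PCT = {
    "gpt-4o-2024-08-06": {"A": 89.3, "A′": 93.3, "B": 96.0, "C": 100.0},
    "claude-opus-4-7": {"A": 88.0, "A′": 92.0, "B": 98.7, "C": 98.7},
    "gemini-2.5-pro": {"A": 82.7, "A′": 94.7, "B": 94.7, "C": 97.3},
    "claude-haiku-4-5-20251001": {"A": 97.3, "A′": 93.3, "B": 100.0, "C": 100.0},
}

# Assumption: each percentage is a rounded k/75, so k can be recovered by rounding.
COUNTS = {
    model: {arm: round(p * N / 100) for arm, p in row.items()}
    for model, row in PCT.items()
}


def delta_pp(model: str, arm: str, base: str = "A") -> float:
    """Difference in percentage points between an arm and the baseline arm."""
    c = COUNTS[model]
    return round(100 * (c[arm] - c[base]) / N, 1)


def cross_model_mean(arm: str) -> float:
    """Pooled correctness for one arm across all models, in percent."""
    total = sum(COUNTS[m][arm] for m in COUNTS)
    return round(100 * total / (N * len(COUNTS)), 1)


for model in COUNTS:
    print(model, {arm: delta_pp(model, arm) for arm in ("A′", "B", "C")})
print({arm: cross_model_mean(arm) for arm in ("A", "A′", "B", "C")})
```

Because every cell has the same n, the pooled figure equals the plain average of the four per-model percentages.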

Arm C as the cross-model Pareto reference

v0.4a identified Arm C as Pareto-dominant on one model. v0.4c1 replicates that dominance across the model field:

gpt-4o: Arm C is strictly Pareto-dominant (highest correctness, fewest input tokens, lowest wall-clock time)
Opus: Arm C ties Arm B on correctness (98.7%) but uses fewer tokens
Gemini: Arm C leads on correctness (97.3% vs B’s 94.7%) and ties B on tokens, so it Pareto-dominates B
Haiku: Arm C ties Arm B on correctness (100.0%) with fewer tokens

Arm C is the Pareto winner or tied for Pareto winner on all four models. The minimum-sufficient-state claim from v0.4a generalizes cleanly across the model field.
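
For readers who want the Pareto claim spelled out, the sketch below applies the standard dominance test to the cross-model averages from the headline table (input tokens: lower is better; correctness: higher is better). Per-model token counts are not listed in this write-up, so the per-model statements above cannot be re-derived this way. The function names are illustrative.

```python
# (mean input tokens, planning correctness %) from the cross-model table.
ARMS = {
    "A": (2502, 89.3),
    "A′": (358, 93.3),
    "B": (316, 97.3),
    "C": (241, 99.0),
}


def dominates(x: tuple, y: tuple) -> bool:
    """x dominates y if it is no worse on both axes and strictly better on one."""
    (tokens_x, correct_x), (tokens_y, correct_y) = x, y
    no_worse = tokens_x <= tokens_y and correct_x >= correct_y
    strictly_better = tokens_x < tokens_y or correct_x > correct_y
    return no_worse and strictly_better


def pareto_front(arms: dict) -> list:
    """Arms that no other arm dominates."""
    return [
        name
        for name, value in arms.items()
        if not any(dominates(other, value) for o, other in arms.items() if o != name)
    ]


print(pareto_front(ARMS))
```

On these averages Arm C is the only non-dominated arm: it has both the fewest tokens and the highest correctness.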

The pre-registered classifier — what survives and what narrows

v0.4c1 pre-registered both a per-model classifier (five outcome classes) and a cross-model classifier (does the v0.4a.2 compression-control finding generalize?). The classifier was locked before any data flowed and run mechanically against the per-model deltas.

Per-model outcomes

Model	Class	Label
gpt-4o-2024-08-06	2	partial replication
claude-opus-4-7	1	full replication
gemini-2.5-pro	3	compression equivalent
claude-haiku-4-5-20251001	0	unclassified

Cross-model classifier outcome

The classifier returned:

compression_finding_does_not_generalize

Triggered by Gemini’s class-3 result. The locked action commitment from §7 of the pre-registration:

“The maintained-state-vs-raw lift holds across models, but compression-vs-substrate isolation depends on model behavior.”

The classifier did its job: it forced an honest caveat. On Gemini, A′ reaches 94.7%, identical to B at 94.7%. On Haiku, A′ (93.3%) is actually worse than A (97.3%): Haiku appears to lose information when summarizing its own raw log. The cross-model isolation of the v0.4a.2 compression-control finding does not hold cleanly. The thesis (maintained state beats raw reconstruction) does.
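
The one cross-model rule this write-up describes (a class-3 result on any model triggers `compression_finding_does_not_generalize`) can be sketched as below. This is deliberately partial: the per-model thresholds and the other cross-model outcomes live in the pre-registration and are not reproduced here, so the fallback label is a placeholder invented for this sketch, and treating any single class-3 model as sufficient is an assumption read off the sentence above.

```python
# Per-model classes from the outcomes table.
PER_MODEL_CLASS = {
    "gpt-4o-2024-08-06": 2,          # partial replication
    "claude-opus-4-7": 1,            # full replication
    "gemini-2.5-pro": 3,             # compression equivalent
    "claude-haiku-4-5-20251001": 0,  # unclassified
}

COMPRESSION_EQUIVALENT = 3


def cross_model_outcome(classes: dict) -> tuple:
    """Return (label, triggering models) for the single rule described in the text."""
    triggers = sorted(m for m, c in classes.items() if c == COMPRESSION_EQUIVALENT)
    if triggers:
        return "compression_finding_does_not_generalize", triggers
    # Placeholder: the remaining pre-registered outcomes are not modeled here.
    return "outcome_not_covered_by_this_sketch", []


print(cross_model_outcome(PER_MODEL_CLASS))
```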

The research arc — surviving three pre-registered challenges

The empirical contribution is best read not as a single headline number but as an arc of pre-registered challenges, each of which gave the thesis an explicit opportunity to die.

v0.4a challenged the spec’s discipline as the mechanism. Result: the warrant/lifecycle rendering in the planner’s context was not load-bearing; the upstream substrate transformation was.
v0.4a.2 challenged the compression-of-raw alternative. Result: on gpt-4o, compression alone did not recover the lift.
v0.4c1 challenged the model-specific-artifact alternative. Result: the thesis held on all four tested models; the compression-vs-substrate isolation narrowed to model-dependent.

After three independent opportunities to collapse under pre-registered alternative hypotheses, the surviving statement is:

Maintained state beats raw reconstruction across all four tested models, on this substrate, at this budget.

The credibility of this statement comes from the survival pattern — the thesis was given multiple chances to die and didn’t — not from the magnitude of the headline number.

A second evidentiary line: Gemini thinking telemetry

Gemini Pro is the only thinking-mode model in the set. Its internal thinking tokens vary by arm:

Arm	Mean thinking tokens (Gemini)
A — raw K=20 log	954
A′ — LLM-compressed raw log	628
B — LLM-summarized substrate	713
C — sparse structured state	691

Gemini spent the most thinking tokens on the raw-history arm. Relative to Arm A, the substrate-projected arms used about 25% (B) and 28% (C) fewer thinking tokens, and the compressed raw log (A′) used about 34% fewer. This is a single-model observation, not a controlled comparison. It is consistent with the reconstruction-tax framing: when context already projects current state, the model spends less time internally reconstructing it.
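
The relative reductions can be recomputed from the telemetry table in a few lines. This is plain arithmetic on the published means; the names are illustrative.

```python
# Mean Gemini thinking tokens per arm, from the table above.
THINKING = {"A": 954, "A′": 628, "B": 713, "C": 691}


def reduction_vs_raw(arm: str) -> float:
    """Percent fewer thinking tokens than the raw-history arm (A)."""
    base = THINKING["A"]
    return round(100 * (base - THINKING[arm]) / base, 1)


for arm in ("A′", "B", "C"):
    print(arm, reduction_vs_raw(arm))
```

The largest reduction is on the compressed raw log rather than on the substrate-derived arms, which is worth keeping in view when reading the telemetry as support for any one mechanism.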

A working hypothesis (not a finding from this experiment): a thinking phase may functionally substitute for some of the substrate transformation. Where a non-thinking model needs the rule-engine-derived view to surface what is currently true, a thinking model can reconstruct from a less-structured prose summary during its thinking phase. The reconstruction tax then shifts from context-time to inference-time rather than disappearing.

Resolving this hypothesis requires a Gemini-thinking-budget ablation, or a within-family comparison to gemini-2.5-flash (non-thinking). That is adjacent future work, not within v0.4c1 scope.

What this proves — and does not

Claims (load-bearing)

Maintained state beats raw context across four LLM models from three providers. Every B−A and C−A delta is positive on every model. The cross-model averages: 99.0% (Arm C) and 97.3% (Arm B) vs 89.3% (Arm A).
Sparse structured state is the cross-model Pareto reference. Arm C is the Pareto winner or tied for winner on all four models, on both correctness and input-token cost. The minimum-sufficient-state finding from v0.4a generalizes.
The thesis survived three pre-registered alternative hypotheses. v0.4a (mechanism), v0.4a.2 (compression), v0.4c1 (model generality) each gave the thesis a chance to collapse. It didn’t.

Narrows

Compression-vs-substrate isolation is model-dependent. v0.4a.2 measured a clean separation between substrate-derived and compression-of-raw arms on gpt-4o. v0.4c1 shows this separation holds clearly on Opus, partially on gpt-4o, and fails on Gemini Pro with thinking. On Haiku the compressed arm (A′) falls below the raw-log baseline (A), so the control itself misbehaves. The substrate-machinery-is-load-bearing claim survives at the planning level; the compression-control isolation narrows.

Does not claim

Not “maintained state beats raw context in general.” This is a four-model, one-substrate, one-budget result on a single-next-action planning task. The single-substrate caveat is the largest remaining live risk and is what cross-substrate replication (v0.4c2) will address.
Not “Gemini’s thinking phase compensates for compression.” That is a working hypothesis consistent with the thinking-token telemetry, not a finding. The Gemini-thinking ablation is required to test it directly.
Not “richer projections never help.” v0.4c1 ran the four-arm subset (A / A′ / B / C); the full mechanism ladder (Arms D and E) was not re-run cross-model. Whether warrant or lifecycle rendering gains value on other models is unmeasured.
Not “Belief Stack is production-ready.” v0.4c1 is run-complete and locked. Live extraction, multi-step planning, and end-to-end maintenance economics remain open questions.

Design implications

The cross-model picture

The phrase “maintained state is a planning primitive” now has three independent empirical supports across four LLMs. The substrate transformation appears load-bearing on non-thinking models. The bare-name projection format is the cross-model Pareto reference. The compression-control finding narrows to model-dependent, but the planning lift over raw history is universal across the tested model field.

Arm C is the projection format the spec should center

The minimum-sufficient projection — bare belief_type :: claim, dedup-ranked and budget-bounded — is the reference shape the spec should describe as the planner-facing default. Warrant fields and lifecycle markers belong in the substrate (where rich content is maintained) and in the human-inspection surface (where structure is read for trust), not in the bare planner projection.
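
A planner-facing projection of this shape can be sketched in a few lines. Everything here is an assumption made for illustration: the `Belief` record, the numeric `rank` field, exact-match deduplication (a stand-in for the spec's §3.5a dedup, which is not reproduced here), and whitespace word count as a crude token proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    belief_type: str
    claim: str
    rank: float  # hypothetical relevance score; higher is kept first


def approx_tokens(text: str) -> int:
    """Crude proxy for token count (whitespace split), not a real tokenizer."""
    return len(text.split())


def project_arm_c(beliefs: list, budget_tokens: int) -> str:
    """Render `belief_type :: claim` lines, deduplicated, ranked, budget-bounded."""
    seen = set()
    lines = []
    used = 0
    for b in sorted(beliefs, key=lambda b: b.rank, reverse=True):
        key = (b.belief_type, b.claim)
        if key in seen:
            continue  # exact-match dedup only; a real substrate would cluster
        seen.add(key)
        line = f"{b.belief_type} :: {b.claim}"
        cost = approx_tokens(line)
        if used + cost > budget_tokens:
            break  # stop at the first line that would exceed the budget
        lines.append(line)
        used += cost
    return "\n".join(lines)


beliefs = [
    Belief("constraint", "deploy window closes Friday", 0.9),
    Belief("goal", "ship the billing migration", 0.8),
    Belief("constraint", "deploy window closes Friday", 0.5),
    Belief("fact", "staging database is read-only", 0.2),
]
print(project_arm_c(beliefs, budget_tokens=12))
```

With a 12-token budget the example emits the constraint and the goal, skips the duplicate, and drops the lowest-ranked fact. No warrant or lifecycle fields appear in the output, matching the bare projection described above.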

The thinking-mode question is real

Gemini Pro’s compression-equivalence outcome is the most interesting model-specific narrowing the program has surfaced. It is consistent with — and the cleanest signal toward — a hypothesis that thinking phases functionally substitute for some of the substrate transformation by moving the reconstruction tax from context-time to inference-time. This is a research question worth its own experiment.

The remaining publication gate

With cross-model replication complete, the one experiment that remains required for the paper’s strongest empirical claim is cross-substrate replication (v0.4c2). The protocol transfers cleanly; the bottleneck is data sourcing. Once v0.4c2 is complete, the publication gate identified in the paper-scope-discipline memory will be met.

Closing

v0.3 measured a single-model planning lift.

v0.4a showed the lift was not from projection-side discipline.

v0.4a.2 showed it was not from compression alone.

v0.4c1 shows it is not a single-model artifact.

The thesis survived a third pre-registered opportunity to collapse. Across four models from three providers, maintained state remains the strongest planning surface — at roughly one-tenth the input tokens of raw history.
