Belief Stack v0.4c1 — cross-model replication
Maintained state beats raw reconstruction on every one of four tested models across three providers. The compression-vs-substrate isolation narrows to model-dependent.
Predecessors: v0.3 (planning-side baseline), the Belief Stack spec. v0.4a (mechanism ablation) and v0.4a.2 (compression control) ran on a single model; v0.4c1 holds the substrate constant and varies the generator across four models from three providers.
Headline
Across four LLMs from three providers, sparse substrate-derived maintained-state projections used roughly one-tenth the input tokens of raw history while improving planning correctness. Compressed raw history (Arm A′) reached 93.3% at 358 tokens.
On the same 75 paired single-next-action planning questions used in v0.3 and v0.4a, four LLMs from three providers (gpt-4o-2024-08-06, claude-opus-4-7, gemini-2.5-pro, claude-haiku-4-5-20251001) each consumed all four arms. 1,200 cells total, zero failures. The cross-model averages:
| Arm | Mean input tokens | % of A | Planning correctness |
|---|---|---|---|
| A — raw K=20 log + strong baseline | 2,502 | 100% | 89.3% |
| A′ — LLM-compressed raw log | 358 | 14% | 93.3% |
| B — LLM-summarized substrate | 316 | 13% | 97.3% |
| C — sparse belief_type :: claim | 241 | 10% | 99.0% |
Every maintained-state arm improved over raw history directionally on every model. The separation from compressed raw history was model-dependent: it held clearly for Opus, partially for GPT-4o, failed on Gemini Pro with thinking, and reversed on Haiku’s raw-log baseline. The result supports maintained state over raw reconstruction, while narrowing the compression-control claim.
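The "% of A" column follows directly from the mean input tokens. A minimal Python sketch of that arithmetic, using only the figures in the table above (the variable names are illustrative and not taken from the experiment code):

```python
# Mean input tokens per arm, copied from the cross-model table above.
# "A'" stands in for the compressed-raw-history arm (A-prime).
mean_tokens = {"A": 2502, "A'": 358, "B": 316, "C": 241}

# Express each arm's token cost as a whole-number percentage of Arm A.
pct_of_a = {
    arm: round(100 * tokens / mean_tokens["A"])
    for arm, tokens in mean_tokens.items()
}

for arm, pct in pct_of_a.items():
    print(f"{arm}: {pct}% of A")
```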
What v0.4c1 tested
v0.3 measured a maintained-state projection (Arm B) beating raw context (Arm A) by 8 percentage points on one model. v0.4a.1 ran a five-arm mechanism ladder on the same model and found that the planning-side win did not come from rendering richer warrant or lifecycle metadata in the planner’s context — the most stripped-down projection (Arm C, bare belief_type :: claim) was Pareto-dominant. v0.4a.2 added a compression control (Arm A′) and found that LLM compression of raw context alone did not recover the lift on that one model.
The natural next reviewer question: is this a single-model artifact? v0.4c1 holds the substrate, the evaluation set, the prompts, and the scoring methodology constant, and varies only the generator across four models from three providers.
The four-arm design (no D, no E)
The mechanism ladder from v0.4a is not re-run cross-model. v0.4c1 runs the subset that tests the thesis:
- Arm A — raw K=20 log + strong baseline reconstruction prompt
- Arm A′ — LLM compression of the raw K=20 log at the same ~285-token budget as Arm B (compression control)
- Arm B — LLM prose summary of the §3.5a-deduped substrate at the same budget
- Arm C — bare belief_type :: claim per cluster (the Pareto reference from v0.4a)
The four models
| Model | Provider | Locked config |
|---|---|---|
| gpt-4o-2024-08-06 | OpenAI | T=0, seed=20260601, full v0.4a parity |
| claude-opus-4-7 | Anthropic | API rejects temperature parameter; default sampling; no seed |
| gemini-2.5-pro | Google | T=0, seed=20260601, thinking_budget=2048 (required by API) |
| claude-haiku-4-5-20251001 | Anthropic | T=0; no seed |
Two provider-specific behaviors surfaced at build-time — Opus’s temperature rejection and Gemini’s thinking-mode requirement — and were honored by an explicit re-lock from v0.4c1 to v0.4c1.1 before any data flowed. Anti-curation discipline was preserved throughout: all 1,200 contexts generated before any answer-generation calls, all 1,200 answers generated before any judge calls, (qid, arm, model) tuples shuffled with a fixed seed.
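The fixed-seed ordering of cells can be sketched as follows. This is an illustration under stated assumptions, not the experiment's code: the question IDs and the seed value are hypothetical, since the write-up says only that the (qid, arm, model) tuples were shuffled with a fixed seed.

```python
import itertools
import random

# Hypothetical question IDs; the real IDs are not given in this write-up.
qids = [f"q{i:03d}" for i in range(75)]
arms = ["A", "A'", "B", "C"]
models = [
    "gpt-4o-2024-08-06",
    "claude-opus-4-7",
    "gemini-2.5-pro",
    "claude-haiku-4-5-20251001",
]

def build_shuffled_cells(seed: int) -> list[tuple[str, str, str]]:
    """Enumerate every (qid, arm, model) cell, then shuffle with a fixed seed."""
    cells = list(itertools.product(qids, arms, models))
    # A local Random instance keeps the order reproducible and leaves
    # the global random state untouched.
    random.Random(seed).shuffle(cells)
    return cells

# Assumption: 12345 is an illustrative seed, not the one used in the study.
cells = build_shuffled_cells(seed=12345)
print(len(cells))  # 75 questions x 4 arms x 4 models
```

Because the order depends only on the seed, each stage (contexts, then answers, then judge calls) can walk the same list without any chance to reorder cells after seeing results.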
The judge stays constant
The LLM judge (gpt-5-mini-2025-08-07, reasoning effort medium, fixed seed) is held constant across all four phases of the program. Only the generator varies in v0.4c1. The deterministic oracle does the primary scoring under an oracle-wins-on-disagreement policy: where oracle and judge disagree, the oracle’s verdict stands.
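One way to picture the oracle-wins-on-disagreement policy is the sketch below. The `Verdict` type and `score_cell` function are hypothetical names introduced for illustration; the write-up does not describe how the real scoring code is structured.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    correct: bool       # the score that counts toward planning correctness
    disagreement: bool  # True when the oracle and the LLM judge differed

def score_cell(oracle_correct: bool, judge_correct: bool) -> Verdict:
    """Score one cell: the oracle's verdict always stands.

    The judge's verdict is kept only to flag disagreements for review.
    """
    return Verdict(
        correct=oracle_correct,
        disagreement=(oracle_correct != judge_correct),
    )

# Example: the judge accepts an answer that the oracle rejects.
print(score_cell(oracle_correct=False, judge_correct=True))
```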
Per-model results
Paired n=75 per (model, arm).
| Model | A | A′ | B | C |
|---|---|---|---|---|
| gpt-4o-2024-08-06 | 89.3% | 93.3% | 96.0% | 100.0% |
| claude-opus-4-7 | 88.0% | 92.0% | 98.7% | 98.7% |
| gemini-2.5-pro | 82.7% | 94.7% | 94.7% | 97.3% |
| claude-haiku-4-5-20251001 | 97.3% | 93.3% | 100.0% | 100.0% |
Every B−A and C−A delta is positive. Maintained state beats raw history on every model tested. The magnitudes vary from +2.7 pp (Haiku) to +14.7 pp (Gemini Arm C).
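The deltas and cross-model averages can be recomputed from the table. The sketch below assumes each percentage is a whole number of correct answers out of the 75 paired questions, so it first recovers counts and then takes differences. Working from counts avoids a 0.1 pp rounding gap that appears when subtracting the already-rounded percentages.

```python
N = 75  # paired questions per (model, arm)

# Planning correctness (%), copied from the per-model table above.
pct = {
    "gpt-4o-2024-08-06": {"A": 89.3, "A'": 93.3, "B": 96.0, "C": 100.0},
    "claude-opus-4-7": {"A": 88.0, "A'": 92.0, "B": 98.7, "C": 98.7},
    "gemini-2.5-pro": {"A": 82.7, "A'": 94.7, "B": 94.7, "C": 97.3},
    "claude-haiku-4-5-20251001": {"A": 97.3, "A'": 93.3, "B": 100.0, "C": 100.0},
}

# Assumption: every percentage is (correct answers) / N, so counts are integers.
counts = {
    model: {arm: round(p * N / 100) for arm, p in row.items()}
    for model, row in pct.items()
}

def delta_pp(model: str, arm: str, baseline: str = "A") -> float:
    """Correctness difference between two arms, in percentage points."""
    return round(100 * (counts[model][arm] - counts[model][baseline]) / N, 1)

deltas = {m: {"B-A": delta_pp(m, "B"), "C-A": delta_pp(m, "C")} for m in counts}

# Cross-model mean correctness per arm, pooled over all four models.
cross_model_mean = {
    arm: round(100 * sum(row[arm] for row in counts.values()) / (N * len(counts)), 1)
    for arm in ["A", "A'", "B", "C"]
}

for model, d in deltas.items():
    print(model, d)
print(cross_model_mean)
```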
Arm C as the cross-model Pareto reference
v0.4a identified Arm C as Pareto-dominant on one model. v0.4c1 replicates that dominance across the model field:
- gpt-4o: Arm C is strictly Pareto-dominant (highest correctness, fewest input tokens, lowest wall-clock time)
- Opus: Arm C ties Arm B on correctness (98.7%) but uses fewer tokens
- Gemini: Arm C is Pareto-dominant over Arm B, with higher correctness (97.3% vs 94.7%) at the same token cost
- Haiku: Arm C ties Arm B on correctness (100.0%) with fewer tokens
Arm C is the Pareto winner or tied for Pareto winner on all four models. The minimum-sufficient-state claim from v0.4a generalizes cleanly across the model field.
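The Pareto comparison can be made explicit with a small dominance check. Per-model token counts are not reported in this write-up, so the sketch below uses the cross-model averages from the headline table. The `dominates` helper is an illustrative name, and its definition (no worse on both axes, strictly better on at least one) is the standard one rather than anything quoted from the pre-registration.

```python
# (mean input tokens, planning correctness %) per arm, cross-model averages.
arms = {
    "A": (2502, 89.3),
    "A'": (358, 93.3),
    "B": (316, 97.3),
    "C": (241, 99.0),
}

def dominates(p: tuple[float, float], q: tuple[float, float]) -> bool:
    """True if p is no worse than q on both axes and strictly better on one.

    Fewer tokens is better; higher correctness is better.
    """
    tokens_p, correct_p = p
    tokens_q, correct_q = q
    no_worse = tokens_p <= tokens_q and correct_p >= correct_q
    strictly_better = tokens_p < tokens_q or correct_p > correct_q
    return no_worse and strictly_better

# An arm is on the Pareto front if no other arm dominates it.
pareto_front = [
    a for a in arms
    if not any(dominates(arms[b], arms[a]) for b in arms if b != a)
]
print(pareto_front)
```

Under this definition, two arms that tie on correctness are separated by token cost, which is how a "tied on correctness, fewer tokens" result still counts as a Pareto win.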
The pre-registered classifier — what survives and what narrows
v0.4c1 pre-registered both a per-model classifier (five outcome classes) and a cross-model classifier (does the v0.4a.2 compression-control finding generalize?). The classifier was locked before any data flowed and run mechanically against the per-model deltas.
Per-model outcomes
| Model | Class | Label |
|---|---|---|
| gpt-4o-2024-08-06 | 2 | partial replication |
| claude-opus-4-7 | 1 | full replication |
| gemini-2.5-pro | 3 | compression equivalent |
| claude-haiku-4-5-20251001 | 0 | unclassified |
Cross-model classifier outcome
compression_finding_does_not_generalize
Triggered by Gemini’s class-3 result. The locked action commitment from §7 of the pre-registration:
“The maintained-state-vs-raw lift holds across models, but compression-vs-substrate isolation depends on model behavior.”
The classifier did its job by forcing an honest caveat. On Gemini, A′ reaches 94.7%, identical to B at 94.7%. On Haiku, A′ (93.3%) is worse than A (97.3%), which suggests Haiku loses information when summarizing its own raw log. The cross-model isolation of the v0.4a.2 compression-control finding does not hold cleanly. The thesis (maintained state beats raw reconstruction) does.
The research arc — surviving three pre-registered challenges
The empirical contribution is best read not as a single headline number but as an arc of pre-registered challenges, each of which gave the thesis an explicit opportunity to die.
- v0.4a challenged the spec’s discipline as the mechanism. Result: the warrant/lifecycle rendering in the planner’s context was not load-bearing; the upstream substrate transformation was.
- v0.4a.2 challenged the compression-of-raw alternative. Result: on gpt-4o, compression alone did not recover the lift.
- v0.4c1 challenged the model-specific-artifact alternative. Result: the thesis held on all four tested models; the compression-vs-substrate isolation narrowed to model-dependent.
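After three independent opportunities to collapse under pre-registered alternative hypotheses, the surviving statement is the thesis itself: maintained state beats raw reconstruction on this planning task, with the compression-vs-substrate isolation narrowed to model-dependent.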
The credibility of this statement comes from the survival pattern — the thesis was given multiple chances to die and didn’t — not from the magnitude of the headline number.
A second evidentiary line: Gemini thinking telemetry
Gemini Pro is the only thinking-mode model in the set. Its internal thinking tokens vary by arm:
| Arm | Mean thinking tokens (Gemini) |
|---|---|
| A — raw K=20 log | 954 |
| A′ — LLM-compressed raw log | 628 |
| B — LLM-summarized substrate | 713 |
| C — sparse structured state | 691 |
Gemini spent the most thinking tokens on the raw-history arm and fewer on every other arm: about 25–28% fewer on the substrate-derived arms (B and C) and about 34% fewer on compressed raw history (A′). This is a single-model observation, not a controlled comparison. It is consistent with the reconstruction-tax framing: when context already projects current state, the model spends less time internally reconstructing it.
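The relative reductions implied by the telemetry table can be computed per arm. A minimal sketch using only the figures above:

```python
# Mean Gemini thinking tokens per arm, copied from the table above.
thinking = {"A": 954, "A'": 628, "B": 713, "C": 691}

# Percentage reduction in thinking tokens relative to the raw-history arm (A).
reduction_pct = {
    arm: round(100 * (thinking["A"] - tokens) / thinking["A"], 1)
    for arm, tokens in thinking.items()
    if arm != "A"
}

for arm, pct in reduction_pct.items():
    print(f"{arm}: {pct}% fewer thinking tokens than A")
```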
A working hypothesis (not a finding from this experiment): a thinking phase may functionally substitute for some of the substrate transformation. Where a non-thinking model needs the rule-engine-derived view to surface what is currently true, a thinking model can reconstruct from a less-structured prose summary during its thinking phase. The reconstruction tax then shifts from context-time to inference-time rather than disappearing.
Resolving this hypothesis requires a Gemini-thinking-budget ablation, or a within-family comparison to gemini-2.5-flash (non-thinking). That is adjacent future work, not within v0.4c1 scope.
What this proves — and does not
Claims (load-bearing)
- Maintained state beats raw context across four LLMs from three providers. Every B−A and C−A delta is positive on every model. The cross-model averages: 99.0% (Arm C) and 97.3% (Arm B) vs 89.3% (Arm A).
- Sparse structured state is the cross-model Pareto reference. Arm C is the Pareto winner or tied for winner on all four models, on both correctness and input-token cost. The minimum-sufficient-state finding from v0.4a generalizes.
- The thesis survived three pre-registered alternative hypotheses. v0.4a (mechanism), v0.4a.2 (compression), v0.4c1 (model generality) each gave the thesis a chance to collapse. It didn’t.
Narrows
- Compression-vs-substrate isolation is model-dependent. v0.4a.2 measured a clean separation between substrate-derived and compression-of-raw arms on gpt-4o. v0.4c1 shows this separation holds clearly on Opus, partially on GPT-4o, fails on Gemini Pro with thinking, and reverses on Haiku’s raw-log baseline. The substrate-machinery-is-load-bearing claim survives at the planning level; the compression-control isolation narrows.
Does not claim
- Not “maintained state beats raw context in general.” This is a four-model, one-substrate, one-budget result on a single-next-action planning task. The single-substrate caveat is the largest remaining live risk and is what cross-substrate replication (v0.4c2) will address.
- Not “Gemini’s thinking phase compensates for compression.” That is a working hypothesis consistent with the thinking-token telemetry, not a finding. The Gemini-thinking ablation is required to test it directly.
- Not “richer projections never help.” v0.4c1 ran the four-arm subset (A / A′ / B / C); the full mechanism ladder (Arms D and E) was not re-run cross-model. Whether warrant or lifecycle rendering gains value on other models is unmeasured.
- Not “Belief Stack is production-ready.” v0.4c1 is run-complete and locked. Live extraction, multi-step planning, and end-to-end maintenance economics remain open questions.
Design implications
The phrase “maintained state is a planning primitive” now has three independent empirical supports across four LLMs. The substrate transformation appears load-bearing on non-thinking models. The bare-name projection format is the cross-model Pareto reference. The compression-control finding narrows to model-dependent, but the planning lift over raw history is universal across the tested model field.
Arm C is the projection format the spec should center
The minimum-sufficient projection — bare belief_type :: claim, dedup-ranked and budget-bounded — is the reference shape the spec should describe as the planner-facing default. Warrant fields and lifecycle markers belong in the substrate (where rich content is maintained) and in the human-inspection surface (where structure is read for trust), not in the bare planner projection.
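A rough sketch of what a bare, dedup-ranked, budget-bounded projection could look like is given below. Everything beyond the `belief_type :: claim` line format is an assumption made for illustration: the `Belief` fields, the `rank` score, exact-match deduplication (the spec's §3.5a rule is not described here), and the whitespace token counter that stands in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Belief:
    belief_type: str
    claim: str
    rank: float  # hypothetical relevance score; higher is rendered first

def render_projection(
    beliefs: Iterable[Belief],
    token_budget: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Render 'belief_type :: claim' lines: deduplicated, ranked, budget-bounded."""
    seen: set[tuple[str, str]] = set()
    lines: list[str] = []
    used = 0
    for b in sorted(beliefs, key=lambda b: b.rank, reverse=True):
        key = (b.belief_type, b.claim.strip().lower())
        if key in seen:
            continue  # drop duplicates, keeping the higher-ranked copy
        seen.add(key)
        line = f"{b.belief_type} :: {b.claim}"
        cost = count_tokens(line)
        if used + cost > token_budget:
            break  # stop at the first line that would exceed the budget
        lines.append(line)
        used += cost
    return "\n".join(lines)

# Hypothetical beliefs, invented for this example.
example = [
    Belief("constraint", "deploy window closes Friday", 0.9),
    Belief("constraint", "Deploy window closes Friday", 0.5),
    Belief("goal", "ship the billing migration", 0.8),
    Belief("assumption", "staging mirrors production", 0.2),
]
projection = render_projection(example, token_budget=12)
print(projection)
```

No warrant or lifecycle fields appear in the rendered lines; under the design described above, those would stay in the substrate and the human-inspection surface.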
The thinking-mode question is real
Gemini Pro’s compression-equivalence outcome is the most interesting model-specific narrowing the program has surfaced. It is consistent with — and the cleanest signal toward — a hypothesis that thinking phases functionally substitute for some of the substrate transformation by moving the reconstruction tax from context-time to inference-time. This is a research question worth its own experiment.
The remaining publication gate
With cross-model replication complete, the one experiment that remains required for the paper’s strongest empirical claim is cross-substrate replication (v0.4c2). The protocol transfers cleanly; the bottleneck is data sourcing. After v0.4c2, the publication gate identified in the paper-scope-discipline memory is met.
Closing
v0.3 measured a single-model planning lift.
v0.4a showed the lift was not from projection-side discipline.
v0.4a.2 showed it was not from compression alone.
v0.4c1 shows it is not a single-model artifact.