Belief Stack v0.4c1 — pre-registration

Locked design document for the cross-model replication. Four models from three providers; four arms; per-model and cross-model outcome classifiers; locked action commitments. Amended to v0.4c1.1 for two provider-specific behaviors surfaced during build-time API verification — both before any data flowed.

LOCKED · run-complete 2026-06-05 · v0.4c1.1

Companion run-complete report: /research/case-studies/belief-stack-v0-4c1
Predecessors: v0.3 (planning-side experiment); v0.4a.1 (mechanism ablation) and v0.4a.2 (compression control), both run on gpt-4o-2024-08-06.

§0 The reviewer question this experiment addresses

v0.3 → v0.4a → v0.4a.2 ran on one model. Is this a single-model artifact?

The v0.3 / v0.4a / v0.4a.2 program established three pre-registered claims on gpt-4o-2024-08-06: maintained state beats raw context (v0.3); the projection-side discipline is not the mechanism (v0.4a.1); LLM compression alone does not recover the lift (v0.4a.2). The natural next reviewer question is whether the result is a single-model artifact.

v0.4c1 holds the substrate, the evaluation set, the prompts, and the scoring methodology constant, and varies only the generator across four models from three providers. The arm set is the four-arm subset (A / A′ / B / C) — sufficient to test the thesis without re-running the full five-arm mechanism ladder per model.

§1 Research question

Does the maintained-state-beats-raw-context result from v0.3 / v0.4a / v0.4a.2 replicate across LLM model families, and does the v0.4a.2 compression-vs-substrate isolation hold cross-model?

§2 Hypotheses

2.1 Primary (thesis)

The substrate-derived projections (Arm B and Arm C) will beat raw context (Arm A) by ≥ 3 percentage points on every tested model. The cross-model generality of the thesis is the headline test.

2.2 Secondary (mechanism subclaim)

The compression-control finding from v0.4a.2 — that LLM compression of raw log alone does not recover the maintained-state lift — will hold across models. Specifically, Arm B will beat Arm A′ by ≥ 3 percentage points on every tested model.
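
The two hypotheses can be written as predicates over per-model correctness. The sketch below is illustrative only: the `Scores` layout, the arm key `A_prime`, and the function names are assumptions rather than part of the locked design, and correctness is assumed to be expressed in percent.

```python
from typing import Dict

THRESHOLD_PP = 3.0  # the ">= 3 percentage points" bar stated in 2.1 and 2.2

# Hypothetical layout: correctness in percent, keyed by model, then by arm.
Scores = Dict[str, Dict[str, float]]


def primary_holds(scores: Scores) -> bool:
    """2.1: Arm B and Arm C each beat Arm A by >= 3 pp on every model."""
    return all(
        s["B"] - s["A"] >= THRESHOLD_PP and s["C"] - s["A"] >= THRESHOLD_PP
        for s in scores.values()
    )


def secondary_holds(scores: Scores) -> bool:
    """2.2: Arm B beats Arm A' by >= 3 pp on every model."""
    return all(s["B"] - s["A_prime"] >= THRESHOLD_PP for s in scores.values())
```

Both predicates quantify over every tested model, so a single model below the bar is enough to make either one false.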

2.3 Scope discipline (bounded claim)

Bounded

Not: maintained state beats raw context in general.

But: on four LLMs from three providers, on this single substrate, at this single budget, the thesis from v0.3 / v0.4a replicates — and the v0.4a.2 compression-control finding either holds or narrows in a measurable, characterizable way.

§3 Arms and models

3.1 The four arms (no D, no E)

Arm	Source	Format
A	Raw K=20 session log	Raw chronological text + strong baseline reconstruction prompt
A′	Raw K=20 session log	LLM compression of the raw log at the same ~285-token budget as Arm B (compression control)
B	§3.5a-deduped substrate	LLM prose summary of the maintained substrate at the same budget
C	§3.5a-deduped substrate	Bare `belief_type :: claim` per cluster (the Pareto reference from v0.4a)

Arms D and E from the v0.4a mechanism ladder are not re-run cross-model. The v0.4a.1 finding (E ≈ B at the 285-token budget on a single model) is reported as substrate- and model-specific until a separate cross-model mechanism study replicates it. v0.4c1 tests the thesis, not the full ladder.

3.2 The four models

Model	Provider	Locked config (v0.4c1.1)
gpt-4o-2024-08-06	OpenAI	T=0, seed=20260601, full v0.4a parity
claude-opus-4-7	Anthropic	API rejects `temperature` parameter; uses default sampling; no seed (Anthropic API does not support deterministic seed)
gemini-2.5-pro	Google	T=0, seed=20260601, `thinking_budget=2048` (Gemini Pro requires thinking mode); `max_output_tokens` includes thinking tokens — effective output cap is set as `thinking_budget + content_budget`
claude-haiku-4-5-20251001	Anthropic	T=0; no seed

3.3 The v0.4c1 → v0.4c1.1 amendment

During build-time API verification, two provider-specific behaviors surfaced that were not anticipated by the v0.4c1 lock. Both required explicit re-locks before any data flowed.

Anthropic Opus 4.7 rejects the temperature parameter. The API returned a 400 error. Resolution: conditionally omit the parameter for Opus only (Haiku still accepts it). The cross-model parity is honest but not perfect; the divergence is documented and recorded in the per-cell audit.
Gemini Pro requires thinking mode. Setting thinking_budget=0 returned a 400 error. Resolution: keep Gemini Pro with thinking_budget=2048 as a documented paper limitation. The thinking-mode confound is surfaced explicitly in §9 (limitations of the cross-model claim).
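
A minimal sketch of how the locked per-model configuration in 3.2 and the two amendments might be expressed in code. The function name, the provider-neutral dictionary keys, and the `content_budget` parameter are assumptions for illustration; they are not the field names of any provider SDK, and the real dispatchers may be structured differently.

```python
from typing import Any, Dict

SEED = 20260601          # locked seed from the 3.2 table
THINKING_BUDGET = 2048   # locked Gemini thinking budget from the 3.2 table


def request_params(model: str, content_budget: int) -> Dict[str, Any]:
    """Return provider-neutral sampling parameters for one generator call.

    `content_budget` (tokens reserved for visible output) is a hypothetical
    parameter; the document does not state its value.
    """
    if model == "gpt-4o-2024-08-06":
        return {"temperature": 0, "seed": SEED, "max_output_tokens": content_budget}
    if model == "claude-opus-4-7":
        # Amendment 1: temperature is omitted for Opus only; no seed is sent.
        return {"max_output_tokens": content_budget}
    if model == "claude-haiku-4-5-20251001":
        # Haiku still accepts temperature; no seed is sent.
        return {"temperature": 0, "max_output_tokens": content_budget}
    if model == "gemini-2.5-pro":
        # Amendment 2: thinking stays on, and the output cap must cover
        # thinking tokens plus visible content.
        return {
            "temperature": 0,
            "seed": SEED,
            "thinking_budget": THINKING_BUDGET,
            "max_output_tokens": THINKING_BUDGET + content_budget,
        }
    raise ValueError(f"model not in the locked set: {model}")
```

Keeping the divergences in one explicit function, rather than scattered conditionals, is one way to make each re-lock visible in review.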

The discipline is: build-time discoveries become explicit re-locks, not silent reconciliations in code. The amendment log lives in the storm repo at belief_stack_v0_4c1/BELIEF_STACK_PRE_REGISTRATION_v0.4c1.md.

3.4 Anti-curation discipline

Constraint

All 1,200 contexts (75 questions × 4 arms × 4 models) are generated before any answer-generation call is made. All 1,200 answers are generated before any judge call is made. The (qid, arm, model) tuples are shuffled with a fixed seed for both generation and judging, so no cell completes before another starts.
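
The shuffled ordering can be sketched as follows. This assumes integer question ids 0 to 74 and Python's `random.Random` as the shuffler, neither of which the document specifies; the seed value is the one the run plan gives for answer generation.

```python
import random
from itertools import product
from typing import List, Tuple

SEED = 20260601
ARMS = ["A", "A_prime", "B", "C"]
MODELS = [
    "gpt-4o-2024-08-06",
    "claude-opus-4-7",
    "gemini-2.5-pro",
    "claude-haiku-4-5-20251001",
]


def shuffled_tuples(n_questions: int = 75, seed: int = SEED) -> List[Tuple[int, str, str]]:
    """Enumerate every (qid, arm, model) tuple, then shuffle with a fixed seed."""
    tuples = list(product(range(n_questions), ARMS, MODELS))
    # A private Random instance keeps the order reproducible and independent
    # of any other use of the global random state.
    random.Random(seed).shuffle(tuples)
    return tuples
```

Because the order is a pure function of the seed, an interrupted run can rebuild the same sequence and continue where it stopped.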

The same discipline that carried through v0.3 and v0.4a applies unchanged to v0.4c1, scaled to four models.

§4 Substrate and evaluation set

The v0.1 / v0.2.2 / v0.3 / v0.4a substrate is reused unchanged: 164 Claude Code session logs (~20,190 evaluation turns), 13,481 belief instances derived by the v0.1 rule engine (fixtured), and 75 paired single-next-action planning questions across five categories.

The deterministic oracle (score_operational_label.Scorer) provides ground truth per (session, turn, category). The LLM judge (gpt-5-mini-2025-08-07, reasoning_effort=medium, seed=20260601) classifies generator answers under oracle-wins-on-disagreement policy. The judge is held constant across all four phases of the program; only the generator varies.

§5 Primary outcome measure

Planning correctness per (model, arm) cell: the fraction of paired questions on which the generator’s answer does not commit the category-relevant failure mode. Paired n=75 per cell; 4 models × 4 arms gives 16 cells, with 300 scored answers per arm and 1,200 scored answers in total.

Secondary measures: input tokens, output tokens, wall time, judge↔oracle conflict rate. For Gemini specifically: mean thinking tokens per arm (reported as telemetry, not as a controlled comparison).
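
As a rough illustration, the primary measure and the judge↔oracle conflict rate could be aggregated per (model, arm) cell like this. The `Scored` record and its field names are hypothetical; in particular, `failure` is assumed to be the final label after the oracle-wins-on-disagreement policy has already been applied.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Scored(NamedTuple):
    qid: int
    arm: str
    model: str
    failure: bool   # final label: answer commits the category-relevant failure mode
    conflict: bool  # judge and oracle disagreed on this answer


def cell_stats(rows: Iterable[Scored]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate scored answers into per-(model, arm) percentages."""
    counts = defaultdict(lambda: [0, 0, 0])  # [n, correct, conflicts]
    for r in rows:
        c = counts[(r.model, r.arm)]
        c[0] += 1
        c[1] += int(not r.failure)
        c[2] += int(r.conflict)
    return {
        cell: {
            "n": n,
            "correctness_pct": 100.0 * ok / n,
            "conflict_rate_pct": 100.0 * cf / n,
        }
        for cell, (n, ok, cf) in counts.items()
    }
```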

§6 Effect-size thresholds (locked)

≥ 3 percentage points: a meaningful effect (carries forward from v0.4a)
≤ 2 percentage points: noise floor (carries forward from v0.4a)

Effect sizes between 2 and 3 percentage points are reported as directional but not significant. The thresholds were calibrated to v0.4a effect sizes; smaller true effects on stronger raw baselines can fall below them without invalidating directional claims (this is the explicit caveat for the Haiku case in the run-complete report).
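
The thresholds interact with the paired sample size. With n=75, one question is worth 100/75 ≈ 1.33 pp, so a two-question difference (≈ 2.67 pp) lands in the directional band and the ≥ 3 pp bar is first cleared at a three-question difference (4.0 pp). The sketch below makes that arithmetic explicit; the function names are illustrative, and applying the bands to the absolute value of the delta is an assumption.

```python
def delta_pp(correct_x: int, correct_y: int, n: int = 75) -> float:
    """Difference in correctness between two arms, in percentage points."""
    return 100.0 * (correct_x - correct_y) / n


def effect_band(delta: float) -> str:
    """Map a delta (in pp) onto the locked thresholds.

    Assumption: the bands apply to the magnitude of the delta, so the
    sign of the effect is handled by the caller.
    """
    magnitude = abs(delta)
    if magnitude >= 3.0:
        return "meaningful"
    if magnitude <= 2.0:
        return "noise"
    return "directional"
```

For example, `effect_band(delta_pp(42, 40))` falls in the directional band, while `effect_band(delta_pp(43, 40))` is meaningful.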

§7 Pre-registered outcome classifier (locked)

7.1 Per-model classifier (five classes)

Class	Label	Trigger
1	full replication	B and C both beat A and A′ by ≥ 3 pp
2	partial replication	Only one of B or C clears the ≥ 3 pp threshold above both A and A′
3	compression equivalent	A′ reaches B’s correctness (within the 2-pp noise floor) while A′ itself beats A by ≥ 3 pp. v0.4a.2 compression-control does not isolate on this model.
4	no effect	No arm clears any threshold above another; the projection effort is wasted on this model.
5	reversal	A beats B or C by ≥ 3 pp; the thesis fails on this model.
0	unclassified	Deltas don’t fit any class above. Reported honestly with no forced class assignment.
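
One possible mechanical reading of the table above, as a sketch. Two points are assumptions that the table does not settle: classes are tested in table order, so the first matching trigger wins when triggers overlap, and "within the 2-pp noise floor" is read as an absolute difference of at most 2 pp. Inputs are per-arm correctness in percent for a single model.

```python
def classify_model(a: float, a_prime: float, b: float, c: float,
                   effect: float = 3.0, noise: float = 2.0) -> int:
    """Return the per-model class (1-5), or 0 when no trigger matches."""
    b_clears = b - a >= effect and b - a_prime >= effect
    c_clears = c - a >= effect and c - a_prime >= effect
    if b_clears and c_clears:
        return 1  # full replication
    if b_clears or c_clears:
        return 2  # partial replication: exactly one of B or C clears
    if abs(b - a_prime) <= noise and a_prime - a >= effect:
        return 3  # compression equivalent
    arms = [a, a_prime, b, c]
    if max(arms) - min(arms) < effect:
        return 4  # no effect: no arm beats any other by >= effect
    if a - b >= effect or a - c >= effect:
        return 5  # reversal
    return 0      # unclassified
```

A different precedence (for example, testing class 5 before class 2) would change the label on overlapping cases, so whichever order is used should be fixed before data flows.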

7.2 Cross-model classifier

Outcome	Trigger	Locked action commitment
thesis_replicates_universally	All models class 1 or 2; no model class 3, 4, or 5	Paper section: “maintained state beats raw context across LLM families; compression alone does not.”
compression_finding_does_not_generalize	≥ 1 model class 3 (compression equivalent)	Paper section: “maintained state beats raw context across models; compression-vs-substrate distinguishes only on some models.”
thesis_partially_generalizes	≥ 1 model class 4 or 5; majority class 1 or 2	Paper section: “maintained state beats raw context on most but not all tested models; report per-model.”
thesis_does_not_generalize	Majority class 4 or 5	Significant retraction. Paper retitled around the model-specific surface where the effect holds.
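
The cross-model triggers can be evaluated from the list of per-model classes. Because the table does not say which outcome takes precedence when more than one trigger is satisfied, this sketch returns every outcome whose trigger fires rather than picking one. Reading "majority" as strictly more than half of the models is an assumption, as is the function name.

```python
from typing import List


def fired_outcomes(classes: List[int]) -> List[str]:
    """Return the cross-model outcomes whose triggers fire, in table order."""
    n = len(classes)
    n_replicates = sum(k in (1, 2) for k in classes)
    n_class3 = sum(k == 3 for k in classes)
    n_fails = sum(k in (4, 5) for k in classes)

    fired = []
    if n > 0 and n_replicates == n:
        fired.append("thesis_replicates_universally")
    if n_class3 >= 1:
        fired.append("compression_finding_does_not_generalize")
    if n_fails >= 1 and 2 * n_replicates > n:
        fired.append("thesis_partially_generalizes")
    if 2 * n_fails > n:
        fired.append("thesis_does_not_generalize")
    return fired
```

An empty result (for instance when one model is class 0 and the rest are class 1) signals a combination the table does not cover.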

The discipline

The classifier is run mechanically against the per-model deltas after data flows. The action commitment is honored in the run-complete report and in the paper iteration, not negotiated after the fact. The classifier is sized for a specific binary question (does v0.4a.2 generalize?) and does not capture everything the data may also show; that nuance is what the run-complete report adds in interpretation.

§8 What this would prove

If the cross-model classifier returns thesis_replicates_universally, the v0.3 / v0.4a / v0.4a.2 program has both the planning lift and the compression-control isolation as cross-model phenomena. This is the strongest possible result for the cross-model criterion of the publication gate.

If the classifier returns compression_finding_does_not_generalize or thesis_partially_generalizes, the thesis is preserved cross-model while one of the subclaims narrows. Either outcome supports publication with a more carefully worded mechanism section.

§9 What this would not prove

This experiment does not test cross-substrate generality (v0.4c2 scope). The result is bounded to operational workflow substrate (Claude Code session logs). Substrates with different properties — sensemaking, multi-actor coordination, longer planning horizons — may exhibit different trade-offs.

It does not control for thinking-mode behavior on Gemini Pro. The Gemini class-3 outcome (if it fires) will be consistent with — but not isolate — a hypothesis that thinking phases functionally substitute for substrate transformation. Resolving that requires a separate Gemini-thinking-budget ablation or a within-family comparison to gemini-2.5-flash.

It does not re-run the full mechanism ladder (Arms D and E) cross-model. Whether projection-side discipline gains value on other models is unmeasured.

It does not measure end-to-end maintenance economics (substrate-side write-path cost). v0.4c1 is a planning-side measurement only.

§10 Locked language

Use	Avoid
cross-model replication	cross-model proof / universal validation
the substrate transformation appears load-bearing	the substrate transformation is the mechanism
compression-vs-substrate isolation is model-dependent	compression doesn’t work
consistent with the reconstruction-tax framing	proof of reconstruction tax
working hypothesis (not a finding)	demonstrates / confirms
thesis survived a pre-registered challenge	thesis was validated

The discipline is conservative-by-default. Single-observation telemetry (Gemini thinking tokens) is framed as “consistent with” not “proof of.” The compression-control narrowing is acknowledged explicitly when it fires, not minimized. The credibility of the program rests on these word choices.

§11 Run plan (locked)

1. Build all 1,200 contexts. Per-provider dispatchers for OpenAI, Anthropic, Google. Arms A and C are deterministic from substrate; Arms A′ and B require per-model summarization. No answer-generation calls flow until all contexts are persisted.
2. Generate all 1,200 answers. Shuffled (qid, arm, model) order with seed 20260601. Resumable per (qid, arm, model) tuple.
3. Score all 1,200 answers. Deterministic oracle plus LLM judge under oracle-wins-on-disagreement policy. Same judge config as v0.3 and v0.4a.
4. Run the classifier. Mechanical, against the per-model deltas. Outcome triggers the locked action commitment.
5. Write the run-complete report. Lead with the cross-model Pareto headline. Honor the classifier’s caveat in interpretation. Add the Gemini thinking telemetry as a second evidentiary line with appropriate epistemic framing.
6. Iterate the paper to v0.3. Integrate the cross-model phase as a new §4.4. Update limitations and required next experiments. The paper title remains Maintained State as a Planning Primitive: the paper is about a result, not an architecture.
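
The gating and resumability in the first three steps can be sketched as one helper. Everything here is hypothetical: the function name, the use of in-memory dictionaries in place of persisted files, and the `work` callback that stands in for a provider call.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Item = Tuple[int, str, str]  # (qid, arm, model)


def run_phase(items: Iterable[Item],
              store: Dict[Item, str],
              work: Callable[[Item], str],
              requires: Optional[Dict[Item, str]] = None) -> Dict[Item, str]:
    """Run one phase over all items, skipping any item already in `store`.

    If `requires` is given, every item must already be present in it;
    this models "no answer-generation calls flow until all contexts
    are persisted".
    """
    items = list(items)
    if requires is not None:
        missing = [it for it in items if it not in requires]
        if missing:
            raise RuntimeError(f"{len(missing)} prerequisite items missing")
    for it in items:
        if it in store:  # resume: keep results from an earlier run
            continue
        store[it] = work(it)
    return store
```

Under this shape, generation would pass the context store as `requires`, and scoring would pass the answer store, so a phase cannot begin while its predecessor is incomplete.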
