topicspace.ai · research · specification

The Belief Stack

A pattern for AI systems that need to track when assumptions remain valid under changing evidence.

v0.1 · proposed · 2026-06-04 · revised

Author: Susan Stranburg.  Status: open specification, early stage.  One schema (warrant) is pinned at v0.1; additional schemas listed below as forthcoming.
Planned canonical URL: https://topicspace.ai/research/belief-stack

Abstract

topicspace is developing Belief Stack, the maintained-state substrate between evidence and planning. It is an architectural pattern for building belief-state knowledge sources for LLM systems. The Belief Stack is not a framework, not a product, and not a specific implementation. It is a layered contract: what kind of object lives at each layer, what guarantees each layer makes to the layers above and below it, and what operating disciplines distinguish a working stack from a confident hallucination.

The architectural commitment is one substrate, asymmetric consumer projections. A lifecycle-aware substrate — claim + warrant + lifecycle, derived from the event stream and maintained as first-class objects — produces two consumer-specific projections: a sparse state projection for planners (what is currently true, dedup-ranked, budget-bounded) and a rich inspection surface for humans (full warrant chains, lifecycle timelines, audit). Same substrate. Different surfaces. Empirically different optimal shapes.

This spec defines the core primitives, the belief contract (claim + warrant + lifecycle), the operating disciplines, and the schemas needed to instantiate a Belief Stack in a new substrate. It treats specific usage patterns — Sensemaking, Reasoning-trace state management, and Stack-grounded intelligence — as empirical application surfaces, not as the spec itself. The strongest current evidence is in reasoning-trace state management: v0.3 measured an 8-point planning-correctness lift from substrate-derived projections over raw context, and v0.4a.2 ruled out compression as the explanation — LLM-summarized raw context at matched budget does not recover the lift. The substrate transformation is the load-bearing operation; richer projection rendering does not add measurable value at the tested budget on operational substrate. Earned governance is treated as a discipline applied to the optional GOV layer, not as a usage pattern of its own.

What is a belief?

A belief is not the same as truth. A belief can be true, false, partial, useful, harmful, outdated, vague, or implicit.

Fundamentally, a belief is a representation of how things are, or how they are likely to be, that a system is willing to use for interpretation or action.

A person may hold a belief. A group may hold a belief. A model, agent, workflow, or institution may operate from one. The belief does not need to be explicitly stated to matter. Often the most consequential beliefs are visible only through behavior: what the system attends to, what it ignores, what it predicts, and what it treats as safe to do next.

In this specification, a belief becomes operational only when it carries a warrant.

Informally:

belief = claim + authority to use the claim

Formally:

belief = claim + warrant + lifecycle

A claim says what is being asserted about the state of something. A warrant says why the claim is allowed to influence downstream reasoning, given the evidence and the rules for evaluating it. A lifecycle says whether the claim still holds over time — whether it has been reconfirmed, weakened, contradicted, or retired as evidence accumulates.

A label — for example validation_pending — is a handle for the claim, not the claim itself. The label names what kind of thing is being tracked; the claim states what is actually being asserted. Reducing a belief to a label is a category error that the rest of this specification is designed to avoid.

A concrete example, drawn from an assistant-workflow context:

label:     validation_pending
claim:     validation has not yet been observed for the current fix
warrant:   no successful validation tool output exists after the
           most recent fix_attempted in this session
lifecycle: active until a successful validation is observed
           (transition to validation_complete) or until the
           half-life elapses (retire under stale_decay)
authority: confirmed_by_tool | asserted_by_assistant | confirmed_by_user

The label tells you what is being tracked. The belief tells you what is currently being claimed, why, and whether it still holds. A Belief Stack maintains all four pieces as evidence changes over time.

A label is a handle. A belief is a warranted, maintained assertion.

Without a warrant, a claim is only an unattributed assertion. Without a claim, evidence is not yet a belief. Without lifecycle discipline, what would have been a belief becomes a frozen artifact — accurate at one moment and stale ever after.
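The contract above can be sketched as a small data structure. This is a minimal illustration, not a pinned schema: the class names, the rule field, the turn:41 reference, and the choice to treat retired beliefs as non-operational are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Warrant:
    rule: str                            # why the claim may influence reasoning
    evidence_refs: Tuple[str, ...] = ()  # stable references into the evidence
    authority: str = "asserted_by_assistant"


@dataclass(frozen=True)
class Belief:
    label: str                        # handle only, not the claim itself
    claim: str                        # what is being asserted
    warrant: Optional[Warrant] = None
    lifecycle: str = "active"         # e.g. active, weakened, contradicted, retired

    def is_operational(self) -> bool:
        # Per the text above, a belief is operational only with a warrant.
        # Excluding retired beliefs is an added assumption of this sketch.
        return self.warrant is not None and self.lifecycle != "retired"


validation_pending = Belief(
    label="validation_pending",
    claim="validation has not yet been observed for the current fix",
    warrant=Warrant(
        rule="no successful validation tool output after the most recent fix_attempted",
        evidence_refs=("turn:41",),   # hypothetical reference
        authority="confirmed_by_tool",
    ),
)

# Same label, but no warrant: a bare claim, not an operational belief.
bare_claim = Belief(label="validation_pending", claim="validation is pending")

print(validation_pending.is_operational())  # True
print(bare_claim.is_operational())          # False
```

Both objects share the label validation_pending, which is the point: the label alone cannot tell a warranted belief from an unattributed assertion.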

Where this sits

A Belief Stack is the substrate layer between evidence and planning. The substrate is lifecycle-aware: it maintains claims with warrants and lifecycle as first-class objects, derived from the event stream by upstream machinery (filtering to active / weakened / contradicted, dedup-clustering by (type, claim), ranking by recency and authority). The substrate then produces consumer-specific projections: a sparse state projection for planners and a rich inspection surface for humans. Same substrate. Different surfaces. Empirically different optimal shapes (see Empirical status).

[Figure: one substrate, asymmetric consumer projections]

Belief Stack derives consumer-specific projections from a shared substrate. The planner consumes the sparse projection (bare structured names at a small token budget — v0.4a Arm C reached 97.3% planning correctness at ~208 tokens). The human consumes the rich inspection surface (full warrant chains, lifecycle timelines, audit trails). Same substrate; empirically different optimal projections.

The architectural claim

Maintained state is a planning primitive. One substrate, asymmetric consumer projections — sparse for planners, rich for humans. The substrate transformation (lifecycle filtering, dedup, ranking) is the load-bearing operation; v0.4a.2 ruled out compression of raw context as an alternative explanation. The two surfaces are paired consequences of placing belief state in the architectural slot between evidence and planning, not parallel design preferences.
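The substrate transformation named above (lifecycle filtering, dedup, ranking) can be sketched in a few lines. This is a hedged illustration only: the BeliefInstance fields, the integer encoding of authority, the keep-the-most-recent dedup rule, and the recency-then-authority sort order are assumptions, not the spec's pinned behavior.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Lifecycle states that survive filtering, per the text above.
LIVE_STATES = {"active", "weakened", "contradicted"}


@dataclass(frozen=True)
class BeliefInstance:
    belief_type: str
    claim: str
    state: str       # lifecycle state
    last_turn: int   # recency signal (assumed: turn index)
    authority: int   # assumed ordinal encoding, higher is stronger


def derive_view(instances: Iterable[BeliefInstance]) -> List[BeliefInstance]:
    clusters: Dict[Tuple[str, str], BeliefInstance] = {}
    for b in instances:
        if b.state not in LIVE_STATES:
            continue                       # lifecycle filtering
        key = (b.belief_type, b.claim)     # dedup-cluster by (type, claim)
        kept = clusters.get(key)
        if kept is None or (b.last_turn, b.authority) > (kept.last_turn, kept.authority):
            clusters[key] = b              # keep one representative per cluster
    # Rank by recency, then authority (assumed tie-break order).
    return sorted(clusters.values(),
                  key=lambda b: (b.last_turn, b.authority), reverse=True)


instances = [
    BeliefInstance("fix_attempted", "a fix was applied", "active", 12, 1),
    BeliefInstance("fix_attempted", "a fix was applied", "active", 15, 1),
    BeliefInstance("pipeline_running", "the pipeline is running", "retired", 16, 2),
    BeliefInstance("deploy_pending", "deploy awaits validation", "weakened", 14, 2),
]
view = derive_view(instances)
print([(b.belief_type, b.last_turn) for b in view])
# [('fix_attempted', 15), ('deploy_pending', 14)]
```

The retired belief is dropped before any projection is rendered, and the duplicate fix_attempted instances collapse into one entry.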

Agent integration

The diagram above answers what Belief Stack is. The diagram below answers how it integrates with an agent harness. Belief Stack is not the planner and not the executor. It is the maintained-state substrate through which both interact with evolving evidence.

[Figure: agent harness integration, forward flow + feedback loops]

Forward path: evidence → substrate → projection → planner → executor. Return loops: executor outputs update the substrate as belief events (teal) and feed forward as new evidence (muted). The human inspection projection (not shown here) reads from the same substrate, as in the previous diagram.

The argument has five parts. Each part plays a distinct role — claim, named layer, runtime, mechanism, measured outcome — and the next section names them in the order a reader encounters them. Below that, a second table names the ways the pattern is used or governed.

How the pieces fit

Layer	Concept	Role
Thesis	Maintained state is a planning primitive.	The claim. Backed empirically by v0.3 — a compact maintained belief overlay outperformed larger raw-context bundles on planning tasks at a fraction of the tokens and latency.
Product	Belief Stack	This spec. The named architectural layer — claims with warrants and lifecycle, queryable by both humans and agents.
Architecture	TKOS	Temporal Knowledge Operating System — the runtime engineering that powers Belief Stack. Sub-50ms inline consultation, MCP gateway, deterministic fallback for uncalibrated queries. Implementation, outside this spec.
Belief state	Claims · warrants · lifecycle	What lives inside Belief Stack — the mechanism. Each belief carries provenance, confidence, applicability boundary, and a lifecycle stage (born → strengthened → weakened → contradicted → retired).
Result	Reduced reconstruction tax	The measured outcome. Agents stop re-deriving the world model from raw evidence at every step. Planning calls run against a maintained state instead of reconstructing one.

Earlier drafts of this work also used the broader framing REPI (Runtime Epistemic Infrastructure) as a proposed category direction. That framing remains useful in positioning conversations where the goal is to argue the category should exist at all, but it has been folded into the more concrete vocabulary above. This spec is self-contained and does not require REPI as a prerequisite.

Usage patterns and disciplines

Name	Role	Example
Sensemaking	Usage pattern — field-shape tracking across many observations over time.	Markets / AI ecosystem narrative intelligence (sensemaking-v1)
Reasoning-trace state management	Usage pattern — active-workflow warrant tracking inside a single ongoing reasoning trace.	Assistant log replay (tkos-log-replay-v1)
Stack-grounded intelligence	Usage pattern — LLM systems answer from maintained belief states and their warrants, rather than raw retrieved documents alone.	Belief-grounded RAG alternative (case study pending)
Earned governance	Discipline applied to the optional GOV layer — governance rules must prove measurable value before deployment.	See Earned Governance for Runtime LLM Intervention

The distinction

TKOS tracks whether a belief still has authority; governance decides whether that authority is enough to act.

A reader interested only in the architectural pattern can ignore REPI and the TKOS implementation entirely. This spec stands on its own. The three usage patterns and the earned-governance discipline are reported separately so each can be evaluated against its own substrate.
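The distinction between tracking authority and deciding whether it is enough to act can be made concrete with a short sketch. The exponential half-life form, the numeric authority scale, and the function names below are assumptions for illustration; the spec does not pin a decay function.

```python
def decayed_authority(base: float, age_turns: float, half_life_turns: float) -> float:
    """Authority remaining after age_turns, assuming exponential half-life decay."""
    if half_life_turns <= 0:
        raise ValueError("half_life_turns must be positive")
    return base * 0.5 ** (age_turns / half_life_turns)


def may_intervene(base: float, age_turns: float, half_life_turns: float,
                  required: float) -> bool:
    """Governance-side check: is the remaining authority enough to act?"""
    return decayed_authority(base, age_turns, half_life_turns) >= required


# Tracking side: how much authority is left at firing time.
print(decayed_authority(1.0, age_turns=10, half_life_turns=10))          # 0.5
# Governance side: the same belief, two half-lives old, fails a 0.5 bar.
print(may_intervene(1.0, age_turns=20, half_life_turns=10, required=0.5))  # False
```

Keeping the two functions separate mirrors the split in the text: the first is a property of the belief, the second is a policy choice made by the consumer.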

Empirical anchors

Two v1 case studies apply this specification to very different substrates. “Reading the TopicSpace Belief Field” applies the pattern to the TopicSpace sensemaking pipeline over 173 days, 42 actors, and 2,100 expectations (a soft / latent substrate). “Watching an Assistant Forget” applies the pattern to TKOS log-replay over 164 Claude sessions, 20,190 evaluation turns, and 11,262 belief instances (a typed / operational substrate), with preregistered v0.1 and v0.2 intervention catalogs and head-to-head TP/FP/FN/TN accounting.

The pattern at a glance

Layer capabilities — TL;DR

L0	— evidence retrieval
L1	— warranted organization
L2	— current priors / belief return
L3	— lifecycle and revision
L4	— calibration against outcomes
GOV	— optional downstream action discipline

A Belief Stack can function as a complete knowledge source through L0–L4. GOV is an optional consumer layer for systems that need action gating, intervention eligibility, or deployment discipline. The table below elaborates each layer; the rest of the spec defines the contracts.

The pattern operates over an evidence field — the space of observations relevant to a system's task. The layers organize how that field is interpreted, acted on, and revised over time. Each layer is a contract over the layer below.

Layer	Carries	Responsibility
L0 Evidence	Raw observations from the evidence field. Ordered, addressable, provenance-bearing.	Storage and provenance. No interpretation.
L1 Regions	Labels + warrants over L0	Partition input space into contextually meaningful regions. Each assignment carries the evidence backing it.
L2 Priors	Per-region expectations	What is likely / acceptable / typical inside a region. Predictions, distributions, invariants, thresholds.
L3 Lifecycle	State and trajectory of beliefs over time	Beliefs are born, age, are contradicted, and retire. L3 tracks where each belief sits in that lifecycle.
L4 Calibration	Predicted-vs-actual measurement	Continuous: did the priors hold up against held-out reality? Pushes back against stale L2 expectations.
GOV Governance	Intervention eligibility (optional)	Optional consumer layer. When present, an intervention may deploy only after its warrant survives a check against current evidence; an intervention is never eligible by declaration alone. Systems that only need to serve maintained beliefs can omit GOV entirely.

Field assumptions

The pattern makes minimum assumptions about the evidence field, but it makes a few:

Ordinal. Observations have a defined order — usually temporal (a timestamp), sometimes a sequence position (turn index, processing step). Without an order, L3 lifecycle has no signal to track and warrant decay has no clock to decay against.
Identifiable. Each observation has a stable reference (row ID, JSON path, turn number) so L1 warrants can carry evidence_refs. Without this, invariant warrants cannot validate.
Provenance-bearing. Each observation knows its source. Without provenance, calibration cannot distinguish “this signal was wrong” from “this signal aged out.”

The field does not need to be structured, complete, or continuous. The spec only requires that what is observed is ordered, addressable, and attributed.
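The three field assumptions can be checked mechanically before any L1 work begins. The sketch below is illustrative: the Observation fields and the check_field helper are hypothetical names, and it assumes observations arrive as a list that should already be in non-decreasing order.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Observation:
    ref: str      # stable reference (row ID, JSON path, turn number)
    order: int    # timestamp or sequence position
    source: str   # provenance
    payload: str  # uninterpreted content


def check_field(observations: Iterable[Observation]) -> List[str]:
    """Return a list of field-assumption violations (empty means the field is usable)."""
    problems: List[str] = []
    seen = set()
    previous_order = None
    for obs in observations:
        if not obs.ref or obs.ref in seen:
            problems.append(f"not identifiable: {obs.ref!r}")
        seen.add(obs.ref)
        if not obs.source:
            problems.append(f"no provenance: {obs.ref!r}")
        if previous_order is not None and obs.order < previous_order:
            problems.append(f"out of order: {obs.ref!r}")
        previous_order = obs.order
    return problems


good = [Observation("turn:1", 1, "assistant", "ran tests"),
        Observation("turn:2", 2, "tool", "tests passed")]
bad = [Observation("turn:2", 2, "tool", "tests passed"),
       Observation("turn:2", 1, "", "unknown origin")]

print(check_field(good))       # []
print(len(check_field(bad)))   # 3
```

Note that the check says nothing about structure or completeness of the payload, matching the text: only order, addressability, and attribution are required.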

Worked examples

Two instantiations across very different substrates, illustrative not exhaustive:

Layer	Markets narrative intelligence	TKOS log-replay over assistant sessions
L0	Timestamped news + price events per actor	Timestamped assistant turns with role, tool calls, and tool results
L1	Learned narrative clusters from event embeddings	Seven typed operational regions (data fetch, pipeline run, failure diagnosis, validation, deploy readiness, report generation, evidence sealing)
L2	Per-cluster directional priors (narrative-vs-price divergence scores)	Per-region state-level beliefs with decaying warrants (eight beliefs: pipeline_running, fix_attempted, deploy_pending, etc.)
L3	Region lifecycle: stable / weakening / turbulent / recovering	Belief lifecycle: born / refreshed / contradicted / stale-decay-retired
L4	Walk-forward calibration of priors against realized price moves	TP / FP / FN / TN labeling against 5-turn-lookahead retrospective ground truth
GOV	Suppression of low-warrant signals from daily outputs	Intervention applicability gates against decayed belief authority at firing time

The two substrates instantiate the same six layers but with very different L1 representations (learned narrative clusters vs. typed operational regions) and different warrant shapes. The spec carries; the instantiation choices follow from substrate properties.

Empirical results are summarized later in Empirical status. Current evidence spans a latent narrative substrate (markets) and a typed operational substrate (assistant workflows), with mixed results that informed this revision.

Core primitives

The smallest set of definitions a Belief Stack implementation must honor. The first primitive, Belief, is the formal restatement of the plain-language definition above. Specific schemas follow in a later section.

Belief

Formal restatement of the section above. A maintained, warranted assertion. A belief carries a claim (what is being asserted about the state of something), a warrant (the evidence and the rules under which the claim is allowed to stand), and a lifecycle (whether the claim still holds, given the current evidence). A label may serve as the handle for a belief, but a belief is more than its label. A claim without a warrant is a bare claim; a warrant without a claim is unattributed evidence; a maintained claim without lifecycle discipline becomes a frozen artifact rather than an active belief.

Warrant

The evidence backing a label. The shape of a warrant varies by substrate (see Substrate-awareness below), but every warrant carries enough fields that a downstream consumer can verify whether the label is defensible. At minimum: a coverage status (does this label apply here?), a support count (how many observations back it?), and a confidence or distance measure.

Region

The L1 unit of organization. A region is a partition of input space defined either by learned structure (clustering, density estimation) or by a priori typology (named operation types, declared categorical schemes). A region carries its own characteristic warrants and priors.

Prior

The L2 unit of expectation. A prior is what the system expects to see given a region (and optionally a lifecycle state). Priors can be probabilistic (distributions over outcomes), structural (invariants that must hold), or behavioral (rules of engagement).

Lifecycle state

The L3 unit. A label of where a belief currently sits in its arc: born, active, strengthened, weakened, contradicted, retired. Lifecycle state mediates the authority of the belief in downstream reasoning.
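One way to keep lifecycle states honest is to make the allowed moves explicit. The state names below come from the definition above; the transition table itself is a hypothetical choice for illustration, since this spec does not pin which moves are legal.

```python
from enum import Enum


class LifecycleState(Enum):
    BORN = "born"
    ACTIVE = "active"
    STRENGTHENED = "strengthened"
    WEAKENED = "weakened"
    CONTRADICTED = "contradicted"
    RETIRED = "retired"


S = LifecycleState

# Hypothetical transition table: an assumption of this sketch, not the spec.
ALLOWED = {
    S.BORN: {S.ACTIVE, S.CONTRADICTED, S.RETIRED},
    S.ACTIVE: {S.STRENGTHENED, S.WEAKENED, S.CONTRADICTED, S.RETIRED},
    S.STRENGTHENED: {S.WEAKENED, S.CONTRADICTED, S.RETIRED},
    S.WEAKENED: {S.ACTIVE, S.CONTRADICTED, S.RETIRED},
    S.CONTRADICTED: {S.RETIRED},
    S.RETIRED: set(),          # terminal in this sketch
}


def transition(current: LifecycleState, target: LifecycleState) -> LifecycleState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = transition(S.BORN, S.ACTIVE)
state = transition(state, S.WEAKENED)
print(state.value)  # weakened
```

Rejecting illegal moves at write time means any state read back from the substrate is reachable by a recorded sequence of events.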

The L1 representation contract

The most load-bearing rule in the spec, and the one most often violated in practice:

Every L1 assignment must carry both a label and a warrant. They must be emitted together, not assigned separately.

A classifier that emits labels without warrants is broken regardless of how confident the labels look. A classifier that emits labels alongside warrants — even imperfect warrants — is inspectable, falsifiable, and able to participate in downstream coverage checks. A common failure mode in assistant log-replay pipelines — labeling a turn as a “validation failure” without checking whether a validation belief was actually active and warranted at that moment — is a clean case of label-without-warrant production.

Three operating disciplines follow from the contract:

No label without warrant. Every region assignment must emit at least the minimal warrant fields. If the warrant cannot be produced, the assignment is UNCLASSIFIED, not a default class.
No priors outside coverage. If a region assignment's warrant indicates OUT_OF_DISTRIBUTION, no L2 prior may be emitted for that assignment. The correct output is silence, not a fallback global rate.
Nearest-neighbor assignment is not validity. Forcing every input into the closest trained region (the default behavior of K-means and similar) is not the same as saying the input belongs there. Coverage must be checked, not assumed.
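The three disciplines can be sketched together for a learned-region substrate. This is a minimal illustration under stated assumptions: Euclidean distance to named centroids, a single global coverage_threshold, and the helper names assign_region and read_prior are all hypothetical. Labeling an out-of-coverage input UNCLASSIFIED while keeping its distance in the warrant is one possible choice among several.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Assignment:
    label: str                             # region name or "UNCLASSIFIED"
    coverage: str                          # "IN_COVERAGE" or "OUT_OF_DISTRIBUTION"
    distance_to_centroid: Optional[float]  # warrant field
    support_n: int                         # warrant field


def assign_region(point: Sequence[float],
                  centroids: Dict[str, Sequence[float]],
                  support: Dict[str, int],
                  coverage_threshold: float) -> Assignment:
    if not centroids:
        # No warrant can be produced, so the result is UNCLASSIFIED.
        return Assignment("UNCLASSIFIED", "OUT_OF_DISTRIBUTION", None, 0)
    label, dist = min(((name, math.dist(point, c)) for name, c in centroids.items()),
                      key=lambda pair: pair[1])
    if dist > coverage_threshold:
        # Nearest is not the same as valid: coverage is checked, not assumed.
        return Assignment("UNCLASSIFIED", "OUT_OF_DISTRIBUTION", dist, 0)
    return Assignment(label, "IN_COVERAGE", dist, support.get(label, 0))


def read_prior(assignment: Assignment, priors: Dict[str, float]) -> Optional[float]:
    # No priors outside coverage: return silence, not a fallback global rate.
    if assignment.coverage != "IN_COVERAGE":
        return None
    return priors.get(assignment.label)


centroids = {"a": (0.0, 0.0), "b": (10.0, 0.0)}
support = {"a": 120, "b": 45}
priors = {"a": 0.7, "b": 0.2}

near = assign_region((1.0, 0.0), centroids, support, coverage_threshold=2.0)
far = assign_region((5.0, 0.0), centroids, support, coverage_threshold=2.0)
print(near.label, read_prior(near, priors))  # a 0.7
print(far.label, read_prior(far, priors))    # UNCLASSIFIED None
```

A plain nearest-centroid classifier would have assigned the second point to a region and returned that region's prior; the coverage gate is what turns that into silence.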

Substrate-awareness

The Belief Stack pattern is invariant across domains. The L1 instantiation is not — it must match the substrate's underlying properties. Forcing a clustering approach onto a substrate with hard typed operations, or forcing a typed ontology onto a substrate with latent narrative drift, is a category error.

Substrate	L1 shape	Warrant fields that matter
Latent / continuous / drifting (markets, narratives, attention)	Learned regions via clustering on embeddings	distance_to_centroid, coverage_threshold, support_n
Trajectory-bearing / multi-role (multi-turn dialogue systems, multi-agent coordination traces)	Role-specific embeddings + drift state (a separable L1.2 axis)	distance_to_centroid, drift_state, role_view, turn_index
Typed / discrete / invariant-bearing (long-running assistant workflows, engineering pipeline operations)	A priori operational typology (named operation types)	evidence_refs, validation_status, inputs

Diagnostic principle

Before picking an L1 representation for a new substrate, ask: are the dimensions of variation latent or named? Are invariants soft (statistical) or hard (arithmetic/logical)? Is the substrate trajectory-bearing or static? Match the L1 shape to the answers. The discipline (claim + warrant + lifecycle, coverage gates) is what carries across; specific clustering / typology choices should be re-derived from substrate properties each time.

Applicability

The pattern earns its complexity in problems with all five of the following:

Events repeat. L1 needs enough observations to find regions.
Patterns are meaningful. If observations are essentially i.i.d. noise, regions are spurious and L2 has nothing real to predict from.
Expectations are possible. L2 requires that something can be predicted before it happens. If the question is purely what is true now?, this pattern is overkill.
Outcomes are measurable. L4 needs a signal to score predictions against. Without it, calibration is impossible.
Enough history exists. Walk-forward calibration needs a train/test split. A few weeks of data per region is the practical floor.

The pattern does not fit:

One-off decisions. Nothing to cluster, nothing to feed back. Use judgment.
Low-sample problems. Per-region priors become memorization, not prediction.
Pure taste / aesthetics. No outcome to score against; L4 has nothing to do.
Persuasion-driven problems. The goal is to change beliefs, not to track them. Different problem class.
Unreliable labels. L4 calibration only works when the outcome signal is trustworthy. Noisy or biased proxies make every belief look equally justified.

The honest version: the pattern is most useful when you are already going to have beliefs about something and want those beliefs checked against reality systematically. Where there is no belief, no check, or no reality, the architecture is just expensive structure.

The feedback loop

The Belief Stack is not only a forward path from evidence to belief return. It is a revision loop.

L0 observations are organized into warranted L1 assignments. L2 reads those assignments as priors or current belief returns. L3 tracks whether those beliefs are born, reinforced, weakened, contradicted, or retired over time. L4 measures whether the priors and belief states held up against outcomes, then feeds that evidence back into the stack: reweight L2 priors, adjust L1 coverage thresholds, revise lifecycle rules, or withhold beliefs whose warrant no longer survives calibration.

This feedback loop is what separates a Belief Stack from static retrieval or static classification. A retrieval system can return what was observed. A classifier can assign a label. A Belief Stack must be able to change what it is willing to believe as evidence accumulates.
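The revision step can be sketched as a function that compares what each region predicted with what happened, then withholds priors that no longer hold up. The gap measure (difference between mean predicted probability and observed rate), the tolerance parameter, and the use of None to mean "withheld" are assumptions of this example, not the spec's calibration method.

```python
from typing import Dict, List, Optional, Tuple


def calibration_gap(predicted: List[float], outcomes: List[int]) -> float:
    """Absolute gap between mean predicted probability and observed rate."""
    if not predicted or len(predicted) != len(outcomes):
        raise ValueError("predicted and outcomes must be non-empty and equal length")
    return abs(sum(predicted) / len(predicted) - sum(outcomes) / len(outcomes))


def revise_priors(priors: Dict[str, float],
                  history: Dict[str, Tuple[List[float], List[int]]],
                  tolerance: float) -> Dict[str, Optional[float]]:
    """Keep priors that survive calibration; withhold (None) those that do not."""
    revised: Dict[str, Optional[float]] = {}
    for region, prior in priors.items():
        predicted, outcomes = history.get(region, ([], []))
        if not outcomes:
            revised[region] = prior   # no outcome evidence yet: leave unchanged
        elif calibration_gap(predicted, outcomes) > tolerance:
            revised[region] = None    # warrant did not survive calibration
        else:
            revised[region] = prior
    return revised


priors = {"r1": 0.8, "r2": 0.8}
history = {
    "r1": ([0.8] * 5, [1, 1, 1, 1, 0]),  # observed rate 0.8: holds up
    "r2": ([0.8] * 5, [0, 0, 0, 0, 1]),  # observed rate 0.2: does not
}
print(revise_priors(priors, history, tolerance=0.1))
# {'r1': 0.8, 'r2': None}
```

A fuller implementation might reweight the prior toward the observed rate or adjust L1 coverage thresholds instead of withholding; this sketch shows only the simplest of the revisions listed above.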

Same substrate, asymmetric consumer projections

This is an architectural commitment, not just a design preference. The substrate has two natural consumers — a planner and a human — and v0.4a measured that their optimal projections are empirically different, not just different in style. The planner's optimal projection is sparse; the human's is rich. Both project from the same substrate, but the rendering work and the token budget asymmetrically favor each consumer's needs.

Planner projection (sparse). Bare structured names of currently-active beliefs (belief_type :: claim per cluster), dedup-ranked, budget-bounded. v0.4a measured that ~208 tokens is sufficient at 97.3% planning correctness on operational substrate; adding warrant fields or lifecycle markers to this surface did not improve correctness at the tested budget.
Human inspection projection (rich). Browsable, time-traveled views with full warrant chains, lifecycle timelines, audit trails. What an operator needs to understand, at any past turn, what the system believed and why. Detail demand here is large — and v0.4a's sparse-planner result does not generalize backwards to the human side.

Both projections read from the same belief-state substrate, the same lifecycle events, and the same warrants. The split is in rendering, not in source of truth. The human surface is not a dashboard bolted onto the side; it is a peer query surface over the same lifecycle-aware substrate. A Belief Stack that serves only the planner is half a system; a trace viewer that does not share the substrate with the runtime is a different system entirely.
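The claim that the split is in rendering, not in source of truth, can be illustrated with two functions over the same list of beliefs. The dictionary keys, the whitespace-split token count (a crude stand-in for a real tokenizer), and the output layout are assumptions for this sketch.

```python
from typing import Dict, List


def planner_projection(beliefs: List[Dict], token_budget: int) -> str:
    """Sparse surface: 'belief_type :: claim' lines within a token budget."""
    lines, used = [], 0
    for b in beliefs:                  # assumed already dedup-ranked
        line = f"{b['belief_type']} :: {b['claim']}"
        cost = len(line.split())       # crude token proxy (assumption)
        if used + cost > token_budget:
            break
        lines.append(line)
        used += cost
    return "\n".join(lines)


def inspection_projection(beliefs: List[Dict]) -> str:
    """Rich surface: claim plus warrant and lifecycle timeline, no budget."""
    blocks = []
    for b in beliefs:
        timeline = " -> ".join(f"t{turn}:{state}" for turn, state in b["timeline"])
        blocks.append(
            f"{b['belief_type']} :: {b['claim']}\n"
            f"  warrant: {b['warrant']}\n"
            f"  lifecycle: {timeline}"
        )
    return "\n".join(blocks)


substrate = [
    {"belief_type": "validation_pending",
     "claim": "validation has not yet been observed",
     "warrant": "no successful validation output since last fix",
     "timeline": [(3, "born"), (7, "weakened")]},
    {"belief_type": "fix_attempted",
     "claim": "a fix was applied",
     "warrant": "edit tool call at turn 2",
     "timeline": [(2, "born")]},
]

print(planner_projection(substrate, token_budget=10))
print(inspection_projection(substrate))
```

With a budget of 10 the planner surface carries only the first belief, while the inspection surface renders both with their full history. Neither function mutates or copies the substrate.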

Demonstrated at fixture level + empirically anchored on planner side

A small read-path slice (8 belief instances, 12-turn coding-assistant session, SQLite-backed) implements both projections over one substrate and passes six acceptance tests. State reconstruction at any past turn is replayed from a lifecycle audit trail rather than read from a snapshot table. Both consumer-specific projections route through the same reconstruction query and differ only in rendering. This demonstrates substrate-side shape compatibility with both consumers without forking.
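Replaying state from an audit trail rather than a snapshot table can be sketched with Python's built-in sqlite3 module. The table name, column names, and sample events below are hypothetical and are not the fixture's actual schema; the sketch also assumes at most one lifecycle event per belief per turn.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def open_trail(events: Iterable[Tuple[str, int, str]]) -> sqlite3.Connection:
    """Create an in-memory audit trail of (belief_id, turn, state) events."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lifecycle_events ("
        " belief_id TEXT NOT NULL, turn INTEGER NOT NULL, state TEXT NOT NULL,"
        " PRIMARY KEY (belief_id, turn))"
    )
    conn.executemany("INSERT INTO lifecycle_events VALUES (?, ?, ?)", events)
    return conn


def state_at(conn: sqlite3.Connection, turn: int) -> Dict[str, str]:
    """Reconstruct non-retired belief states as of a past turn by replay."""
    rows = conn.execute(
        """
        SELECT e.belief_id, e.state
        FROM lifecycle_events AS e
        JOIN (SELECT belief_id, MAX(turn) AS last_turn
              FROM lifecycle_events
              WHERE turn <= ?
              GROUP BY belief_id) AS m
          ON e.belief_id = m.belief_id AND e.turn = m.last_turn
        WHERE e.state != 'retired'
        ORDER BY e.belief_id
        """,
        (turn,),
    ).fetchall()
    return dict(rows)


conn = open_trail([
    ("fix_attempted", 2, "born"),
    ("validation_pending", 3, "born"),
    ("fix_attempted", 5, "contradicted"),
    ("validation_pending", 7, "retired"),
])
print(state_at(conn, 4))  # {'fix_attempted': 'born', 'validation_pending': 'born'}
print(state_at(conn, 8))  # {'fix_attempted': 'contradicted'}
```

Because both consumer projections would call the same state_at query, a planner view and a human view of turn 4 cannot disagree about what was believed, only about how it is rendered.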

The planner projection is now empirically backed by Belief Stack v0.3 (a 285-token belief overlay outperformed a 2,037-token raw workflow log by 8 percentage points on planning correctness, at 3.2× lower latency), refined by v0.4a (the bare-structured-names projection at ~208 tokens reached the same correctness; richer renderings did not separate), and replicated across four LLM families in v0.4c1 (sparse maintained-state projections were the strongest average planning surface across the model field — 99.0% correctness at 241 mean input tokens vs 89.3% at 2,502 for raw history). The model was not under-informed by the smaller context; it was overburdened by reconstruction in the larger one. v0.4a.2 ruled out compression as an alternative explanation on a single model by showing that LLM-summarized raw context at matched budget does not recover the lift; v0.4c1 then showed that this compression-vs-substrate isolation is model-dependent — clear on Opus, partial on GPT-4o, failing on Gemini Pro with thinking, reversed on Haiku’s raw-log baseline — while the underlying thesis (maintained state beats raw reconstruction) held on every tested model.

The human projection's optimal shape is unmeasured but structurally distinct: full warrant chains and lifecycle timelines, not budget-bounded sparse names. The asymmetry between the two projections is the architectural commitment.

Four value surfaces

The asymmetric-projection commitment produces four observable consequences from one root property. Maintained state in the substrate is what every consequence derives from; the consequences are not four independent value props that happen to coexist. Each surface aligns with a different reader — economics for platform owners, latency for product owners, observability for risk/governance, performance for engineers.

1. Economics

Partially measured. Across four LLMs from three providers in v0.4c1, sparse substrate-derived projections used 241 mean input tokens vs 2,502 for raw history — roughly one-tenth the input. Per-call token economics is directly measured. Net end-to-end economics depends on extraction, storage, and maintenance costs that the planning-side experiments do not measure; v0.4b is scoped to close that frontier.

2. Latency

Measured. In v0.3, the maintained-state arm answered planning questions in ~31% of the raw-context arm's wall time (3.2× faster). In v0.4c1, the cross-model average showed ~1.4× lower per-call wall time. Users experience latency directly — a user doesn't know that 2,000 tokens were saved; they know whether the agent took 8 seconds or 2 seconds.

3. Observability

Structural. The substrate preserves warrants and lifecycle history for human inspection while the planner consumes sparse names. The audit surface is an architectural property of the asymmetric-projection design — it exists whether or not the planner uses the sparse projection on a given turn. Governance outcomes (does this audit surface help a human catch errors, satisfy a compliance requirement, or support post-incident review?) are not yet measured and are reported as a separate research direction rather than as a claimed result.

4. Performance

Measured, bounded. Across four models, the maintained-state arm reached 99.0% planning correctness vs 89.3% for raw history on 75 paired single-next-action planning questions. The lift held directionally on every model tested. Results are bounded to one operational substrate (Claude Code session logs); the single-substrate caveat is what v0.4c2 cross-substrate replication is designed to address.

The canonical value statement

Belief Stack externalizes current state from the context window. That creates four benefits: smaller planner inputs, faster planning per call, a human-inspectable state substrate, and better planning correctness on measured operational tasks. Latency and planning correctness are directly measured. Input-token reduction is measured, but net end-to-end economics depends on cost terms still open. Inspectability is an architectural property that still needs governance-outcome validation.

Net-value frontier

The value surfaces supply the three terms above the line of a net-value equation. The three terms below the line determine whether the architecture is net-positive at scale.

Net value =
  + savings from smaller planner calls       (measured)
  + savings from fewer bad actions / retries (measured)
  + value of auditability                    (structural)
  − cost of belief extraction                (open)
  − cost of storage / retrieval              (open)
  − cost of maintenance / updates            (open)

The planning-side experiments in this program measure the two savings terms directly. The auditability term is structural — it is a property of the asymmetric-projection design rather than a measured governance outcome. The three cost terms below the line depend on the extraction method (deterministic rule engine vs LLM-driven), on storage scale, and on lifecycle runtime. They are not yet measured on a real workload.
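The status of the equation can be made explicit in code by letting unmeasured terms stay unmeasured. In this sketch, None stands for an open term, the function name and parameters are hypothetical, and the numbers in the usage lines are placeholders in arbitrary units, not results.

```python
from typing import Optional


def net_value(savings_planner_calls: float,
              savings_fewer_retries: float,
              value_auditability: float,
              cost_extraction: Optional[float],
              cost_storage: Optional[float],
              cost_maintenance: Optional[float]) -> Optional[float]:
    """Return net value, or None while any cost term is still open."""
    costs = (cost_extraction, cost_storage, cost_maintenance)
    if any(c is None for c in costs):
        return None   # refuse to report a number built on unmeasured terms
    benefits = savings_planner_calls + savings_fewer_retries + value_auditability
    return benefits - sum(costs)


# Placeholder inputs only: cost terms still open, so no number is claimed.
print(net_value(10.0, 4.0, 1.0, None, None, None))  # None
# Once every cost term is supplied, the equation resolves.
print(net_value(10.0, 4.0, 1.0, 3.0, 2.0, 1.0))     # 9.0
```

Returning None rather than defaulting open costs to zero keeps the calculation aligned with the position stated above: the upside is measured, and the downside is not yet quantified.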

The honest framing for what is known today:

Measured: faster planning per call; better planning correctness on operational tasks; smaller input tokens per call.
Structural: human inspectability of the substrate.
Open: net runtime economics (extraction, storage, maintenance cost terms).

v0.4b is the experiment scoped to measure the open terms on a real workload. Until it lands, the program does not claim a specific net-value number; it claims the upside is measurable and the downside is what v0.4b will quantify.

Operational split: inline vs batch

A practical note that bites once you actually build this: the layers do not all want to run at the same cadence. They split cleanly into two paths along an operational-cost / latency boundary.

Path	Layers	What runs per call	Cadence
Inline	L0 → L1 → L2	Event arrives → assign to region → read region prior → emit prediction with warrant	Per-event, millisecond latency
Batch	L3 → L4	Gather historical folds → compute calibration per region → update lifecycle states → re-weight L2 or re-cluster L1 if warranted	Out-of-band, scheduled (nightly, weekly)

The inline path is light and mostly stateless: a region assignment plus a read against a per-region prior table. Production code can call it on every event without materially adding latency.

The batch path is heavy and stateful — it needs accumulated history, multi-row window functions, and (sometimes) re-clustering. Running it inline would either kill performance or starve it of the multi-event context it actually needs. Out-of-band keeps the production system responsive and lets the calibration work breathe.

Implementation note

The split also explains why the feedback loop (L4 → L2 / L1) is inherently slower than the forward path. The system reads its priors inline at full speed; it revises them on a separate cadence. Conflating the two — trying to re-cluster or re-calibrate on every event — is the most common implementation mistake.
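The two cadences can be sketched as separate methods on one object: a cheap read on every event, and a heavier recomputation run out of band. The class name, the observed-rate update rule, and the in-memory outcome log are assumptions of this example; a real batch path would also handle lifecycle updates and possible re-clustering.

```python
from typing import Dict, List, Optional, Tuple


class SplitStack:
    """Inline reads and batch revision over one per-region prior table."""

    def __init__(self, priors: Dict[str, float]) -> None:
        self._priors = dict(priors)
        self._outcomes: List[Tuple[str, int]] = []

    def predict(self, region: str) -> Optional[float]:
        # Inline path: a table read, no recomputation.
        return self._priors.get(region)

    def record_outcome(self, region: str, outcome: int) -> None:
        # Inline path: append only; calibration is deferred.
        self._outcomes.append((region, outcome))

    def recalibrate(self) -> None:
        # Batch path: fold accumulated outcomes into the priors, then reset.
        totals: Dict[str, Tuple[int, int]] = {}
        for region, outcome in self._outcomes:
            hits, n = totals.get(region, (0, 0))
            totals[region] = (hits + outcome, n + 1)
        for region, (hits, n) in totals.items():
            self._priors[region] = hits / n   # assumed rule: observed rate
        self._outcomes.clear()


stack = SplitStack({"validation": 0.5})
for outcome in (1, 1, 1, 0):
    stack.record_outcome("validation", outcome)

print(stack.predict("validation"))  # 0.5 (priors are not revised inline)
stack.recalibrate()                 # scheduled, out of band
print(stack.predict("validation"))  # 0.75
```

The prediction stays at 0.5 until recalibrate runs, which is the lag described above: the forward path reads at full speed and the revision arrives on its own schedule.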

Empirical status

The usage patterns below are tracked through empirical anchors — pre-registered case studies and prototypes under lock / run-once discipline — so the spec reflects what has been tested, what failed, and what remains open. Full reports live in the case studies; this section is the spec's running honest position.

Summary (post-v0.4a, 2026-06-04)

Supported

Maintained state beats reconstruction. v0.3 measured an 8-point planning-correctness lift from a substrate-derived projection over a raw 20-turn log; v0.4a replicated the maintained-state-vs-raw delta at +5.3 pp.
Compression alone does not explain the lift. v0.4a.2 ran an LLM summary of the raw log at the same token budget that produced v0.4a's Arm B (Arm A′, matched generator, temperature, seed, max output). A′ reached 90.7% — Arm A's level, not Arm B's. The substrate transformation is the load-bearing operation.
Sparse substrate-derived projections support planning at dramatically lower token budgets. v0.4a Arm C — bare belief_type :: claim per cluster, no warrant fields, no lifecycle marker — reached 97.3% planning correctness at ~208 input tokens. The minimum-sufficient projection observed in v0.4a is sparser than v0.3 made it look.

Unsupported or weakened

Lifecycle markers in the planner projection improve planning. v0.4a.1 Arm E (full discipline with [active]/[weakened]/[contradicted] prefix in the projection) ≈ Arm B; Arm D (warrants without the lifecycle prefix) ≈ Arm C. Lifecycle as substrate machinery remains load-bearing (filtering retired beliefs before projection). Lifecycle as projection content does not separate at the tested budget on the operational substrate.
Warrant fields in the planner projection improve planning. v0.4a.1 Arm D (claims with auth / evidence / decay / last per cluster) underperformed Arm C (claims only) by 1.3 pp at +60% input tokens. The harm and the cost both concentrate at the warrant-introduction step. Warrants as substrate remain part of the architectural contract; warrants as projection content for the planner did not pay at this cell.
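The distinction between lifecycle as substrate machinery and lifecycle as projection content can be made concrete with a small sketch. The Belief fields, the last_seen recency counter, and the choice to hide only the "retired" state are assumptions for illustration; the experiments' actual clustering and ranking rules are defined in the case-study reports, not here. The output format follows the Arm C description: one bare belief_type :: claim line per cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    cluster: str       # cluster id the belief belongs to
    belief_type: str   # e.g. "validation_pending"
    claim: str
    lifecycle: str     # e.g. "active", "weakened", "contradicted", "retired"
    last_seen: int     # hypothetical recency counter (turn index)


def sparse_projection(beliefs, max_lines=10, hidden_states=("retired",)):
    """Render one 'belief_type :: claim' line per cluster.

    Lifecycle is used only as a filter; no lifecycle marker or warrant
    field appears in the rendered text.
    """
    latest = {}
    for b in beliefs:
        if b.lifecycle in hidden_states:
            continue                      # filtered before projection
        kept = latest.get(b.cluster)
        if kept is None or b.last_seen > kept.last_seen:
            latest[b.cluster] = b         # keep the most recent belief per cluster
    ranked = sorted(latest.values(), key=lambda b: b.last_seen, reverse=True)
    return "\n".join(f"{b.belief_type} :: {b.claim}" for b in ranked[:max_lines])
```

The substrate still carries lifecycle and warrant data for the human inspection surface; the planner projection simply does not render it.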

Open

Cross-substrate transfer. All measured results are on operational substrate (Claude Code session logs). Sensemaking, multi-actor planning, narrative tracking, and other domains may exhibit different projection-vs-substrate trade-offs. v0.4a's single-substrate caveat is the largest live risk to the architectural claims above.
End-to-end economics. Substrate-side maintenance cost (write-path: deriving belief instances and lifecycle events from real event streams) has not been measured. v0.4a measured planning consumption only. The economic question — when does the maintenance loop amortize against query savings — is still open.
Model variance. All v0.3 / v0.4a results held the generator fixed at gpt-4o-2024-08-06. Whether other models (Claude family, Gemini, smaller models) produce the same projection trade-off is untested.
Higher budget regimes. v0.4a tested at a ~285-token cap. At larger budgets the richer projection renderings may have room to demonstrate value that they did not at the small budget. The Arm C Pareto dominance may not transfer.

Per-usage-pattern detail follows. The summary above is the load-bearing one; the sections below preserve the per-pattern chronology and the case-study links.

Sensemaking

Instantiated in Sensemaking v1.5 over financial-market narrative pressure across 31 actors and 173 days. Locked methodology produced near-chance aggregate calibration with measurable regional heterogeneity. The pattern works as a measurement substrate; whether it produces actionable forward signal beyond the existing pipeline is the open question for v2.

Reasoning-trace state management

Operational Belief-State Grounding v0.1 tested the AI-facing overlay surface. System B received the same recent log as System A plus an additive operational belief overlay. On 73 paired questions, the overlay reduced aggregate operational error from 11.0% to 5.5%, with the largest improvement on false-completion claims, 27% to 7%. The result supports compact belief overlays as a practical grounding surface, while the two oversized-overlay failures motivate budgeted overlay ranking in v0.2.

Why this matters

Long-running assistants do not only need more context. They need to know which workflow assumptions are still allowed to act. A recent log can show that a file was edited, a command was run, or a user asked for a deploy. It does not necessarily preserve the current operational state: whether the fix was validated, whether approval is still pending, whether a prior failure has repeated, or whether "done" is still warranted.

Operational Belief-State Grounding v0.1 tested this directly. System A received the last 20 turns of the session log. System B received the same log plus an operational belief overlay: active beliefs such as validation_pending, fix_attempted, pipeline_running, and action_blocked. On 73 paired questions, the overlay reduced aggregate workflow-state errors from 11.0% to 5.5%, with the largest improvement on false-completion claims.

The result does not show that Belief Stack "solves agents." It shows a narrower point: maintaining operational beliefs can help an LLM answer questions about workflow state more accurately than recent log context alone.

For the record, the locked numbers from Operational Belief-State Grounding v0.1 (paired n=73): aggregate operational error 11.0% → 5.5%; false-completion claims 27% → 7%; stale-validation 14% → 7%; repeated-failure and premature-action both zero in both systems (non-results); missing-pause tied at 13%.

Preference judging was supportive but narrower: blind judging preferred the overlay on traceability, 47% vs 32%, while appropriate caution was mostly tied. The result supports the operational wedge: maintained beliefs about validation, pending state, action blockers, and completion status can carry workflow-state information that the same recent log does not reliably expose.

The limit is equally important. Two System B contexts failed generation because unbounded overlays exceeded the OpenAI organization's TPM cap. This does not invalidate v0.1, but it makes the v0.2 design question clear: operational overlays need prioritization or ranking before production use.

Operational Belief-State Grounding v0.2.2 tested whether the v0.1 lift survived compression. It did: a 100-token ranked overlay matched the v0.1 deterministic error reduction, cutting aggregate operational-state errors from 10.7% to 5.3%, while eliminating the v0.1 oversized-context failures. Larger overlays did not improve results; B500 degraded to 8.0%, suggesting that AI-facing overlays work best as ranked attention compressors, not database dumps. Preference judging favored raw-log answers on visible traceability, reinforcing the two-surface design: compact overlays for models, full traces for humans.
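A ranked, budgeted overlay of the kind v0.2.2 describes could be sketched as a greedy packer. This is an illustrative assumption, not the v0.2.2 ranking method: the priority scores are taken as given, the function names are hypothetical, and token cost is approximated by whitespace splitting where a real system would use the model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude whitespace proxy; a real system would use the model's tokenizer.
    return len(text.split())


def ranked_overlay(beliefs, budget_tokens=100):
    """Greedily pack the highest-priority belief lines that fit the budget.

    `beliefs` is an iterable of (priority, line) pairs; higher priority
    ranks first. Returns the overlay text and the tokens it uses.
    """
    selected, used = [], 0
    for _priority, line in sorted(beliefs, key=lambda p: p[0], reverse=True):
        cost = estimate_tokens(line)
        if used + cost > budget_tokens:
            continue          # skip lines that do not fit; cheaper ones may still fit
        selected.append(line)
        used += cost
    return "\n".join(selected), used
```

Because the budget is enforced before generation, an overlay built this way cannot grow without bound, which is the failure mode the two oversized v0.1 contexts exposed.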

Belief Stack v0.3 extended the test from the inspection surface to the planning surface, under a substitutive comparison rather than the additive comparison of v0.1 / v0.2.2. Three arms over the same 75 single-next-action questions: raw log + strong-baseline reconstruction prompt (A), belief overlay only with no raw log (B), overlay plus a K=3-turn scratchpad (C). The smallest arm won outright: A reached 90.7% planning correctness on 2,037 mean input tokens; B reached 98.7% on 285 tokens (14% of A) and 1.11 seconds per call (3.2× faster than A); C reached 94.7%. Zero grounding-bankruptcy candidates — there is not a single question where B failed and both A and C succeeded. The model was not under-informed by the smaller context; it was overburdened by reconstruction in the larger one.

Evidence-backed claim (sharpened post-v0.4a, replicated post-v0.4c1)

Maintained state is a planning primitive.

Substrate-derived projections of currently-active beliefs beat raw context at a fraction of the tokens (v0.3 + v0.4a). LLM compression of raw context at matched budget does not recover the lift on the single model tested (v0.4a.2). The thesis replicates across four models from three providers (v0.4c1): every B−A and C−A delta is positive; Arm C is the Pareto winner or tied for winner on every tested model. The substrate transformation, not the projection format, appears to be doing the work on non-thinking models; the compression-vs-substrate isolation is model-dependent and does not yet generalize cleanly. This rests on three pre-registered experiments and is not a universal claim, but it is strong enough to justify treating maintained state as a first-class agent architecture component distinct from context summarization, from agent memory, and from log-based observability.

Belief Stack v0.4a — the mechanism ablation that followed v0.3 — was scoped as a two-phase experimental program designed to test v0.3's attribution. Phase 1 (5-arm ladder A/B/C/D/E) found that the spec's projection-side discipline (warrants and lifecycle markers rendered in the planner's context) does not add measurable correctness over a maintained-summary baseline at the 285-token budget on this substrate. The ladder flattened at Arm B (97.3%); Arm C (bare structured names, no warrants, no lifecycle marker) tied at 97.3% on ~208 tokens — strictly Pareto-dominant on every measured axis. Arms D and E (with warrants and lifecycle markers) underperformed at 96.0%. The pre-registered §7 Outcome 5 fired: lifecycle/warrant discipline does not add measurable value over maintained summaries on this substrate.

Phase 2 (v0.4a.2 amendment, locked + run the same evening) added a sixth arm, A′, to isolate compression from substrate transformation. A′ = LLM prose summary of the raw K=20 log at the matched ~285-token budget; same generator, same temperature, same seed, same answer-time prompt as Arm B — only the source differed (raw log vs §3.5a-clustered active beliefs). A′ reached 90.7%, well below B's 97.3% and within the noise floor of A's 92.0%. The pre-registered §12 Outcome A_prime_near_A fired: compression of raw log alone does not reach maintained-substrate correctness; the substrate transformation is doing the work.

Combined interpretation: v0.3's numbers replicate cleanly as the B-vs-A delta. v0.3's attribution of the lift to the spec's full claims+warrants+lifecycle discipline rendered in the projection was over-strong — that specific attribution is empirically weakened by v0.4a.1. The underlying thesis — maintained state is a planning primitive — is empirically strengthened by v0.4a.2 because the obvious counter-argument (compression alone) is now ruled out on a single model. Full report: BELIEF_STACK_REPORT_v0.4a.md.

Belief Stack v0.4c1 extended the program to four models from three providers (gpt-4o-2024-08-06, claude-opus-4-7, gemini-2.5-pro, and claude-haiku-4-5-20251001), holding the substrate, the evaluation set, and the scoring methodology constant. 1,200 cells total, zero failures. Cross-model averages: 99.0% correctness at 241 mean input tokens for Arm C; 89.3% at 2,502 for Arm A. Every B−A and C−A delta was positive on every model; the thesis held across the field. The pre-registered cross-model classifier returned compression_finding_does_not_generalize: on Gemini Pro (with required thinking mode), LLM-compressed raw history reached the same correctness as substrate-derived prose summary, narrowing the v0.4a.2 compression-vs-substrate isolation to model-dependent. The substrate transformation appears load-bearing on non-thinking models; the mechanism by which the compression-vs-substrate distinction operates depends on whether the model has an internal thinking phase. The thesis (maintained state beats raw reconstruction) is preserved; the mechanism subclaim narrows. Full report: BELIEF_STACK_REPORT_v0.4c1.md.

(Earlier work in this pattern: the TKOS log-replay v1 case study established the 164-session substrate and surfaced a structural signature-extraction bug that the operational v0.1 scorer now corrects.)

Stack-grounded intelligence

Instantiated in Stack-Grounded Retrieval v0.1 (locked 2026-05-31): 75 questions across five categories, paired System A (raw L0 chunk RAG) vs System B (L0 + L1 + L2 belief objects), under identical-prompt constraint. The deterministic gate fell against the belief-state system on four of five metrics:

Metric	System A	System B
stale-claim	1%	12%
unsupported-claim	14%	28%
contradiction-omission	24%	36%
insufficient-warrant overclaim	1%	8%
evidence-boundary	1%	0%

Preference was mixed: System B won on caution (57% vs 28%), System A won on sensemaking usefulness (61% vs 31%), and traceability was close. A follow-up rendering prototype (C1) moved preference toward B but did not close the deterministic gap.

The lesson the spec carries forward: the experiment did not falsify the architecture; it falsified one operationalization. Specifically, raw belief objects under a minimal prompt — without L0 inlining, without a consumption-contract semantic layer, without annotation-on-retrieval — behave more like clustered-narrative-lifecycle metadata than like beliefs in the sense §0 defines them. That observation drove the spec's sharpening of the belief definition itself, from label + warrant to claim + warrant + lifecycle. The category-level pattern (B wins lifecycle/warrant questions, loses synthesis-heavy ones) suggests the substrate is fit-for-purpose for some question shapes and not others; v2 directions remain open.

Operational Belief-State Grounding v0.1 tests the more naturally typed case. There, the overlay is additive rather than substitutive: the model still receives the recent log, and the belief layer supplies the maintained state that the recent log may not surface. That additive pattern is now the cleaner near-term implementation path. The operational use case is the strongest empirical anchor for Belief Stack today. Stack-grounded intelligence remains valid as a harder research track, but the first measured lift is operational belief-state grounding.

Schemas (v0.1)

One schema is pinned in this revision: the warrant schema, which defines the structural requirements for the evidence record attached to every L1 assignment. Additional schemas (listed below as forthcoming) will pin in subsequent v0.x revisions as the spec settles. Schemas are versioned; the schema_version field is required on every instance.

Warrant schema

Canonical schema: /schemas/warrant-v0.1.json

Required fields, applicable to all warrant types:

{
  "schema_version":   "warrant-v0.1",
  "warrant_type":     "decaying" | "invariant",
  "birth_timestamp":  "<ISO 8601>",
  "support_n":        <integer ≥ 0>,
  "coverage_status":  "IN_DISTRIBUTION" | "OUT_OF_DISTRIBUTION" | "UNCLASSIFIED"
}

Decaying warrants additionally carry relaxation_half_life_seconds; invariant warrants carry evidence_refs and validation_status. Optional fields for clustering-based L1: distance_to_centroid, coverage_threshold, confidence. See the canonical schema for the full specification.
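As a rough illustration of how an instance might be checked against the fields listed above, here is a minimal Python sketch. It is not a substitute for the canonical JSON schema: the function name validate_warrant and the problem messages are hypothetical, value types for the type-specific fields are not checked because this section does not state them, and the timestamp check relies on datetime.fromisoformat, which accepts only a subset of ISO 8601 on older Python versions.

```python
from datetime import datetime

REQUIRED = {"schema_version", "warrant_type", "birth_timestamp",
            "support_n", "coverage_status"}
COVERAGE = {"IN_DISTRIBUTION", "OUT_OF_DISTRIBUTION", "UNCLASSIFIED"}
TYPE_FIELDS = {
    "decaying": {"relaxation_half_life_seconds"},
    "invariant": {"evidence_refs", "validation_status"},
}


def validate_warrant(w: dict) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    missing = REQUIRED - set(w)
    if missing:
        return [f"missing field: {name}" for name in sorted(missing)]
    problems = []
    if w["schema_version"] != "warrant-v0.1":
        problems.append("unsupported schema_version")
    if w["coverage_status"] not in COVERAGE:
        problems.append("invalid coverage_status")
    n = w["support_n"]
    if isinstance(n, bool) or not isinstance(n, int) or n < 0:
        problems.append("support_n must be an integer >= 0")
    try:
        datetime.fromisoformat(w["birth_timestamp"])
    except (TypeError, ValueError):
        problems.append("birth_timestamp is not ISO 8601")
    if w["warrant_type"] not in TYPE_FIELDS:
        problems.append("invalid warrant_type")
    else:
        # Presence check only for the type-specific fields.
        for name in sorted(TYPE_FIELDS[w["warrant_type"]] - set(w)):
            problems.append(f"missing field for {w['warrant_type']} warrant: {name}")
    return problems
```

A decaying warrant with all five required fields plus relaxation_half_life_seconds returns an empty list; the same instance relabeled as invariant reports the two missing invariant fields.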

Future schemas (not yet pinned)

The following schemas will be added in subsequent v0.x revisions:

belief-step-v0.x — a single (label, warrant) assignment for a step in a reasoning trace
region-assignment-v0.x — an L1 output object including the region label and full warrant
lifecycle-event-v0.x — an L3 event marking a belief's state transition
intervention-applicability-v0.x — the governance-layer object that gates whether an intervention may fire against current evidence

Open questions

Things this v0.1 deliberately does not settle. These are the places where reader feedback is most useful.

Whether a single warrant schema can carry both decaying and invariant cases cleanly, or whether they should be separate schemas inheriting from a base.
What confidence representation is canonical — heuristic softmax-over-distances, calibrated density likelihood, or something more substrate-specific.
Whether L1.2 (drift state for trajectory substrates) belongs as a sibling to L1 or as a field on L1 warrants. Currently modeled as sibling; the alternative has not been ruled out.
How coverage thresholds should be set — held-out validation, density quantile, or post-hoc heuristic — and whether the spec should mandate one approach.
Whether lifecycle state should be a categorical enum (as currently modeled) or a continuous warrant-decay function.
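The last question can be illustrated with a short sketch that puts the two representations side by side. Both functions are assumptions: the spec names relaxation_half_life_seconds but does not pin a decay formula, so an exponential half-life form is assumed here, and the thresholds that bucket a weight into a categorical state are hypothetical. States driven by evidence rather than time, such as contradicted, are outside what this sketch covers.

```python
def warrant_weight(elapsed_seconds: float, half_life_seconds: float) -> float:
    """Continuous view: the weight halves every half-life (assumed exponential form)."""
    return 0.5 ** (elapsed_seconds / half_life_seconds)


def lifecycle_state(weight: float,
                    weakened_below: float = 0.5,
                    retired_below: float = 0.1) -> str:
    """Categorical view: bucket the continuous weight using hypothetical thresholds."""
    if weight < retired_below:
        return "retired"
    if weight < weakened_below:
        return "weakened"
    return "active"
```

Under these assumptions the enum is a lossy projection of the continuous function, which is one way to frame the trade-off: the enum is easier to filter on, while the decay function preserves how close a belief is to its next transition.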

Versioning

This spec follows major.minor versioning. Schema additions are minor revisions; semantic changes that break existing instances are major revisions. The current version is v0.1. Pinned schema files at /schemas/ remain stable; new versions ship as separate files alongside the old ones.
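One way a consumer might act on this rule is to parse the schema_version tag and compare versions before reading an instance. The sketch below is an assumption layered on the stated rule, not part of the spec: the function names are hypothetical, and the compatibility policy (same schema name, same major version, reader minor version at least the instance's) is one plausible reading of "additions are minor, breaking changes are major".

```python
import re

# Matches tags such as "warrant-v0.1" or "region-assignment-v0.2".
_VERSION = re.compile(r"^(?P<name>[a-z][a-z-]*)-v(?P<major>\d+)\.(?P<minor>\d+)$")


def parse_schema_version(tag: str):
    """Split a schema_version tag into (name, major, minor)."""
    m = _VERSION.match(tag)
    if m is None:
        raise ValueError(f"unrecognized schema_version: {tag!r}")
    return m["name"], int(m["major"]), int(m["minor"])


def can_read(reader: str, instance: str) -> bool:
    """Assumed policy: same schema, same major, reader minor >= instance minor."""
    r_name, r_major, r_minor = parse_schema_version(reader)
    i_name, i_major, i_minor = parse_schema_version(instance)
    return r_name == i_name and r_major == i_major and r_minor >= i_minor
```

Placeholder tags such as belief-step-v0.x are rejected by the parser, which matches their status as not yet pinned.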

Citation

BibTeX:

@software{stranburg2026beliefstack,
  author  = {Stranburg, Susan},
  title   = {The Belief Stack: An architectural pattern for runtime
             belief lifecycle management},
  version = {0.1.0},
  year    = {2026},
  url     = {https://topicspace.ai/research/belief-stack},
  note    = {Open specification, early stage}
}

APA:

Stranburg, S. (2026). The Belief Stack: An architectural pattern
  for runtime belief lifecycle management (Version 0.1.0)
  [Open specification]. TopicSpace.
  https://topicspace.ai/research/belief-stack

Related work

Essays, case studies, and research notes that develop the concepts in this spec and demonstrate them on specific substrates.

Reading the TopicSpace Belief Field: A v1 Case Study — descriptive and behavioral audit of the Belief Stack pattern applied to a financial markets / AI ecosystem sensemaking substrate.
Watching an Assistant Forget: A TKOS Log-Replay Case Study — preregistered offline audit of long-running Claude session logs; v0.1 / v0.2 head-to-head with full TP/FP/FN/TN accounting on a typed operational substrate.
The LLM That Forgot Time — anatomy of a runtime epistemic failure across the whole belief lifecycle.
Earned Governance for Runtime LLM Intervention — discipline applied to the optional GOV layer: governance rules must demonstrate measurable value before deployment eligibility.
A pattern for problems where beliefs must evolve — the five-layer stack that motivates this spec.
Region-based evaluation and guardrail calibration — empirical anchor for the L1 + L2 framing on real conversation logs.
From static mimicry to active epistemology — the broader category claim (REPI) treated as a proposed direction.

Open specification, early stage.  Feedback, suggested revisions, and collaboration welcome.  Reference implementation (open source) and additional schemas will publish at github.com under a permissive license; that work is in progress and is not yet part of v0.1.
Last revised: 2026-06-02.
