Reading the TopicSpace Belief Field: A v1 Case Study
A descriptive and behavioral audit of the TopicSpace sensemaking pipeline as a Belief Stack instantiation. Demonstrates how the stack maintains a structured belief field — with births, retirements, revisions, warrant checks, and lifecycle states — over a 173-day window across 42 actors and 2,100 expectations.
v1.5 v0.1 appended 2026-05-29; v1.5 v0.2 + v0.3 + v0.4 calibration updates appended 2026-05-30.
Planned canonical URL:
https://topicspace.ai/research/case-studies/sensemaking-v1
What this case study tests
The controlling question: does the TopicSpace sensemaking pipeline maintain a structured belief field — with births, retirements, revisions, warrant checks, and lifecycle states — or does it produce a sequence of daily labels that happen to live in the same database? The architectural claim of the Belief Stack pattern is the first thing, not the second. v1 tests whether the implementation matches that claim on a real substrate.
Three properties anchor the evaluation.
- Is the field moving? Do beliefs come into being, age, and get retired across the window?
- Are beliefs revised under evidence pressure? Do some claims update mid-life rather than only at retirement?
- Does the system know when it has warrant? Does it report coverage status, or does it emit a confident read regardless of evidence depth?
The descriptive statistics that follow speak to each of these directly.
Effectiveness, meaning whether the system separates forward outcomes under locked pre-registered rules, is out of scope for v1 and is the subject of the v1.5 sections below (v0.1 through v0.4). Those updates introduce the pre-registration discipline and report results separately; this v1 section remains descriptive only.
What this does not claim
This case study restricts its claims to what the descriptive data supports.
- It does not claim price prediction.
- It does not claim alpha.
- It does not claim that current state labels are final.
- It evaluates whether the stack improves attention allocation and belief revision under noisy evidence — nothing more.
Headline field statistics
Window: 2025-12-05 → 2026-05-26 (173 days). 42 actors tracked. All numbers computed from data/derived/expectation_lifecycle_events.parquet and data/derived/backtest_history.parquet.
| Metric | Value | Note |
|---|---|---|
| Expectations born | 2,100 | ~18.1 / day mean; 95% of days produced at least one |
| Completed lifecycles | 1,998 | 95% reached a terminal state in-window |
| Median lifecycle length | 8 days | p25=3d, p75=18d, max=151d |
| Sufficient-data rate | 95.9% | warrant coverage: (date, ticker) observations with enough evidence to emit a calibrated read |
| Meaningful revision rate | 14.4% | entities with ≥1 reconfirmed / strengthened / weakened / contradicted event |
| Currently alive | 4.9% | most expectations resolved in-window; field is mostly past-tense |
The field is a moving population: roughly one new expectation per 80 minutes of active time, median life of 8 days, almost all resolved by window's end. The 14.4% revision rate is a small share but the most consequential one: it separates a snapshot dashboard from a maintained belief field.
Lifecycle path distribution
What shapes do lifecycles take? Across 2,100 entities:
| Path | Count | % |
|---|---|---|
| Born → retired (no revision) | 1,649 | 78.5% |
| Reconfirmed during life | 165 | 7.9% |
| Contradicted → retired | 137 | 6.5% |
| Born, still active (no revision) | 101 | 4.8% |
| Strengthened → retired | 19 | 0.9% |
| Weakened → retired | 22 | 1.0% |
| Contradicted, still active | 6 | 0.3% |
| Strengthened, still active | 1 | 0.0% |
The 78.5% born-then-retired-without-revision is the largest single category and the most easily misread. In this substrate, most expectations are transient. That is not automatically a weakness: short-lived narratives should often expire rather than be forced into durable claims. The important test is whether the system can distinguish ordinary expiration from evidence-driven revision.
The 14.4% of entities that did receive revision pressure (reconfirmed, contradicted, strengthened, or weakened) is the architecturally consequential subset. It is a small minority of the field, and the one that separates a maintained belief field from a sequence of disconnected daily labels.
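The path labels can be derived mechanically from each entity's event sequence. The following is a minimal sketch, assuming each entity carries an ordered list of event-type strings drawn from the lifecycle vocabulary this case study reports. The function name `classify_path`, the entity ids, and the precedence rule for entities with more than one revision type are assumptions made for illustration, not the pipeline's actual logic.

```python
from collections import Counter

# Revision event types named in this case study. The precedence order
# (first match wins) is an assumption made for this sketch.
REVISION_PRECEDENCE = ["contradicted", "reconfirmed", "strengthened", "weakened"]


def classify_path(events):
    """Map one entity's ordered event types to a coarse lifecycle path label."""
    retired = "retired" in events
    revision = next((e for e in REVISION_PRECEDENCE if e in events), None)
    if revision is None:
        if retired:
            return "born -> retired (no revision)"
        return "born, still active (no revision)"
    status = "retired" if retired else "still active"
    return f"{revision}, {status}"


# Hypothetical event sequences, for illustration only.
entities = {
    "exp-001": ["born", "retired"],
    "exp-002": ["born", "reconfirmed", "retired"],
    "exp-003": ["born", "contradicted"],
    "exp-004": ["born"],
}

paths = {name: classify_path(events) for name, events in entities.items()}
path_counts = Counter(paths.values())

# Share of entities with at least one revision event of any type.
revised_share = sum(
    1 for events in entities.values()
    if any(e in REVISION_PRECEDENCE for e in events)
) / len(entities)
```

Separating "expired without revision" from "revised" at classification time is what lets ordinary expiration and evidence-driven revision be counted as different things.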
Lead trace: When the Field Got Sharper Without Changing Shape
On May 25, the system received ten days of previously missing market evidence. By the morning of May 26, it had revised 24 specific beliefs locally — without changing what it thought about the overall market.
Controlling claim: the system distinguished evidence-volume from background-field shift.
Five-step spine
- More evidence arrived.
- Local warrants changed.
- Specific expectations resurfaced or revised.
- Some actor states sharpened.
- The global field shape stayed broadly stable.
Lifecycle revisions in window (2026-05-18 → 2026-05-27)
| Event type | Count |
|---|---|
| contradicted | 12 |
| reconfirmed | 8 |
| weakened | 3 |
| strengthened | 1 |
| Total revisions | 24 |
Resurfaced expectations — clearest backfill signature
Eight lifecycle events were explicitly tagged “resurfaced after Nd gap”: direct evidence of previously stale expectations reconfirmed by newly ingested data. Gap distribution: 8d, 11d, 12d, 14d, 15d, 18d, 21d, 34d.
| Date | Gap | Resurfaced expectation | Δ conv. |
|---|---|---|---|
| 2026-05-18 | 11d | AAPL: AI Features Drive Early Momentum | +0.00 |
| 2026-05-18 | 18d | Promising AI Adoption, Yet Price Lag Persists | +0.00 |
| 2026-05-19 | 15d | PLTR: narrative not converting to price | -0.05 |
| 2026-05-21 | 21d | AWS: cost efficiency faces pricing catch-up | +0.00 |
| 2026-05-26 | 12d | CEG: price surges ahead of nuclear narrative | -0.05 |
| 2026-05-26 | 14d | Narrative surges ahead in AI data-center supplier | +0.05 |
| 2026-05-26 | 34d | PLTR: narrative strength misaligns with repricing | +0.00 |
| 2026-05-26 | 8d | ARM: AI chip expansion faces narrative lag | +0.30 |
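One way such a tag could be produced is by comparing the date an expectation was last supported with the date new evidence touches it again. This is a sketch under stated assumptions: the helper `resurfaced_tag`, the 7-day staleness threshold, and the use of calendar days are all hypothetical, since the text reports the resulting gaps but not the rule that generates them.

```python
from datetime import date

# Hypothetical staleness threshold: the smallest gap reported in the table
# is 8 days, but the pipeline's actual cutoff is not stated in the text.
STALE_AFTER_DAYS = 7


def resurfaced_tag(last_seen, seen_again, stale_after_days=STALE_AFTER_DAYS):
    """Return a 'resurfaced after Nd gap' detail string, or None for short gaps."""
    gap_days = (seen_again - last_seen).days  # calendar days (assumption)
    if gap_days > stale_after_days:
        return f"resurfaced after {gap_days}d gap"
    return None


# Illustrative dates only: the last-seen date is back-computed from an
# 11-day gap ending 2026-05-18, not read from the underlying data.
tag_long_gap = resurfaced_tag(date(2026, 5, 7), date(2026, 5, 18))
tag_short_gap = resurfaced_tag(date(2026, 5, 24), date(2026, 5, 26))
```

Writing the gap into the detail string keeps the provenance attached to the event itself, so a later reader can tell a revival from a routine reconfirmation.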
Global state distribution — broadly stable
| State | Before (2026-05-22) | After (2026-05-26) |
|---|---|---|
| REPRICING | 11 | 9 |
| CONFIRMED | 8 | 7 |
| DIVERGENCE | 2 | 6 |
| DISAGREEMENT | 5 | 4 |
| EARLY | 4 | 2 |
| NEG_CONFIRMATION | 1 | 2 |
| PRICE-LED | 0 | 2 |
| MACRO | 1 | 0 |
32 actors on each date. Distribution shifted, did not invert. The largest local move: DIVERGENCE rose from 2 to 6 as backfilled positive coverage exposed actors whose narrative had run ahead of price.
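The "shifted, did not invert" reading can be checked directly from the two columns. The sketch below uses the counts from the table; total variation distance is a summary metric chosen here for illustration and is not something the case study itself reports.

```python
# State counts copied from the table above (32 actors on each date).
before = {"REPRICING": 11, "CONFIRMED": 8, "DIVERGENCE": 2, "DISAGREEMENT": 5,
          "EARLY": 4, "NEG_CONFIRMATION": 1, "PRICE-LED": 0, "MACRO": 1}
after = {"REPRICING": 9, "CONFIRMED": 7, "DIVERGENCE": 6, "DISAGREEMENT": 4,
         "EARLY": 2, "NEG_CONFIRMATION": 2, "PRICE-LED": 2, "MACRO": 0}

total_before = sum(before.values())
total_after = sum(after.values())

# Per-state change in actor count, and the state that moved the most.
delta = {state: after[state] - before[state] for state in before}
largest_mover = max(delta, key=lambda s: abs(delta[s]))

# Total variation distance: half the summed absolute difference in shares.
# 0 means identical distributions, 1 means no overlap at all.
tvd = 0.5 * sum(
    abs(after[s] / total_after - before[s] / total_before) for s in before
)

# The two most populated states on each date, in rank order.
top2_before = sorted(before, key=before.get, reverse=True)[:2]
top2_after = sorted(after, key=after.get, reverse=True)[:2]
```

On these counts the distance comes out a little above 0.2 and the two largest states keep their rank, which is consistent with a local shift rather than a wholesale reordering.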
State transitions on 2026-05-26
| Ticker | State 2026-05-22 → 2026-05-26 | NDS 2026-05-22 → 2026-05-26 |
|---|---|---|
| NVDA | DIVERGENCE → DIVERGENCE (held) | 73.2 → 79.1 |
| MSFT | REPRICING → DIVERGENCE | 54.0 → 70.1 |
| ADBE | REPRICING → DIVERGENCE | 37.8 → 71.9 |
| GOOGL | REPRICING → DIVERGENCE | 68.5 → 72.4 |
NVDA remained in DIVERGENCE through the backfill (already there before; held after). MSFT, ADBE, and GOOGL moved into DIVERGENCE on 2026-05-26 as the newly-ingested positive coverage sharpened the read on actors with narrative-vs-price tension. This is the shape of the central claim: local resolution improved; the background field did not shift.
The five steps composed cleanly in the data. Evidence arrived (step 1) and local warrants changed (step 2): twenty-four lifecycle revisions fired in the nine-day window, with eight reconfirmations explicitly tagged resurfaced after Nd gap. The tag itself is the architectural signature — the system did not merely revise priors; it tracked which priors had aged out and were being revived by newly-ingested evidence. Eight specific expectations resurfaced (step 3) across gaps of 8 to 34 days, with conviction deltas ranging from -0.05 to +0.30 and several at 0.00 where new evidence reinforced rather than altered the prior.
Some actor states sharpened (step 4): MSFT, ADBE, and GOOGL transitioned from REPRICING into DIVERGENCE on 2026-05-26 as backfilled positive coverage exposed actors whose narrative had run ahead of price. NVDA, already in DIVERGENCE, held — the system did not re-label what was already correctly resolved. The global field shape stayed broadly stable (step 5): the same set of regions appeared in both reference distributions, with proportional shifts rather than wholesale replacement.
This is what the controlling claim looks like in the data. A snapshot dashboard receiving the same backfill would report: more data arrived. The Belief Stack reported: more data arrived, eight specific stale claims revived, four actors ended up in local divergence (three newly, one held), and the overall background field did not shift. The distinction lives at L1 — the warrant carried on each region assignment tracks coverage status, which is what permits local revision without global overreaction. Whether the post-backfill state was retrospectively right (L4 forward calibration) is a separate question, deferred to v1.5.
Supporting traces
Three short traces — values only, not full case studies. Goal is texture for the lead. Underlying data in data/derived/sensemaking_trace_packets.md.
Price-led anomaly — ARM / ALAB / USAR
- ARM: 121 lifecycle events. State held CONFIRMED May 20–27. NDS swung -47 → -120 → -190 → -188 → -119 as price ran ahead of narrative. rel_5d peaked at +45.9%.
- ALAB: sustained PRICE-LED state through late May. Current NDS=-152.0, rel=+29.2%.
- USAR: sustained PRICE-LED. Current NDS=-160.8, rel=+29.8%.
- Cluster shape: 10 actors in price-led movement on 2026-05-28 (DELL, ARM, USAR, SMCI, ALAB, NBIS, WDC, VST, ZETA, ODC). Infrastructure-heavy.
The cluster shows the field representing a setup where price moves ahead of narrative coverage. ARM held CONFIRMED while its NDS swung from -47 to -190 across late May, signaling sustained relative strength against an unconverged narrative; ALAB and USAR exhibit the same pattern at smaller magnitudes. The system did not force the narrative to match the price action — it surfaced the divergence as its own setup type, observable across an actor cluster rather than a single ticker.
Narrative divergence — NVDA / MSFT
- NVDA: held DIVERGENCE through the May 25 backfill. narr=79 vs rel=-7.6% on 2026-05-27. Strong narrative, price not following.
- MSFT: transitioned REPRICING → DIVERGENCE on 2026-05-26 as backfilled coverage exposed narrative-price tension. NDS 54.0 → 70.1 across the backfill.
- Both lifecycle: positive-direction expectations active throughout the window; conviction sustained or rising.
Narrative divergence is the inverse of the price-led setup: story present, price not paying. NVDA held DIVERGENCE through the May 25 backfill — already classified that way before, narrative running ahead, market unconvinced. MSFT transitioned REPRICING → DIVERGENCE on 2026-05-26 as backfilled coverage sharpened the read. Both demonstrate that the field can sustain a positive-direction expectation while explicitly noting the price has not yet validated it.
Clean confirmation — DELL
- DELL: PRICE-LED → CONFIRMED → PRICE-LED across late May. Narrative confirmed by price; conviction sustained.
- Current NDS=-106.0, rel=+24.7% on 2026-05-27. Both narrative and price aligned positive.
Clean confirmation is the alignment case. DELL's narrative and price moved together through late May, with conviction held through the lifecycle and no revision events firing. The point is not that DELL was the most interesting actor in the window; it is that the field categorizes alignment as its own setup type. A system that only surfaces tensions cannot tell its consumer where to trust the current read; the CONFIRMED label is itself a coverage statement.
What the stack made visible
The system knew which beliefs to revise locally and which to leave alone, because each belief carried its own evidence record.
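A belief that carries its own evidence record might look like the structure below. This is a minimal sketch, not the TopicSpace schema: the class names, field names, starting conviction, and evidence id are all hypothetical. Only the claim text, the revision date, the 12-day gap, and the -0.05 conviction change mirror the CEG row in the resurfaced-expectations table.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LifecycleEvent:
    on: date
    event_type: str   # e.g. "born", "reconfirmed", "contradicted", "retired"
    detail: str = ""  # provenance note, e.g. "resurfaced after 12d gap"


@dataclass
class Expectation:
    ticker: str
    claim: str
    conviction: float
    evidence_ids: list = field(default_factory=list)  # pointers into the evidence stream
    events: list = field(default_factory=list)        # ordered lifecycle history

    def revise(self, on, event_type, delta=0.0, detail="", evidence_id=None):
        """Apply one local revision and record why it happened."""
        self.conviction = round(self.conviction + delta, 2)
        if evidence_id is not None:
            self.evidence_ids.append(evidence_id)
        self.events.append(LifecycleEvent(on, event_type, detail))


# Hypothetical starting conviction and evidence id, for illustration only.
ceg = Expectation("CEG", "price surges ahead of nuclear narrative", conviction=0.60)
ceg.revise(date(2026, 5, 26), "reconfirmed", delta=-0.05,
           detail="resurfaced after 12d gap", evidence_id="headline-0001")
```

Because the revision touches only this one record, nothing about other beliefs or the overall state distribution has to change when it fires.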
Mapping observed behavior to Belief Stack layers
| Layer | What v1 demonstrates |
|---|---|
| L0 Evidence | Ordered, addressable, provenance-bearing event stream (timestamped headlines + price observations per actor). Backfill resurfaced 8 expectations with explicit gap citations — direct evidence that L0 carries provenance. |
| L1 Regions | Stable typology of actor-level setups (CONFIRMED / DIVERGENCE / REPRICING / PRICE-LED / NEG_CONFIRMATION / etc.). Backfill did not produce a wholesale state change. The same field structure remained visible, while local assignments sharpened. |
| L1 warrant coverage | 95.9% sufficient-data rate — fraction of (date, ticker) assignments with enough evidence to license a calibrated read. This is coverage status on the warrant, not L4 calibration. |
| L2 Priors | Per-region directional priors (NDS, narrative score) revised locally during the backfill window. 24 revisions in 9 days. |
| L3 Lifecycle | Born / strengthened / weakened / reconfirmed / contradicted / retired states fired at the right time, with provenance in the detail field (“resurfaced after Nd gap”, “contradicted by opposing expectation”). |
| L4 Forward calibration | Not evaluated in v1. Forward validation of whether L2 predictions held up against realized outcomes is deferred to v1.5, where pre-registered confirmation criteria and a control group are required to avoid self-confirming evaluation. |
What v1 demonstrates is L0 through L3 behavior in motion. The evidence field carried provenance forward across a multi-day window. The L1 typology remained stable while warrants on individual assignments updated. The L2 priors revised in response. The L3 lifecycle events fired with explicit reasons tagged in the detail field. Each layer contributed something the others did not, which is what makes the assemblage a stack rather than a single algorithm with multiple outputs.
What v1 does not demonstrate is L4. Forward calibration — whether the priors that fired under newly-ingested evidence ultimately held up against realized outcomes — requires pre-registered criteria, a control group, and a held-out window. Reporting forward returns alongside descriptive statistics would invite the self-confirming bias the methodology section flags. The honest version: v1 shows the stack is doing what it claims to do, structurally and behaviorally. Whether the priors it emits are predictive in the held-out sense is v1.5 work.
The distinction matters because the architectural promise of the Belief Stack is not a better dashboard. It is a maintained belief field with explicit warrant tracking. A dashboard receiving the May 25 backfill would have updated its readouts. The Belief Stack updated its readouts, surfaced eight specific stale claims that had been revived by new evidence, distinguished local revision from background-field shift, and did all of this with an explicit coverage record that flags when the system has enough warrant to speak. The case study shows the architecture operating that way on real data.
v1.5: Forward Outcome Calibration
Appended 2026-05-29 under a locked pre-registration. v1.5 does not prove prediction. It tests whether the v1 state labels separate forward outcome distributions from a non-flagged baseline drawn from the same corpus, under rolling walk-forward evaluation. The result is mixed, and the value of the v1.5 measurement is in being specific about where the field carries forward-outcome information and where it does not.
v1.5 in one table
Universe: 31 tickers tracked in v1's backtest_history.parquet baseline variant over the v1 window (~120 trading days, 3,627 primary evaluation rows). Forward returns measured at 5 and 20 trading days; relative to QQQ.
| Finding | Result | Read |
|---|---|---|
| Constructive bucket | +2.97% at 20D | modest separation |
| Cautious bucket | +1.81% at 20D | wrong direction |
| Lifecycle revisions | +51bps / −28bps at 5D | cleanest signal |
| Sufficient-data Constructive | +2.99% at 20D | warrant flag informative |
| REPRICING sensitivity | +2.97% → −1.64% | bucket assignment load-bearing |
Primary result: Constructive states show modest 20D separation
The Constructive bucket (CONFIRMED, EARLY, DISAGREEMENT) beats the Ambiguous baseline by an average forward 20D relative return of +2.97% (n=1,086 vs 1,236 baseline). At the 5-day horizon the separation collapses to +0.14% — effectively noise. The hit rate at 20D is 0.477, close to chance.
The signal is not evenly distributed across the bucket. At 20D, CONFIRMED averages +5.25% and DISAGREEMENT averages +7.37%. EARLY averages +0.68% — much closer to baseline and pulling the bucket average down. v0.2 should split EARLY from the Constructive headline or report it separately.
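The separation measure used here is a difference of means against the baseline bucket, read alongside a hit rate. The sketch below shows that computation on toy rows that are not the study's data. The bucket memberships follow the v0.1 mapping described in the text; the row layout, the key `fwd_rel_20d`, and the definition of a hit as a positive forward relative return are assumptions.

```python
from statistics import mean

# Bucket mapping under the v0.1 rules as described in the text.
CONSTRUCTIVE = {"CONFIRMED", "EARLY", "DISAGREEMENT"}
AMBIGUOUS = {"REPRICING", "MACRO", "PRICE-LED", "UNCLEAR"}


def bucket_delta(rows, bucket_states, baseline_states):
    """Mean forward relative return of a bucket minus the baseline mean."""
    bucket = [r["fwd_rel_20d"] for r in rows if r["state"] in bucket_states]
    baseline = [r["fwd_rel_20d"] for r in rows if r["state"] in baseline_states]
    return mean(bucket) - mean(baseline)


def hit_rate(rows, bucket_states):
    """Share of bucket rows with a positive forward relative return (assumed definition)."""
    bucket = [r["fwd_rel_20d"] for r in rows if r["state"] in bucket_states]
    return sum(1 for x in bucket if x > 0) / len(bucket)


# Toy rows, NOT the study's data. Returns are in percent, relative to a benchmark.
rows = [
    {"state": "CONFIRMED", "fwd_rel_20d": 6.0},
    {"state": "EARLY", "fwd_rel_20d": -2.0},
    {"state": "DISAGREEMENT", "fwd_rel_20d": 8.0},
    {"state": "EARLY", "fwd_rel_20d": -4.0},
    {"state": "REPRICING", "fwd_rel_20d": 1.0},
    {"state": "MACRO", "fwd_rel_20d": -1.0},
]

delta = bucket_delta(rows, CONSTRUCTIVE, AMBIGUOUS)
rate = hit_rate(rows, CONSTRUCTIVE)
```

The toy rows reproduce the shape of the finding: a positive mean separation can coexist with a hit rate near chance when a few large winners carry the bucket and one member state drags it down.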
Negative result: the Cautious bucket moves against its expected direction
The Cautious bucket (DIVERGENCE, NEG_CONFIRMATION) does NOT separate downward from baseline. Both states show positive average forward returns at 20D (DIVERGENCE +2.86%, NEG_CONFIRMATION +2.78%) and the bucket overall is +1.81% above baseline — the opposite of the directional implication.
This needs to be stated plainly: under v0.1 rules, the v1 system's “cautious” calls did not predict forward underperformance on this window. Two candidate explanations — neither tested here — are (a) the 2025-12 → 2026-05 AI ecosystem window was structurally constructive (rising tide carrying cautious-labeled names along), and (b) the cautious states are detecting noise that the market subsequently dismisses. v0.2 needs an actor-direction-conditional control to distinguish the two.
Best signal: lifecycle revisions at 5D
Of all the v1.5 measurements, the cleanest single signal is lifecycle revision-prediction at 5 days. Constructive revisions (reconfirmed + strengthened, n=216) outperform baseline by +51bps in average forward relative return. Cautious revisions (contradicted + weakened, n=242) underperform baseline by −28bps. Both directions go the right way, on small but coherent samples.
That is architecturally interesting. The lifecycle layer (L3) may carry sharper short-horizon signal than the static state label (L2). At 20D the cautious-revision result also reverses direction, so the lifecycle finding is specifically a 5-day phenomenon. v0.2 should investigate whether lifecycle events carry a temporally-decaying signal that is most legible immediately after the revision and dissipates over a few weeks.
Warrant coverage: the sufficient-data flag is informative
Partitioning the primary universe by the sufficient_data flag: at 20D, sufficient-data Constructive beats its partition baseline by +2.99% (n=1,068). Insufficient-data Constructive flips sign — it underperforms its partition baseline by −4.43% (n=18). The sample on the insufficient side is small and the numbers should not be over-read, but directionally the coverage flag carries information about whether a state's directional implication will hold.
This is consistent with the v1 coverage-discipline claim from §3 above: when the system marks an observation as low-warrant, it is correctly declining to make a confident claim. v1.5 does not conclude the coverage threshold is well-calibrated. It concludes the flag is informative at this sample size and window.
Load-bearing sensitivity: REPRICING's bucket
REPRICING is by far the largest single state population (1,027 rows at 20D, ~34% of the primary universe). The v0.1 pre-registration classified it as Ambiguous. The sensitivity appendix tests REPRICING-as-Constructive — every other rule unchanged.
The 20D Constructive Δ vs baseline of +2.97% under the locked rule becomes −1.64% under REPRICING-as-Constructive. Same data, opposite headline. This is exactly the kind of finding the sensitivity appendix was designed to surface: the primary measurement's top-line conclusion is fragile to a single bucket decision.
v0.2 must explicitly re-decide REPRICING's bucket. The natural candidate is a narrative-direction-aware split: REPRICING with a bullish narrative direction belongs in one bucket; REPRICING with a bearish narrative direction belongs in another. v1 case-study lifecycle data shows REPRICING being used both ways, which is why v0.1 conservatively declined to commit.
The honest win
v1.5 does not validate the current state mapping as predictive. It validates that the field can be measured against forward outcomes, and it identifies exactly where the mapping needs revision.
Three places: (1) REPRICING's bucket (load-bearing for the headline), (2) the Cautious bucket's directional implication on this window (fails the test), and (3) EARLY's contribution to the Constructive bucket (dilutive). v0.2 is now a locked, specific scope of revision — not an open question about whether the field exists.
v1.5 v0.2: Forward Calibration Update
Appended 2026-05-30 under a separately locked v0.2 pre-registration. v0.1 results are preserved unchanged in the section above; v0.2 was run against the same sample, same horizons, same lookahead protocol. Only the bucket mapping and §12 sensitivity choices changed.
v0.2 did not strengthen the original state-bucket claim. It corrected it.
1. What v0.1 appeared to show
Constructive states had modest 20D separation from baseline: +2.97% in average forward relative return. The 5D separation was small (+0.14%), but at 20D the headline looked clean.
2. What v0.2 corrected
v0.2 split REPRICING by actor-level narrative direction (REPRICING_bullish → Constructive, REPRICING_bearish → Cautious) and isolated EARLY into its own standalone bucket. Both changes were responsive to v0.1's own §12.1 sensitivity, which had flagged REPRICING as load-bearing.
The correction landed harder than expected. With REPRICING removed from Ambiguous, the baseline rose from +1.01% to +3.89% at 20D — because the remaining Ambiguous states (MACRO, PRICE-LED, UNCLEAR) all had higher forward returns in this window than REPRICING did. Against the corrected baseline, the v0.2 Constructive Δ moved from +2.97% to −1.22%. §12.1 reverse sensitivity (REPRICING back in Ambiguous, everything else v0.2) reproduces the v0.1 headline almost exactly, confirming REPRICING's bucket assignment was the entire source of the v0.1 advantage.
Stated plainly: v0.1's headline depended on what was in the baseline. Once the baseline was tightened, the state-bucket headline disappeared.
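The baseline-composition effect is easy to reproduce in miniature. In the sketch below every number is a toy value chosen for illustration, not the study's data, and only the baseline side of the v0.2 change is modeled (the real correction also moved REPRICING_bullish rows into Constructive). A large, near-zero state sitting in the baseline holds the baseline mean down; removing it raises the bar and flips the sign of the delta.

```python
from statistics import mean

# Toy 20D forward relative returns in percent, NOT the study's data.
returns_by_state = {
    "CONFIRMED": [4.0, 2.0],
    "REPRICING": [0.0, 0.0, 0.0, 0.0],
    "MACRO": [5.0],
    "PRICE-LED": [7.0],
}


def delta_vs_baseline(bucket_states, baseline_states):
    """Pooled bucket mean minus pooled baseline mean."""
    bucket = [x for s in bucket_states for x in returns_by_state[s]]
    baseline = [x for s in baseline_states for x in returns_by_state[s]]
    return mean(bucket) - mean(baseline)


# REPRICING inside the baseline: its near-zero rows hold the baseline down.
with_repricing = delta_vs_baseline(["CONFIRMED"], ["REPRICING", "MACRO", "PRICE-LED"])

# REPRICING removed: the remaining baseline states set a higher bar.
without_repricing = delta_vs_baseline(["CONFIRMED"], ["MACRO", "PRICE-LED"])
```

The bucket's own returns are identical in both calls; only the comparison set changes.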
3. What survived
Lifecycle revision events preserved their directional separation at 5D, even under the v0.2 tightened baseline:
| 5D measurement axis | Constructive direction | Cautious direction | Internal gap |
|---|---|---|---|
| Static state buckets (v0.2) | −0.02% | +1.46% | −1.48% (wrong) |
| Lifecycle revision events | +0.65% | −0.14% | +0.79% (right) |
The lifecycle layer (L3) preserved a constructive-vs-cautious directional gap that the static state buckets (L2) lost under the tightened baseline. Constructive revisions (reconfirmed + strengthened, n=216) averaged +0.65% forward 5D relative return; Cautious revisions (contradicted + weakened, n=242) averaged −0.14%. Both directions go the right way.
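The internal gap column is the constructive-direction average minus the cautious-direction average. A short check of the arithmetic, using the values from the table (the dictionary keys and the helper name `internal_gap` are illustrative):

```python
# 5D averages in percent, copied from the table above.
measurements = {
    "static_state_buckets_v0.2": {"constructive": -0.02, "cautious": 1.46},
    "lifecycle_revision_events": {"constructive": 0.65, "cautious": -0.14},
}


def internal_gap(constructive, cautious):
    """Constructive minus cautious; positive means the ordering matches the labels."""
    return round(constructive - cautious, 2)


gaps = {name: internal_gap(**values) for name, values in measurements.items()}
right_direction = {name: gap > 0 for name, gap in gaps.items()}
```

Because the gap is a difference between two groups measured against the same baseline, it does not move when the baseline is redefined, which is why it survives the v0.2 tightening.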
4. What this means
Static state labels are brittle to baseline composition. Belief revision events carry cleaner forward-outcome signal on this substrate. That is not a small finding for the architecture: it says the L3 lifecycle layer is doing more work than the L2 state-bucket abstraction, which is exactly where the Belief Stack pattern earns its complexity over a steady-state dashboard.
The v0.2 correction does not invalidate the v1 belief field. It invalidates the simpler bucket-level summary that v0.1 tested. The field still exists, still moves, still tracks warrant. What v0.2 says is that the act of summarizing the field by static state buckets loses information that is present in the lifecycle events themselves.
5. What changes next (v0.3 scope)
- REPRICING becomes its own primary bucket. v0.2's directional split confirmed the v0.1 fragility but produced dilution inside Constructive. v0.3 will report REPRICING's distribution directly rather than folding it into other buckets.
- Lifecycle revision-prediction becomes primary. The static state buckets become a comparison surface, not the headline. v0.3 will lead with the L3 measurement.
- EARLY separates fully. v0.2's standalone bucket hit-rate at 5D and 20D (0.42 and 0.44) sits below chance under Constructive labeling. v0.3 will either drop the directional label or commit to Cautious-like labeling based on what v0.2 showed.
- Cautious states need temporal stratification. DIVERGENCE and NEG_CONFIRMATION continue to show positive forward returns at 20D. v0.3 will add per-sub-window buckets (2025-12 to 2026-02 vs 2026-03 to 2026-05) to test the rising-tide hypothesis.
The line
The useful signal was not “this actor is in a state.” It was “this belief changed under evidence pressure.”
v0.1 surfaced a fragility. v0.2 confirmed the fragility and located the durable signal in a different layer of the stack. That is the iteration discipline working as designed.
v1.5 v0.3: Layered Calibration Update
Appended 2026-05-30 under a separately locked v0.3 pre-registration. v0.1 and v0.2 results are preserved unchanged in the sections above. v0.3 promoted lifecycle revision-prediction to the primary measurement, extracted REPRICING as a standalone unlabeled bucket, demoted static state buckets to secondary, and added per-sub-window stratification.
v0.3 refines the v0.2 thesis. The earlier line — “belief revision carries signal” — was useful but partial. v0.3 shows the signal lives in the relationship between layers.
1. What v0.3 changed
- Lifecycle revision events promoted from secondary (v0.1 / v0.2 §11.1) to primary measurement.
- REPRICING extracted as its own standalone bucket (REPRICING_primary), unlabeled — no Hit / FP / FN assignment, descriptive metrics only.
- Static state buckets demoted to secondary, with Constructive narrowed to CONFIRMED + DISAGREEMENT only (REPRICING entirely out, EARLY still standalone).
- Sub-window stratification added at the calendar midpoint (2026-03-01) to test the v0.2 rising-tide hypothesis on the Cautious failure mode.
- Same sample, same horizons, same lookahead, same labeling protocol. Only the measurement structure changed.
2. What held
Lifecycle revision preserved the 5D internal gap: +0.79% (Constructive_revision +0.65% vs Cautious_revision −0.14%). The v0.2 finding survived promotion to primary scrutiny — Constructive_revision still outperforms Cautious_revision at the 5-day horizon under the same tightened baseline.
The signal does not hold at 20D. Internal gap inverts to −1.07% (Cautious_revision unexpectedly outperforms Constructive_revision over a 20-day window). The lifecycle layer carries information specifically at the short horizon.
3. What was corrected
v0.2 over-corrected by folding REPRICING_bullish into Constructive. The intent — direction-aware REPRICING — was sound; the side effect was that the Constructive bucket got diluted by ~1,000 REPRICING_bullish rows whose forward returns averaged near zero.
v0.3 fixes this by pulling REPRICING out of Constructive entirely. The pure Constructive bucket (CONFIRMED + DISAGREEMENT only) shows +2.35% Δ vs baseline at 20D — close to v0.1's original +2.97% headline. The progression:
| Version | Constructive bucket definition | 20D Δ baseline |
|---|---|---|
| v0.1 | CONFIRMED + EARLY + DISAGREEMENT (REPRICING in Ambiguous) | +2.97% |
| v0.2 | CONFIRMED + DISAGREEMENT + REPRICING_bullish (EARLY isolated) | −1.22% |
| v0.3 | CONFIRMED + DISAGREEMENT only (REPRICING standalone) | +2.35% |
The v0.2 sign-flip was real — the locked rules produced −1.22% — but v0.3 reveals it was specifically about how REPRICING was being handled, not about the constructive signal itself. The cleanest version of the Constructive bucket separates from baseline at 20D.
4. What became visible
Per-sub-window stratification revealed that the v1.5 window contained two materially different market environments:
| Sub-window | Date range | Ambiguous baseline @ 20D | Constructive @ 20D | Constructive Δ window baseline |
|---|---|---|---|---|
| A (early) | 2025-12-05 → 2026-02-28 | +7.43% | +1.00% | −6.43% |
| B (later) | 2026-03-01 → 2026-05-26 | −2.43% | +11.73% | +14.16% |
Sub-window A was a rising-tide background field — everything underperformed the elevated Ambiguous baseline. Sub-window B was a sell-off background field — the Ambiguous baseline turned negative and Constructive separated by +14.16%, the largest constructive signal anywhere in the v1.5 measurement. The Cautious failure mode in v0.1 / v0.2 was almost entirely window-structural, exactly as the v0.2 §9 rising-tide hypothesis predicted.
State buckets are field-context dependent. The window's aggregate numbers blend two field conditions and obscure both. Sub-window-conditioned numbers are sharper.
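The stratification itself is a single date cut, and each per-window delta is the Constructive average minus that window's own Ambiguous baseline. A small sketch using the table's values; the function name `sub_window` is an assumption, while the labels A_early and B_later follow the naming used in the v0.4 section.

```python
from datetime import date

CUT = date(2026, 3, 1)  # calendar midpoint cut used in v0.3


def sub_window(row_date):
    """Assign an evaluation date to the early or later sub-window."""
    return "A_early" if row_date < CUT else "B_later"


# 20D averages in percent, copied from the table above.
table = {
    "A_early": {"baseline": 7.43, "constructive": 1.00},
    "B_later": {"baseline": -2.43, "constructive": 11.73},
}

# Each delta is measured against the baseline of the same sub-window.
deltas = {
    name: round(vals["constructive"] - vals["baseline"], 2)
    for name, vals in table.items()
}
```

Measuring each window against its own baseline is what separates "Constructive names did poorly" from "everything else did unusually well".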
5. What this means architecturally
The useful signal was not in one layer. It was in the relationship between layers: L3 revisions captured short-term evidence pressure, while cleaner L2 states captured longer-horizon structure after REPRICING was removed and the field context was stratified.
L3 (lifecycle revision events) carries short-horizon local signal — useful within ~5 trading days of an evidence revision. L2 (static state buckets) carries longer-horizon structural signal, but only when the bucket representation is clean (REPRICING extracted) and the field context is stratified (sub-window separated).
This is a more precise architectural claim than v0.2's “belief revision carries signal.” That earlier line was useful — it located the durable signal — but it implied a layer hierarchy where L3 > L2. v0.3 corrects: the two layers carry different signals at different horizons, under different cleanliness conditions. The Belief Stack pattern earns its complexity because it keeps all three (evidence, state, lifecycle) addressable simultaneously, not because any single layer dominates.
6. v0.4 scope
- Promote sub-window / field-context conditioning to primary. The unconditioned headline blends two field conditions. v0.4 reports primary numbers conditioned on field context first; aggregate as a comparison surface.
- Test event-type granularity at scale. v0.3 §12.1 showed reconfirmed + contradicted carry the 5D signal; strengthened (n=28) and weakened (n=37) are too small. v0.4 expands the lifecycle universe.
- Add a real field-context detector. The v0.3 calendar-midpoint cut worked but is crude. v0.4 tests a market-volatility or rolling-correlation field-context detector against the calendar cut.
- Keep REPRICING standalone unless row-specific direction is validated. Actor-level direction (v0.2) was insufficient. v0.4 could infer row-specific direction from recent narrative-text or return-relative-to-narrative; otherwise REPRICING stays standalone.
v1.5 v0.4: Field-Context Calibration Update
Appended 2026-05-30 under a separately locked v0.4 pre-registration. v0.1, v0.2, v0.3 results preserved unchanged above. v0.4 promoted field-context-conditioned state buckets to the primary measurement and introduced a realized-volatility partition alongside the v0.3 calendar cut. Calendar and volatility partitions are two ways to approximate the background field condition the labels live inside.
v0.4 showed that state-bucket signal is field-context dependent. Static labels became more informative once the field they lived inside was stratified.
1. What v0.4 tested
Two field-context partitions, side by side, on the same sample: (a) the v0.3 calendar midpoint cut (2026-03-01), and (b) a new realized-volatility partition — 20-day rolling realized vol of QQQ daily returns, threshold = window-median rv, LOW_VOL vs HIGH_VOL. Both are approximations of the background market condition each evaluation row sits inside. State buckets unchanged from v0.3: REPRICING_primary standalone, EARLY isolated, Constructive = CONFIRMED + DISAGREEMENT only.
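The volatility partition can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: it uses population standard deviation without annualization, leaves days without a full 20-day window unlabeled, and sends ties at the threshold to LOW_VOL. The input series is synthetic, not QQQ data.

```python
from statistics import median, pstdev

WINDOW = 20  # trading days, as stated in the text


def rolling_realized_vol(returns, window=WINDOW):
    """Rolling standard deviation of daily returns; None until a full window exists."""
    vols = []
    for i in range(len(returns)):
        if i + 1 < window:
            vols.append(None)
        else:
            vols.append(pstdev(returns[i + 1 - window:i + 1]))
    return vols


def partition_by_vol(vols):
    """Label each day LOW_VOL or HIGH_VOL against the window-median realized vol."""
    defined = [v for v in vols if v is not None]
    threshold = median(defined)
    labels = [
        None if v is None else ("HIGH_VOL" if v > threshold else "LOW_VOL")
        for v in vols
    ]
    return labels, threshold


# Synthetic daily returns, NOT QQQ data: a calm stretch, then a turbulent one.
calm = [0.001 if i % 2 == 0 else -0.001 for i in range(40)]
turbulent = [0.02 if i % 2 == 0 else -0.02 for i in range(40)]
labels, threshold = partition_by_vol(rolling_realized_vol(calm + turbulent))
```

A median threshold splits the labeled days roughly in half by construction, so the partition says which half of the window was more turbulent, not whether either half was turbulent in absolute terms.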
2. What changed
The Constructive 20D signal becomes much sharper once conditioned on the background field condition. The same state label looks very different depending on what the rest of the field is doing at the time:
| View | Constructive 20D Δ baseline |
|---|---|
| v0.1 unconditioned | +2.97% |
| v0.2 over-corrected | −1.22% |
| v0.3 clean bucket | +2.35% |
| v0.4 calendar B_later | +14.16% |
| v0.4 vol HIGH_VOL (p50 threshold) | +7.95% |
| v0.4 vol HIGH_VOL (p67 threshold) | +12.70% |
The unconditioned v0.3 +2.35% becomes +14.16% under the calendar B_later partition, roughly six times the unconditioned magnitude. The unconditioned aggregate was blending two very different background field conditions and losing most of the signal in the average.
The mechanism is direct. Sub-window A had a strong background field: Ambiguous names were +7.43% at 20D, so even decent Constructive names looked weak by comparison. Sub-window B had a weak/negative background field: Ambiguous names were −2.43% at 20D, and Constructive names separated sharply against that background by +14.16%. The label didn't change; the field it was being measured inside did.
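The blending effect can be illustrated with a small sketch. It assumes that "Δ baseline" means the mean forward return of the Constructive bucket minus the mean forward return of the Ambiguous bucket within the same partition; the text does not spell out the formula, so treat that as an assumption. The rows are synthetic toy values chosen only to show the mechanism, not data from the study, and `delta_vs_baseline` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def delta_vs_baseline(rows, bucket="Constructive", baseline="Ambiguous"):
    """Per-partition mean return of `bucket` minus mean return of `baseline`.

    rows: iterable of (partition, state_bucket, forward_return) tuples.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for partition, state, ret in rows:
        grouped[partition][state].append(ret)
    return {
        p: mean(states[bucket]) - mean(states[baseline])
        for p, states in grouped.items()
        if states.get(bucket) and states.get(baseline)
    }

# Synthetic toy rows: strong background in A_early, weak background in B_later.
rows = [
    ("A_early", "Constructive", 0.02), ("A_early", "Constructive", 0.00),
    ("A_early", "Ambiguous", 0.08), ("A_early", "Ambiguous", 0.06),
    ("B_later", "Constructive", 0.12), ("B_later", "Constructive", 0.10),
    ("B_later", "Ambiguous", -0.02), ("B_later", "Ambiguous", -0.04),
]

by_partition = delta_vs_baseline(rows)
# Pooling ignores the partition, so both field conditions are averaged together.
pooled = delta_vs_baseline([("ALL", s, r) for _, s, r in rows])["ALL"]
print(by_partition, pooled)
```

In this toy sample the per-partition deltas are about -0.06 and +0.14, while the pooled delta is about +0.04: the same labels, measured against a blended background.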
3. What the partitions agreed on
Both partitions agree directionally about which field condition favors Constructive separation. The stressed background (calendar B_later or volatility HIGH_VOL) produces positive Constructive Δ; the calm background (A_early or LOW_VOL) produces negative Constructive Δ. Sign of the Δ is preserved across partitions at both horizons.
Volatility-threshold sensitivity is monotonic: tightening the threshold from p33 → p50 → p67 sharpens HIGH_VOL Constructive Δ from +6.56% → +7.95% → +12.70%. The more extreme the background turbulence, the cleaner the Constructive separation.
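A threshold sweep of this kind might look like the sketch below. It assumes the percentile is taken over all defined realized-vol values with inclusive interpolation and that HIGH_VOL means strictly above the threshold; neither detail is stated in the text. The helper names and the evenly spaced rv values are hypothetical.

```python
import statistics

def percentile_threshold(values, pct):
    """Return the pct-th percentile (1..99) using inclusive interpolation."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[pct - 1]

def high_vol_share(values, pct):
    """Fraction of observations strictly above the pct-th percentile."""
    threshold = percentile_threshold(values, pct)
    return sum(v > threshold for v in values) / len(values)

# Synthetic rv values (not study data): 100 evenly spaced readings.
rv = [i / 1000 for i in range(1, 101)]
shares = {pct: high_vol_share(rv, pct) for pct in (33, 50, 67)}
print(shares)
```

Raising the percentile shrinks the HIGH_VOL subset to the most turbulent days, so a sharper Δ at p67 comes with a smaller sample and should be read with that in mind.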
4. What they disagreed on
The two partitions classify about 37% of evaluation rows differently. Calendar A_early ∩ Volatility HIGH_VOL contains 434 rows; calendar B_later ∩ Volatility LOW_VOL contains 713 rows. Most of B_later is HIGH_VOL (1,147 rows) and most of A_early with a defined volatility label is LOW_VOL (837 rows), but the cross-cells are substantial.
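The disagreement rate can be checked directly from the four cell counts reported above. The sketch assumes those four cells cover every evaluation row that has both a calendar and a volatility label, and that the "stressed" side is B_later on the calendar cut and HIGH_VOL on the volatility cut.

```python
# Row counts for the four calendar x volatility cells, as reported in the text.
cells = {
    ("A_early", "LOW_VOL"): 837,
    ("A_early", "HIGH_VOL"): 434,
    ("B_later", "LOW_VOL"): 713,
    ("B_later", "HIGH_VOL"): 1147,
}

# Assumption: these labels mark the stressed background under each detector.
stressed = {"B_later", "HIGH_VOL"}

# A row is classified differently when one detector says stressed and the other says calm.
disagree = sum(
    n for (cal, vol), n in cells.items()
    if (cal in stressed) != (vol in stressed)
)
total = sum(cells.values())
rate = disagree / total
print(f"{disagree} of {total} rows disagree ({rate:.1%})")
```

Under those assumptions the cross-cells sum to 1,147 of 3,131 rows, about 36.6%, which matches the stated "about 37%".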
The disagreement is informative, not a defect:
- Calendar is sharper. The 2026-03-01 cut produces the largest single Constructive Δ in the entire v1.5 measurement (+14.16%). The transition between field conditions was sharper than the 20-day-window vol partition picked up.
- Volatility is more granular. It partitions within each calendar half: A_early contains both calm and turbulent sub-periods, and so does B_later. The vol partition exposes background-field structure that the calendar cut flattens.
- Neither obsoletes the other. They approximate adjacent things. Calendar captures the transition between field conditions; volatility captures continuous background intensity within and across the transition.
5. What persisted
The v0.2/v0.3 lifecycle 5D internal gap (+0.79% aggregate) survives field-context conditioning in both partitions under both detectors. Under the volatility detector, LOW_VOL gap = +0.88% and HIGH_VOL gap = +0.50%. The HIGH_VOL gap is smaller than the aggregate, but the direction is preserved in both. The lifecycle layer continues to carry short-horizon information regardless of the background field condition.
The 20D lifecycle inversion v0.3 surfaced also persists in both partitions — the lifecycle signal is specifically a 5-day phenomenon regardless of which background partition it sits inside.
6. What this means architecturally
v0.4 sharpens the v0.3 thesis. L2 state buckets are not self-sufficient — their forward-outcome content depends on the background field condition they sit inside. The same Constructive label carries +14.16% Δ baseline in B_later and −6.43% Δ baseline in A_early. Without conditioning, the average blends both and looks like +2.35% — most of the real separation is hidden inside the aggregate.
L3 lifecycle revisions, by contrast, carry short-horizon information that is roughly field-context-invariant in direction. In aggregate, a reconfirmed expectation shows about 79 bps higher 5D forward relative return than a contradicted one, and the gap stays positive in both calm (+0.88%) and turbulent (+0.50%) background fields.
The two layers do different work. L3 captures local evidence-pressure dynamics on a short horizon. L2 captures structural relationships that only become legible after the field is stratified. The Belief Stack pattern earns its complexity not because one layer dominates but because the layers carry orthogonal information at different horizons.
v0.4 does not turn the system into a prediction engine. It shows that the belief field needs its own field context. A state label is not self-sufficient; its meaning depends on the background field condition, the warrant behind it, and whether it is static or recently revised.
Limits and next validation
This is v1. The scope is descriptive structure and behavior, not effectiveness or alpha. Several limits are explicit:
- Descriptive, not predictive. v1 evaluates whether the system maintains a belief field with the architectural properties the Belief Stack spec claims. It does not evaluate whether the system's predictions are better than baseline.
- v1.5 v0.1 through v0.4 have now been run. Forward validation of state-prediction under locked pre-registrations is reported in the four sections above. v0.4 confirms that field-context conditioning sharpens the L2 state-bucket signal substantially (Constructive 20D Δ baseline goes from +2.35% unconditioned to +14.16% under the calendar B_later partition), and that the L3 lifecycle 5D directional gap is preserved under both calendar and volatility partitions. The two layers carry orthogonal information at different horizons.
- L4 partially exercised through v1.5 v0.1–v0.4 forward-calibration updates. Forward calibration is evaluated for state buckets and lifecycle revision events under four locked pre-registrations, with TP / FP / FN / TN accounting, sub-window stratification, and field-context conditioning under both calendar and realized-volatility partitions. Broader L4 validation — longer windows, additional field-context detectors, substrates beyond this one — remains future work. The 95.9% sufficient-data rate cited in §3 is L1 warrant coverage, not L4 calibration.
- Single substrate. All numbers come from one user's daily TopicSpace pipeline over 173 days. Generalization across substrates (different markets, different domains, different team configurations) remains to be demonstrated.
- LLM-judge dependencies. Claim generation, narrative scoring, and several lifecycle decisions invoke an LLM judge. These judges carry their own biases and have not been independently validated against human review. The same warrant-coverage discipline applied to the substrate has not yet been applied to the judges themselves.
v1.5 narrowed the validation question to state-prediction; v0.1 through v0.4 are reported above. v0.5 will remove the small look-ahead in the volatility threshold (rolling / expanding-window threshold instead of full-window median), test multi-decile field-condition reporting instead of binary LOW/HIGH cuts, add a trend-conditional field-context detector alongside volatility, and investigate the lifecycle 20D inversion at intermediate horizons (10D, 15D). v2 broadens further: earlier detection, better prioritization, forward validation across longer windows. Each depends on the previous being honest about its own scope.
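One way the look-ahead could be removed is an expanding-window threshold, sketched below. This is an illustration of the general technique, not the planned v0.5 implementation: it assumes each day is compared against the median of realized-vol values observed strictly before it, and that days with too little history stay unlabeled. The function name and the `min_history` parameter are hypothetical.

```python
import statistics

def expanding_median_labels(rv, min_history=20):
    """Label each day using only rv values observed strictly before it."""
    labels = []
    history = []
    for v in rv:
        if v is None:
            # Undefined rv (window not yet full): no label, nothing added to history.
            labels.append(None)
            continue
        if len(history) < min_history:
            labels.append(None)
        else:
            threshold = statistics.median(history)
            labels.append("HIGH_VOL" if v > threshold else "LOW_VOL")
        history.append(v)
    return labels

# Synthetic rv series (not study data): calm readings, then turbulent ones.
rv = [0.01] * 5 + [0.03] * 5
labels = expanding_median_labels(rv, min_history=3)

# Changing only future values must leave earlier labels untouched.
rv_alt = rv[:6] + [0.5] * 4
labels_alt = expanding_median_labels(rv_alt, min_history=3)
print(labels)
```

The design trade-off is that early rows lose their label while history accumulates, in exchange for labels that never depend on later observations.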
Related research
- Belief Stack v0.1 specification — the architectural pattern this case study instantiates.
- Watching an Assistant Forget: A TKOS Log-Replay Case Study — sibling case study on a typed operational substrate (Claude session logs), with preregistered v0.1 / v0.2 head-to-head TP/FP/FN/TN accounting.
- The LLM That Forgot Time — the failure-mode anchor for runtime belief lifecycle.
- Region-based evaluation — sibling empirical anchor on conversation-logs substrate.
- A pattern for problems where beliefs must evolve — the architectural framing this case study tests.