Method note

Methods

What the topicspace pipeline actually computes today — written from a literal reading of the code, not from the architecture essay. Designed to be auditable: every claim references the file that produces it.

topicspace research · May 15, 2026 · 12 min read

Honesty note. The stacked-fields essay describes higher-order ambitions (belief revision over time, performance feedback into expectation). This page is narrower: it documents the deterministic parts of the pipeline that are operational today. Where the essay language is metaphorical or aspirational, this page says so.

Signal ingestion

The pipeline ingests events from five sources into a single normalized stream. Each event becomes an Event object with event_id (SHA256 of source + timestamp + title + URL), timestamp, source, title, text, URL, source reliability (0.0–1.0), linked actors, and tags.

Finnhub company news — pulled per-ticker for a hardcoded universe of 37 actors.
NewsAPI — 41 query strings, each mapped to one or more tracked actors.
Reddit — AI overlay subreddits, source reliability 0.50.
X / Twitter — curated accounts plus keyword queries (currently disabled by plan tier).
Amplification (Yahoo Finance, MarketWatch) — breadth signal only, source reliability 0.30.

Title-based tiered dedup runs after ingest: primary/validation news outranks amplification on the same headline. There is no semantic or embedding-level filtering here. An event is kept if it survives the dedup pass and has at least one actor link.

source · scripts/fetch_today.py, scripts/fetch_backfill_new_actors.py
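The event ID and the tiered title dedup described above can be sketched as follows. This is a minimal illustration, not the code in scripts/fetch_today.py: the Event fields shown, the "|" separator fed to the hash, the tier field, and the TIER_RANK ordering are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical tier ranks: a lower number wins when two events share a headline.
TIER_RANK = {"primary": 0, "validation": 1, "amplification": 2}

@dataclass(frozen=True)
class Event:
    source: str
    timestamp: str      # ISO-8601 string (assumed representation)
    title: str
    url: str
    tier: str           # assumed field; the page only names the tiers
    actors: tuple = ()  # linked actor tickers

    @property
    def event_id(self) -> str:
        # SHA-256 over source + timestamp + title + URL; the separator is an assumption.
        raw = "|".join([self.source, self.timestamp, self.title, self.url])
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def dedup_by_title(events):
    """Keep one event per normalized title, preferring the better-ranked tier."""
    best = {}
    for ev in events:
        key = " ".join(ev.title.lower().split())  # case- and whitespace-insensitive
        kept = best.get(key)
        if kept is None or TIER_RANK[ev.tier] < TIER_RANK[kept.tier]:
            best[key] = ev
    # An event survives only if it also has at least one actor link.
    return [ev for ev in best.values() if ev.actors]
```

Because the key is the normalized title only, two different stories with the same headline collapse into one, which is consistent with the statement that no semantic filtering happens at this stage.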

Actor linking

Event-to-actor linking is case-insensitive substring matching against a hand-curated alias map. An event is linked to actor X if any alias for X appears anywhere in the lowercased title + text. A single event can link to multiple actors.

# src/config.py
ACTOR_ALIASES = {
  "NVDA": ["nvidia", "nvda", "jensen huang"],
  "AMD":  ["amd", "advanced micro devices", "lisa su"],
  "ARM":  ["arm holdings", "arm chips", "arm architecture"],
  ...
}

This is shallow on purpose — and it has known failure modes. Substring matching can produce false positives ("arm" inside "alarm", "farm"); there is no word-boundary check, no disambiguation for ambiguous tokens, and no fuzzy or embedding-based matching. Reducing false positives at this layer is on the roadmap.

source · src/actors.py, src/config.py
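To make the failure mode concrete, here is a small sketch that contrasts plain substring matching with a word-boundary variant. The alias map is the excerpt shown above; both function names are hypothetical, and the word-boundary version illustrates one possible mitigation, not something the pipeline does today.

```python
import re

ACTOR_ALIASES = {
    "NVDA": ["nvidia", "nvda", "jensen huang"],
    "AMD": ["amd", "advanced micro devices", "lisa su"],
    "ARM": ["arm holdings", "arm chips", "arm architecture"],
}

def link_actors_substring(title, text, aliases=ACTOR_ALIASES):
    """Current behavior as described: any alias anywhere in lowercased title + text."""
    haystack = f"{title} {text}".lower()
    return sorted(t for t, names in aliases.items()
                  if any(name in haystack for name in names))

def link_actors_word_boundary(title, text, aliases=ACTOR_ALIASES):
    """Hypothetical stricter variant: the alias must start and end on a word boundary."""
    haystack = f"{title} {text}".lower()
    return sorted(
        t for t, names in aliases.items()
        if any(re.search(rf"\b{re.escape(name)}\b", haystack) for name in names)
    )

headline = "Camden farm holdings report"
print(link_actors_substring(headline, ""))      # ['AMD', 'ARM']
print(link_actors_word_boundary(headline, ""))  # []
```

"Camden" contains "amd" and "farm holdings" contains "arm holdings", so the substring matcher links two actors to a headline about neither. A word-boundary check removes this class of error but does nothing for genuinely ambiguous tokens.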

Narrative pressure (narr)

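For each (date, ticker), narrative pressure narr is an integer capped at 95, derived from two counts: events in the last 7 days (volume) and events in the last 48 hours (recency burst). The nominal range is 0–95; under the baseline config below, the floor of 35 makes the effective range 35–95.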

# scripts/build_backtest_history.py
pressure = min(1.0, events_7d / pressure_baseline)
recency  = min(events_48h, recency_cap) / recency_cap
raw      = floor + pressure * pressure_weight + recency * recency_weight
narr     = min(95, round(raw))

# baseline config
floor             = 35
pressure_weight   = 60
recency_weight    = 15
pressure_baseline = 30   # 7-day count that saturates pressure
recency_cap       = 250  # 48h count that saturates recency

narr is event-frequency only. There is no NLP, no embedding analysis, no semantic clustering, and no thematic extraction in this score. "Narrative pressure building" in the product means literal event count rising, not meaning extraction. A storm of repetitive coverage of the same item will move narr the same way as a storm of substantive new coverage.

source · scripts/build_backtest_history.py:97–101
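The formula above, wrapped as a runnable function with the baseline config as defaults. The function name and signature are assumptions for illustration; the arithmetic is copied from the excerpt.

```python
def narrative_pressure(events_7d, events_48h, floor=35, pressure_weight=60,
                       recency_weight=15, pressure_baseline=30, recency_cap=250):
    """Event-count pressure score; defaults mirror the baseline config."""
    pressure = min(1.0, events_7d / pressure_baseline)      # saturates at 30 events / 7d
    recency = min(events_48h, recency_cap) / recency_cap    # saturates at 250 events / 48h
    raw = floor + pressure * pressure_weight + recency * recency_weight
    return min(95, round(raw))

print(narrative_pressure(0, 0))     # 35: the floor, with no events at all
print(narrative_pressure(15, 5))    # 65: half-saturated volume, small burst
print(narrative_pressure(30, 0))    # 95: volume alone reaches the cap
```

With these defaults the 7-day volume term alone can reach the cap of 95 (35 + 60), so once an actor has 30 or more events in the trailing week the recency term no longer changes the score.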

Narrative dislocation score (NDS)

NDS is the signed gap between narrative and price. The literal formula:

NDS = direction * (narr - 50) - rel * 5

where:

direction is +1 for bullish-narrative actors (default) and −1 for a hand-curated bearish list (MU, TSLA, SNOW, CRWV, CRM, INTC). This list is editorial, not corpus-derived.
narr is the pressure score above (0–95).
rel is the trailing 5-day relative return vs QQQ in percentage points (see §05).

Interpretation: positive NDS = narrative ahead of price (story stronger than price action). Negative NDS = price ahead of narrative (price stronger than story warrants). The direction flip for bearish actors means a high narr translates to a negative NDS contribution — the bearish story is being told but price hasn’t paid.

source · scripts/build_backtest_history.py:129
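A worked example of the formula, using the bearish list given above. The function name and the input values are illustrative only.

```python
# Hand-curated bearish-narrative list, as stated in the text above.
BEARISH = {"MU", "TSLA", "SNOW", "CRWV", "CRM", "INTC"}

def narrative_dislocation(ticker, narr, rel):
    """NDS = direction * (narr - 50) - rel * 5, with rel in percentage points."""
    direction = -1 if ticker in BEARISH else 1
    return direction * (narr - 50) - rel * 5

# Same inputs, opposite narrative direction:
print(narrative_dislocation("NVDA", 80, -2.0))  # 40.0  -> narrative ahead of price
print(narrative_dislocation("INTC", 80, -2.0))  # -20.0 -> direction flip dominates
```

For the bullish actor, a strong story (narr = 80) with a lagging price (rel = −2) yields a positive NDS of 40. For the bearish actor, the same inputs flip the narrative term to −30, and the +10 from the lagging price only partly offsets it.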

Relative return (rel, rel_5d)

The price counterpart to narr. Computed for each (date, ticker):

stock_return     = (close[t]   - close[t-5])   / close[t-5]
benchmark_return = (qqq[t]     - qqq[t-5])     / qqq[t-5]
rel              = (stock_return - benchmark_return) * 100   # percentage points

Window is fixed at a trailing 5 calendar days globally (t−5 to t, as in the formula above). Benchmark is hardcoded as QQQ (Nasdaq-100). No sector-specific benchmark, no adaptive horizon, no intraday granularity: daily closes via yfinance.

source · src/price_analysis.py:76–115
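A minimal sketch of the calculation on plain lists of daily closes. It indexes by position, so "5" here means five rows back in the series; how src/price_analysis.py maps calendar days onto available closes is not shown on this page and is not modeled here. The function name and sample prices are hypothetical.

```python
def relative_return(closes, benchmark, t, window=5):
    """Trailing relative return vs the benchmark, in percentage points."""
    if t < window:
        raise ValueError("not enough history for the trailing window")
    stock_return = (closes[t] - closes[t - window]) / closes[t - window]
    benchmark_return = (benchmark[t] - benchmark[t - window]) / benchmark[t - window]
    return (stock_return - benchmark_return) * 100

stock = [100.0, 101.0, 102.0, 103.0, 104.0, 110.0]
qqq = [400.0, 401.0, 402.0, 403.0, 404.0, 408.0]
rel = relative_return(stock, qqq, t=5)   # stock +10%, QQQ +2% -> about +8pp
```

The result is a difference of returns, not a ratio, which is why rel can be multiplied by a constant and subtracted directly in the NDS formula.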

State assignment

Each (date, ticker) is assigned one of nine states by a memoryless decision tree over (direction, narr, rel). The full logic:

# scripts/build_backtest_history.py:104-125

if direction < 0:                                  # bearish-narrative actor
    state = "DISAGREEMENT"    if rel > 2.0  else "NEG_CONFIRMATION"

else:                                              # bullish-narrative (default)
    if narr >= 65 and rel >= 5.0:                  state = "CONFIRMED"
    elif narr >= 55 and rel >= 1.5:                state = "EARLY"
    elif narr >= 45 and rel < -5.0:                state = "DIVERGENCE"
    elif narr >= 45 and -5.0 <= rel < 1.5:         state = "REPRICING"
    elif rel < -6.0:                               state = "DIVERGENCE"
    else:
        price_score = max(0, min(100, 50 + rel * 5))
        nds = narr - price_score
        if rel > 2.0 and narr < 60 and nds < -20:  state = "PRICE-LED"
        elif narr < 40:                            state = "UNCLEAR"
        else:                                      state = "MACRO"

The state machine is memoryless. There is no hysteresis, no minimum dwell time, no confirmation requirement. A ticker can flip CONFIRMED → DIVERGENCE → REPRICING on consecutive days if (narr, rel) crosses the thresholds. Stability is something users see through the leaderboard’s daily refresh; it is not a property the engine enforces.

source · scripts/build_backtest_history.py:104–125
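The same tree wrapped as a function makes the memoryless behavior easy to demonstrate. The function name is an assumption; the thresholds are copied from the excerpt above, and the three daily inputs are invented for illustration.

```python
def assign_state(direction, narr, rel):
    """Memoryless state assignment; depends only on today's (direction, narr, rel)."""
    if direction < 0:  # bearish-narrative actor
        return "DISAGREEMENT" if rel > 2.0 else "NEG_CONFIRMATION"
    if narr >= 65 and rel >= 5.0:
        return "CONFIRMED"
    if narr >= 55 and rel >= 1.5:
        return "EARLY"
    if narr >= 45 and rel < -5.0:
        return "DIVERGENCE"
    if narr >= 45 and -5.0 <= rel < 1.5:
        return "REPRICING"
    if rel < -6.0:
        return "DIVERGENCE"
    price_score = max(0, min(100, 50 + rel * 5))
    nds = narr - price_score
    if rel > 2.0 and narr < 60 and nds < -20:
        return "PRICE-LED"
    return "UNCLEAR" if narr < 40 else "MACRO"

# Three consecutive days with constant narr and a swinging rel:
days = [(1, 70, 5.2), (1, 70, -5.5), (1, 70, -1.0)]
print([assign_state(*d) for d in days])
# ['CONFIRMED', 'DIVERGENCE', 'REPRICING']
```

Nothing in the function sees yesterday's state, so the sequence above is produced purely by rel crossing the 5.0 and −5.0 thresholds.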

Forward expectation generation

For each actor, an LLM produces a structured forward expectation. The prompt context includes:

FIELD CONTEXT: a 2–3 sentence auto-summary of cross-actor patterns (hardware rotation, divergence clusters).
ACTOR-SPECIFIC ANCHORS: hand-curated 1–3 sentence blocks per major ticker.
CURRENT STATE: state, read, NDS, rel, days_in_state, prior_state.
30-DAY TRAJECTORY: rolling history of state, narr, NDS, rel.
PEER CONTEXT: top 8 same-sector actors sorted by |NDS|.

Output is a JSON object with: ticker, headline, direction, conviction (0.0–1.0), near_term_view, medium_term_view, fork_conditions, asymmetric_risk (bull/bear), themes, horizon, watching.

model       = gpt-4o-2024-08-06
max_tokens  = 1500
temperature = 0.6
mode        = json

methodology note

Forward expectations are not themselves backtested. The deterministic state engine (§08) is what produces reliability scores. The quality of LLM-generated expectations is currently editorial — a feedback loop to score and filter them is on the roadmap.

source · scripts/generate_actor_expectations.py
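Since the model output is free-form JSON, a structural check before storage is a natural guard. The sketch below is an assumption about how such a check could look, not a description of scripts/generate_actor_expectations.py. The key list comes from the output description above; treating asymmetric_risk as an object with bull and bear entries is a guess at its shape.

```python
import json

REQUIRED_KEYS = {
    "ticker", "headline", "direction", "conviction", "near_term_view",
    "medium_term_view", "fork_conditions", "asymmetric_risk", "themes",
    "horizon", "watching",
}

def validate_expectation(raw_json):
    """Return a list of problems; an empty list means the object looks well-formed."""
    obj = json.loads(raw_json)
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    conviction = obj.get("conviction")
    is_number = isinstance(conviction, (int, float)) and not isinstance(conviction, bool)
    if not is_number or not 0.0 <= conviction <= 1.0:
        problems.append("conviction must be a number in [0.0, 1.0]")
    risk = obj.get("asymmetric_risk")
    # Assumed shape: {"bull": ..., "bear": ...}
    if not (isinstance(risk, dict) and {"bull", "bear"} <= risk.keys()):
        problems.append("asymmetric_risk must contain bull and bear entries")
    return problems
```

A check like this only confirms shape. It says nothing about whether the expectation is any good, which remains an editorial judgment as noted above.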

Reliability scoring

The deterministic engine (state + narrative direction → predicted direction) is backtested per actor against forward relative returns.

# scripts/backtest_actor_expectations.py

# For each (date, ticker) in backtest_history.parquet:
predicted = engine(state, narr_direction)    # +1, -1, or 0

# Look forward at 5, 10, 20 trading days
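stock_fwd   = (price[t+h] - price[t]) / price[t]
bench_fwd   = (bench[t+h] - bench[t]) / bench[t]         # benchmark forward return
forward_rel = (stock_fwd - bench_fwd) * 100              # percentage points vs benchmark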
hit = (predicted > 0 and forward_rel > +0.5) \
   or (predicted < 0 and forward_rel < -0.5)             # deadband ±0.5%

# Per-actor verdicts (over 6 months of history):
n_directional  >= 30  AND best hit_rate >= 60%  →  engine_reliable
n_directional  >= 30  AND worst hit_rate <= 40% →  engine_inverted
n_directional  >= 30  AND 40% < hit_rate < 60%  →  engine_unreliable
n_directional  <  30                            →  insufficient

engine_inverted means the engine is systematically wrong on that actor — predictions can be useful when flipped. The actor page does that flip automatically before showing the forward view. engine_unreliable means the engine oscillates near coin-flip and predictions should not be acted on literally.
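The hit rule and the per-actor verdicts can be sketched as two small functions. The names are hypothetical, and one detail is an assumption: the excerpt does not say what happens when an actor has a best-horizon hit rate of at least 60% and a worst-horizon hit rate of at most 40% at the same time, so this sketch checks engine_reliable first.

```python
def is_hit(predicted, forward_rel, deadband=0.5):
    """A directional prediction hits only if the move clears the deadband (pp)."""
    return (predicted > 0 and forward_rel > deadband) or \
           (predicted < 0 and forward_rel < -deadband)

def verdict(n_directional, hit_rates, min_n=30):
    """hit_rates: fraction of hits per horizon, e.g. {5: 0.62, 10: 0.58, 20: 0.64}."""
    if n_directional < min_n:
        return "insufficient"
    if max(hit_rates.values()) >= 0.60:   # best horizon (checked first: assumption)
        return "engine_reliable"
    if min(hit_rates.values()) <= 0.40:   # worst horizon
        return "engine_inverted"
    return "engine_unreliable"

print(verdict(45, {5: 0.62, 10: 0.58, 20: 0.64}))  # engine_reliable
print(verdict(45, {5: 0.38, 10: 0.44, 20: 0.41}))  # engine_inverted
print(verdict(12, {5: 0.90, 10: 0.90, 20: 0.90}))  # insufficient
```

Note that a forward move inside the ±0.5pp deadband counts as a miss for any directional prediction, so hit rates for the two directions do not sum to 100%.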

methodology note

The backtest uses the full available price history (~6 months). The aggregate numbers below are in-sample — reliability classifications are derived from the same window they’re scored on. A held-out walk-forward result follows further down (§08b) so the in-sample caveat is no longer load-bearing.

The deterministic engine, in two variants (raw engine on all actors and the trusted-filtered engine), is now scored against four baselines (price momentum, lagged relative strength, news volume, and random) on the same backtest window. The honest summary:

STRATEGY              5d    10d   20d   avg excess @ 20d
topicspace_trusted    62%   61%   64%   +1.97pp
topicspace_engine     46%   47%   50%   +0.75pp
price_momentum        48%   50%   53%   +0.55pp
relative_strength     51%   53%   51%   −0.07pp
news_volume           47%   47%   50%   +0.16pp
random                49%   50%   51%   +0.21pp

The raw engine (all actors) is near coin-flip. The trusted engine (restricted to engine_reliable actors, with predictions flipped for engine_inverted) hits 61–64% across horizons. The edge lives in the reliability filter, not in the deterministic state engine on its own. That’s the strongest single argument for why the actor-page badges exist.

The aggregate number hides where the edge actually lives. Slicing the backtest at the 20-day horizon by reliability class:

RELIABILITY CLASS               N     HIT   AVG EXCESS @ 20d
trusted (engine_reliable)       577   65%   +3.37pp
contrarian (flipped)            894   63%   +1.07pp
uncertain (engine_unreliable)   541   52%   +0.89pp
new (insufficient)              77    65%   +1.25pp

The raw engine hits 37% on contrarian actors (engine_inverted) — clean evidence the inversion is real. Flipping recovers the 63%. Uncertain actors land near coin-flip with weak excess, which is exactly what the “No clear edge” label is meant to communicate.

And by current state, for the trusted strategy at 20 days:

STATE             N     HIT   AVG EXCESS @ 20d
CONFIRMED         222   72%   +4.74pp
EARLY             281   71%   +2.96pp
REPRICING         699   62%   +1.21pp
NEG_CONFIRMATION  114   60%   +2.37pp
DISAGREEMENT      155   51%   −0.66pp

CONFIRMED and EARLY are where the engine actually makes money. REPRICING and NEG_CONFIRMATION are positive but smaller. DISAGREEMENT shows no edge — the engine’s “positive read on a bearish-narrative actor” logic doesn’t pay over the backtest window. That’s an honest finding worth surfacing on the actor pages as future work (state-specific confidence weighting).

source · scripts/backtest_actor_expectations.py, scripts/baseline_comparison.py
outputs · data/derived/baseline_comparison_summary.md, baseline_comparison_by_reliability.csv, baseline_comparison_by_state.csv

08b

Walk-forward validation: the honest version

The aggregate 61–64% trusted hit rate above is computed in-sample. To check whether the reliability filter generalizes, the backtest window is split into train (first 70% of dates) and test (last 30%). Reliability classifications are derived from the train window only and then used to filter predictions on the held-out test window.

STRATEGY              5d    10d   20d   avg excess @ 20d
topicspace_trusted    55%   50%   55%   +0.39pp  (in-sample: 64%, +1.97pp)
topicspace_engine     49%   53%   49%   +1.09pp
price_momentum        49%   52%   54%   +1.14pp
relative_strength     54%   52%   50%   −0.04pp
news_volume           48%   50%   44%   −0.50pp
random                51%   50%   48%   −0.29pp

Honest read. The in-sample 64% number was substantially curve-fit to the same window’s data. Out-of-sample, the trusted variant retains a small residual edge (55% at 20d vs 48% random; +0.39pp excess vs −0.29pp random) but it is much weaker than the headline. Train tickers classified as: engine_inverted=20, engine_reliable=8, engine_unreliable=4 — quite different from the full-window distribution.
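The 70/30 procedure can be sketched as follows: split the dates chronologically, derive per-ticker labels from train rows only, and return the untouched test rows for scoring. The row layout (dicts with date, ticker, hit) and the classify callable are hypothetical stand-ins for whatever scripts/walkforward_validation.py uses.

```python
def chronological_split(dates, train_pct=70):
    """Split unique dates into (train, test) by time order; integer math avoids float drift."""
    ordered = sorted(set(dates))
    cut = len(ordered) * train_pct // 100
    return ordered[:cut], ordered[cut:]

def walk_forward_filter(rows, classify, train_pct=70):
    """Derive labels from the train window only, then hand back the test rows."""
    train_dates, _ = chronological_split([r["date"] for r in rows], train_pct)
    train_set = set(train_dates)
    hits_by_ticker = {}
    for r in rows:
        if r["date"] in train_set:               # test rows never influence labels
            hits_by_ticker.setdefault(r["ticker"], []).append(r["hit"])
    labels = {t: classify(h) for t, h in hits_by_ticker.items()}
    test_rows = [r for r in rows if r["date"] not in train_set]
    return labels, test_rows
```

The important property is that a ticker's label is fixed before any test-window outcome is seen. The in-sample numbers earlier on this page do not have that property, which is why they come out higher.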

The single 70/30 split could have been an unlucky draw. To check, an expanding-window rolling walk-forward was run with two non-overlapping ~20-day test windows (the most this ~6-month corpus supports). For each test window, reliability is derived from all prior history. Result:

ROLLING (POOLED ACROSS BOTH WINDOWS)
STRATEGY              5d    10d   20d   avg excess @ 20d
topicspace_trusted    53%   52%   52%   −0.27pp
topicspace_engine     50%   51%   53%   +1.41pp
price_momentum        48%   49%   54%   +1.20pp
random                50%   52%   49%   −0.07pp

PER-WINDOW (trusted, 20d):  W1=55%, W2=48%, range=7pp

The weak out-of-sample result holds across both windows. Pooled OOS, the trusted variant lands at 52% at 20d, essentially random, with average excess of −0.27pp. The raw engine (no reliability filter, no contrarian flip) actually has a higher avg excess at 20d (+1.41pp). The trust filter is not earning its keep out-of-sample.
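An expanding-window scheme of the kind described above can be generated like this. The function is a sketch under assumptions: it places the test windows back to back at the end of the date range and measures them in rows of the date list, which may differ from how scripts/walkforward_rolling.py lays out its folds.

```python
def expanding_folds(dates, n_folds=2, test_len=20):
    """Yield (train_dates, test_dates) pairs with non-overlapping test windows.

    Each fold trains on all dates strictly before its test window.
    """
    ordered = sorted(set(dates))
    first_test = len(ordered) - n_folds * test_len
    if first_test <= 0:
        raise ValueError("not enough dates for the requested folds")
    folds = []
    for k in range(n_folds):
        start = first_test + k * test_len
        folds.append((ordered[:start], ordered[start:start + test_len]))
    return folds

folds = expanding_folds(list(range(100)))
print([(len(train), test[0], test[-1]) for train, test in folds])
# [(60, 60, 79), (80, 80, 99)]
```

The second fold's training set contains the first fold's test window, which is what "expanding" means here. Test windows never overlap, so pooled results count each date once.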

Four implications follow:

The reliability filter does not generalize on this dataset. ~6 months of price history isn’t enough to derive stable per-actor reliability classifications. The train-derived engine_reliable and engine_inverted labels appear to be largely noise on this window size.
The actor-page “Historical edge” / “Historically inverted” badges describe in-sample edge. They do not represent verified, generalizable historical edge. Treat them as “the engine had high in-sample hit rate on this actor in our backtest window” rather than “this signal is empirically reliable.”
The state-sliced numbers (CONFIRMED 72%, EARLY 71% at 20d) carry the same caveat. They’re computed on the same in-sample window that the rolling validation just contradicted.
The raw engine has modest structural signal. Avg excess of +1.41pp at 20d (vs −0.07pp random) is real but small — broadly comparable to price_momentum at +1.20pp. The structural primitives (NDS, state assignment) have value; the trust-filter layer on top adds noise on this corpus.

Honest verdict. The 61–64% headline in §08 is in-sample only. Out-of-sample, the trust filter washes out to near-random across multiple rolling windows. The system has real structural signal and useful diagnostic surfaces (NDS, state, source relevance, daily diffs, provenance), but the trust filter does not currently survive out-of-sample testing on this corpus. More history and better classification methodology (e.g. shrinkage toward sector means, hierarchical priors, more training data) are needed before the badges should be treated as predictive claims.

source · scripts/walkforward_validation.py, scripts/walkforward_rolling.py (outputs: walkforward_rolling_summary.md, walkforward_rolling_per_window.csv)

08c

Region calibration (F-007 V1)

§08b validates one global filter. §08c asks the opposite question: treating each (theme, direction) pair as its own region of the field, do specific regions calibrate? F-007 V1 joins every L2 expectation version (F-006) with realized 5d / 10d / 20d returns relative to QQQ and aggregates per region. Each observation uses only information available on its date — no look-ahead.

Output schema (one row per region): theme_id, direction_sign, n_obs, and per horizon a hit_rate + avg_excess_return for the full window plus first-half / second-half splits (stability check). Hit rate is direction-aligned: a region with direction_sign = +1 scores a hit when the actor’s forward return beats QQQ; a region with -1 scores a hit when it underperforms. Regions with direction_sign = 0 are kept in the observation table but excluded from hit-rate stats (the prediction had no direction).

Sample-size tiers (applied per horizon, not per region):

n ≥ 10 — public. Tier surfaces as a hit-rate chip on the actor page and as a full row on /architecture.
n 5–9 — limited. Hit rate is shown only on /architecture (the methods/debug surface) annotated as limited so a single streak isn’t mistaken for a stable read. Suppressed entirely on the actor page — an active claim attached to a limited region renders no chip, since the actor page is a reader surface and shouldn’t make a calibration claim with fewer than 10 observations.
n < 5 — insufficient. Hit rate is suppressed everywhere. The actor page renders an explicit history n/a · n=N marker next to such claims — an absence-of-calibration badge, not a hit-rate claim — so a reader sees the claim has no historical baseline rather than mistaking silence for tacit approval.
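The direction-aligned hit rate and the sample-size tiers reduce to two small functions. Names are hypothetical, and one detail is an assumption: an excess return of exactly zero is counted as a miss here, which the page does not specify.

```python
def region_hit_rate(direction_sign, excess_returns):
    """Direction-aligned hit rate for one region at one horizon.

    excess_returns: forward returns relative to QQQ, one per observation.
    Returns None for unsigned regions or empty input.
    """
    if direction_sign == 0 or not excess_returns:
        return None
    hits = sum(1 for r in excess_returns if r * direction_sign > 0)
    return hits / len(excess_returns)

def sample_tier(n_obs):
    """Tier for a region at a given horizon, per the thresholds above."""
    if n_obs >= 10:
        return "public"
    if n_obs >= 5:
        return "limited"
    return "insufficient"

bearish_obs = [-1.2, -0.4, 0.8, -2.0]          # actor underperformed 3 times out of 4
print(region_hit_rate(-1, bearish_obs))        # 0.75
print(sample_tier(len(bearish_obs)))           # insufficient
```

In the example the hit rate is 75%, but with n = 4 the tier is insufficient, so nothing would be shown beyond the history n/a marker.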

Baselines are computed at the same horizons so a region’s edge can be read as a delta, not an absolute. Three baselines are emitted:

all_expectations — pool of every signed observation. The honest baseline: what does an arbitrary L2 expectation pay on average?
random — 50% by definition. Useful as a sanity floor.
price_momentum — predicts direction from the sign of the trailing N-day return rather than from the L2 read. Tests whether the L2 layer is doing anything price-only momentum isn’t already doing.

Current state of the field (artifact data/derived/region_calibration.json / public/region_calibration.json):

REGIONS                  417 total · 44 public · 59 limited · 314 insufficient
OBSERVATIONS             2,775 total · 1,750 signed

BASELINE                 5d     10d    20d
all_expectations         49%    50%    51%
price_momentum           49%    48%    52%
random                   50%    50%    50%

PUBLIC REGIONS BEATING 5d BASELINE BY ≥10pp:     9 / 43
PUBLIC REGIONS WORSE THAN 5d BASELINE BY ≥10pp: 12 / 43

Honest read. Aggregate hit rates sit at chance. Price-momentum runs essentially identical to all-expectations, meaning the L2 layer’s direction-pick is not currently out-performing trailing-return sign in the average case. But the field is not uniform: at the 5d horizon, 21 of 43 public regions (roughly half) deviate from the baseline by ≥10pp in one direction or the other. The three clearest regions on each side:

REGIONS WHERE THE FIELD PAYS                            n    5d   20d
+ Salesforce’s AI-driven market dynamics                10   90%  100%
+ Critical minerals market volatility (bullish)         10   70%   80%
+ Oracle’s AI and multicloud growth (bullish)           14   71%   42%

REGIONS THE FIELD GETS CONSISTENTLY WRONG               n    5d   20d
- Intel’s AI strategy & challenges (bearish)            11    9%    9%
- Marvell Technology investment sentiment (bearish)     12   17%   25%
- Micron strategic expansion in AI chips (bearish)      13   23%   31%

The “wrong” regions are not noise — a hit rate near 10% on a directional prediction means the opposite was true 90% of the time. That’s a recoverable signal: a sign-flip on those specific regions would have paid. F-007 V1 does not act on this. It only measures. V2 will close the loop: re-weight L2 conviction by region calibration, and flag chronically inverted regions for upstream review (either the L1 theme is mislabeled, the L2 sign extractor is reading bullish/bearish in reverse on that theme, or the actor population genuinely fades the news).

What the chips on actor pages mean. A persistent claim attached to a public region (n ≥ 10) on an actor page carries a small calibration chip (e.g. hit 5d 70% · 20d 80% · n=10). It reports that region’s in-sample hit rate to date — pooled across every actor that has ever held the same (theme, direction) claim. It is not a calibrated probability for this specific actor and not a walk-forward result. Treat it as “this region of the field has paid 70% of the time historically”, not “this claim is 70% likely to play out.” Claims attached to limited or insufficient regions either render no chip or an explicit history n/a marker; they never make a hit-rate claim at low n.

Known limits of V1. Hit rates are in-sample (no train/test split per region). Sample sizes are small — even the largest public region has fewer than 100 observations. First-half / second-half splits are emitted but not yet enforced as a stability gate. Walk-forward per-region calibration, conviction re-weighting, and inverted-region detection are V2 work.

source · scripts/build_performance_regions.py, scripts/build_region_calibration.py (outputs: data/derived/performance_regions.parquet, data/derived/region_calibration.json)

Trust labels (frontend mapping)

The badges shown on actor pages are frontend renderings of the pipeline’s reliability classification. They were renamed from TRUSTED / CONTRARIAN / UNCERTAIN / NEW to descriptive labels so they read as observations of in-sample behavior rather than predictive claims:

engine_reliable    →  Historical edge
engine_inverted    →  Historically inverted (direction flipped on display)
engine_unreliable  →  No clear edge
insufficient       →  Insufficient history

These labels do not exist in the pipeline scripts. They are computed in the actor-page component from the underlying engine classification. They are deliberately a frontend concern so that the methodology layer stays separable from how the product chooses to surface it.

L1 field-pressure validation (F-001)

The L1 field layer (semantic density, novelty, drift, dispersion, source-weighted density) is operational and surfaced on actor pages as a debug panel. Before promoting any of its outputs to replace the existing event-count narr, the question is: does a field-derived pressure variant actually improve the state engine?

The validation backtest forks the deterministic state engine to accept a configurable pressure input, then runs 6 variants through the same rolling walk-forward harness used elsewhere: narr (baseline), event_count_7d, semantic_density_7d, source_weighted_density, novelty_adjusted_density, and density_momentum. All variants are normalized via per-actor, point-in-time z-score (90-day rolling baseline of strictly past values) mapped to the existing [0, 95] pressure range, so the engine’s thresholds apply uniformly.
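A point-in-time z-score of this kind can be sketched as below. The key property is that each day's baseline uses only strictly earlier values. The mapping from z to the pressure range (center 50, 15 points per standard deviation, clipped to [0, 95]) and the minimum-history rule are assumptions made for the example; the page does not state the actual mapping.

```python
from statistics import mean, pstdev

def pit_pressure(series, lookback=90, min_history=2, center=50.0, scale=15.0):
    """Per-actor point-in-time z-score mapped to [0, 95].

    series: one raw metric value per day for a single actor.
    Returns None for days without enough past data.
    """
    out = []
    for i, x in enumerate(series):
        past = series[max(0, i - lookback):i]    # strictly past values only
        if len(past) < min_history:
            out.append(None)
            continue
        sd = pstdev(past)
        z = 0.0 if sd == 0 else (x - mean(past)) / sd
        out.append(max(0.0, min(95.0, center + scale * z)))  # assumed mapping
    return out

print(pit_pressure([10, 10, 10, 10]))   # [None, None, 50.0, 50.0]
print(pit_pressure([1, 3, 5]))          # [None, None, 95.0]
```

Because today's value is excluded from its own baseline, appending future data never changes an earlier output, which is what keeps the backtest free of look-ahead.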

OUT-OF-SAMPLE @ 20D (POOLED ACROSS 2 ROLLING FOLDS)
VARIANT                    N_SCO   HIT   AVG_EXC (pp)   MED_EXC (pp)
narr (baseline)            896     53%   +1.49          +1.11
event_count_7d             945     53%   +1.35          +1.07
semantic_density_7d        862     54%   +1.47          +1.15
source_weighted_density    844     53%   +1.39          +1.11
novelty_adjusted_density   873     53%   +1.47          +1.12
density_momentum           420     51%   +1.45          +0.67

Verdict: do not promote yet. Field-derived variants do not materially beat event-count narr on this corpus. semantic_density_7d edges out narr by 1pp at 20d (54% vs 53%), which is below the noise threshold for two rolling folds. Average excess returns span only 0.14pp across all variants (+1.35 to +1.49). Median excess is within 0.05pp of narr for every variant except density_momentum, which is lower (+0.67 vs +1.11).

State stability is slightly improved by field variants but the difference is small. Field-pressure variants produce 0.331–0.338 state transitions/day vs narr’s 0.346 — about 3% fewer transitions per actor per day. False-spike rate (state changes reverting within 2 days) is comparable across all variants at 31–34%.
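The two stability metrics can be computed from a daily state sequence as sketched below. The exact definition used in scripts/l1_validation_backtest.py is not shown on this page, so the reading here is an assumption: a transition is any day whose state differs from the previous day's, and a false spike is a transition after which the prior state reappears within the next two days.

```python
def stability_metrics(states, revert_within=2):
    """states: one state label per day for a single actor, in date order."""
    changes = [i for i in range(1, len(states)) if states[i] != states[i - 1]]
    false_spikes = 0
    for i in changes:
        prior = states[i - 1]
        # Did the state return to its pre-change value within the window?
        if prior in states[i + 1:i + 1 + revert_within]:
            false_spikes += 1
    day_steps = max(len(states) - 1, 1)
    return {
        "transitions_per_day": len(changes) / day_steps,
        "false_spike_rate": false_spikes / len(changes) if changes else 0.0,
    }

seq = ["EARLY", "REPRICING", "EARLY", "EARLY", "MACRO", "MACRO", "MACRO"]
print(stability_metrics(seq))
# 3 transitions over 6 day-steps; only the first one reverts within 2 days
```

In the example, the one-day excursion to REPRICING counts as a false spike, while the later move to MACRO persists and does not.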

What this tells us: the L1 field layer is computing real, point-in-time-correct metrics — but on this short corpus and with this particular state engine, the additional semantic information doesn’t translate into measurably better out-of-sample predictions. The variants also don’t damage performance; they just don’t lift it.

Possible explanations the next iteration should investigate:

The state engine thresholds are tuned for event-count narr. Applying them to a different signal might lose information at the threshold boundaries. A variant-specific threshold sweep is the natural follow-up.
Data-quality issues at the field level. The disagreement report shows tickers (SNOW, VST) with high narr but near-zero density — these are events that survived embedding but had effectively empty body text, so their semantic neighborhood is degenerate. Improving the embed input (richer source-text capture) might help.
The 6-month corpus + 2 folds is too small to detect modest improvements. A 1pp lift is plausibly real but indistinguishable from noise at this sample size. Re-running once the corpus extends to 9–12 months will tell.

Honest call. L1 field instrumentation stays in shadow mode. The actor-page debug panel remains useful for inspection and a future home for density / novelty / drift surfaces, but the production state engine continues to consume narr as its pressure input. The credibility move was running the backtest before promoting; the data didn’t support the promotion.

source · scripts/l1_validation_backtest.py (outputs: data/derived/l1_validation_summary.md, l1_validation_summary.csv, l1_validation_per_window.csv, l1_validation_stability.csv, l1_validation_disagreements.csv)

Known limitations

Where the architecture essay and the implementation diverge today:

Field primitives (density, momentum, drift) are computed per‑storm‑window, not per actor. They feed cluster/storm tracking, not narr or NDS. The essay’s field-primitive language at the actor level is metaphorical in the current implementation.
narr is shallow. Event count + recency only. A semantic / embedding-based replacement is on the roadmap.
Actor linking has false positives. Substring matching with no word boundary checks; "arm" inside "alarm" is a real bug class.
State assignment is memoryless. No hysteresis or dwell-time constraint. State can flip daily.
Forward expectations (LLM) are not backtested. Only the deterministic engine is. Expectation quality is currently editorial.
Narrative direction is hardcoded. Six tickers marked bearish; the rest default bullish. Not corpus-derived.
The trust filter does not generalize on this dataset. Rolling walk-forward (§08b) puts the trusted variant at 52% at 20d pooled across windows — essentially random. The structural primitives (NDS, state, source relevance, daily diffs) have real signal; the reliability classification layer on top of them does not currently earn its keep. Treat the “Historical edge” / “Historically inverted” badges as in-sample descriptors, not predictive claims. More history and better classification methodology are needed.
Belief-revision (L3) is partial. Daily refresh and the daily diff exist; persistent expectation IDs with lifecycle events (born / strengthened / weakened / contradicted) do not yet.
Source relevance scoring is shipped (rule-based + LLM hybrid; weak/excluded hidden by default). Source provenance against the forward view is also scored (supports / neutral / contradicts).
Backtest window is short. ~6 months of price history; small per-actor samples in shorter train windows (the "insufficient" tier exists because of this).

The point of this page is auditability. If a claim about the system is not documented here, it should be treated as marketing language rather than methodology. Updates will be dated below.

last reviewed · 2026-05-15

related reading

from information streams to stacked fields · the architectural essay this page grounds
backtesting topicspace · what worked and what didn't in the S4 / B4 strategy track
market glossary (archived) · term definitions used across the original AI-ecosystem product


sue@topicspace.ai
