
Backtesting topicspace: what worked, what didn’t

We tested whether narrative-price state could improve stock selection in the AI ecosystem. The result was more specific than expected: raw dislocation was the ranking core, while state and sector context made the signal more disciplined and investable.

topicspace research · May 2026 · 9 min read

Following a fast-moving sector like AI is not mainly a headline problem. It is a timing and interpretation problem.

The same broad story can show up in very different ways. One name may have a strong narrative with no price follow-through. Another may be in clean confirmation. A third may already be stretched because price ran ahead of the story.

That is the distinction topicspace is built around. The question behind this backtest was simple: does that framework actually help?

We tested a narrative-price model across a 250-day window using a reconstructed historical signal built from event activity and relative price return. The goal was not to prove that every label in the system predicts returns. It was to see whether narrative-price state, sector context, and portfolio construction could produce a better selection process than simpler alternatives. The answer was yes — but in a more specific way than we expected.

thesis
Raw narrative-price dislocation was the ranking core. State and sector context made the signal more investable — not by improving the raw ranking function, but by determining where that function was valid and where it was misleading.
01

What we tested

The ranking core of the model is NDS: a measure of the gap between narrative pressure and price behavior. When the narrative is strong but price has not followed, NDS is positive. When price has run ahead of the story, NDS is negative. The direction and magnitude of that gap is the raw signal.
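As a rough sketch, the score described here can be computed from three inputs. The formula follows the definition given later in the note (direction × (narr − 50) − rel × 5); the function and variable names are illustrative, not the production pipeline's.

```python
def nds(direction: int, narr: float, rel: float) -> float:
    """Narrative-price dislocation score (illustrative sketch).

    direction: +1 for a bullish narrative, -1 for a bearish one
    narr:      narrative pressure on a 0-100 scale (50 = neutral)
    rel:       trailing relative price return vs the benchmark, in percent
    """
    return direction * (narr - 50) - rel * 5

# Strong bullish narrative, price flat: positive dislocation
print(nds(+1, narr=80, rel=0.0))   # 30.0
# Price has already run ahead of the story: dislocation goes negative
print(nds(+1, narr=80, rel=10.0))  # -20.0
```

The sign convention matches the text: a strong narrative without price follow-through pushes the score up, and a price run ahead of the story pulls it down.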

Around that, we tested four layers of structure: a state classification system, a sector eligibility matrix derived from event studies, a reversal overlay for semiconductors, and several portfolio construction approaches.

We also tested whether additional complexity at the ranking layer could improve on pure NDS. That test had a clean result.

NDS (raw dislocation score): The baseline ranking signal. How far is narrative pressure ahead of, or behind, current price behavior?
State × sector eligibility: An event-study-derived matrix determining which state-sector combinations have historically shown positive 10-day forward excess returns.
Reversal overlay: Semiconductors in seasoned bearish-narrative states (NEG_CONFIRMATION, DISAGREEMENT at day ≥ 5). Identifies mean-reversion setups.
Portfolio construction: Equal-weight top-5 per sleeve; capital-weighted 50/50 blend of the S4 sleeve and a broader NDS sleeve.
ML ranking overlay: GradientBoosting and Ridge models trained to rank the eligible pool by 10-day forward excess. Walk-forward, expanding window.
methodology note
250 trading days (May 2025 – April 2026). Reconstructed historical signal from the topicspace narrative pipeline. All strategies equal-weight, daily rebalance, 5-name max, long-only, vs QQQ excess. The backfill-dense first half and pipeline-dense second half split the window into two meaningfully different test subwindows.
02

What worked

The simplest thing worked best at the ranking layer: names with the strongest positive narrative-price gap tended to perform best. That held across most of the test window and across most sectors.

But raw ranking alone was not the whole story. State and sector context mattered because the same signal did not behave the same way everywhere. The same constructive-looking state — say, DIVERGENCE, where the narrative is building ahead of price — performed well in semiconductors and AI infrastructure but failed in growth software, which was negative at every state and every horizon tested. That made the useful unit of analysis more specific than state or sector in isolation. It was closer to state × sector × horizon.

01
State × sector is the unit of analysis, not state alone
The event study results showed strong heterogeneity by sector. DIVERGENCE was constructive in semiconductors and AI infrastructure but negative in AI platform names. Growth software was negative in every state at every horizon. Treating state as a universal signal would have inverted these distinctions.

Once we re-derived the eligibility matrix on the full 250-day corpus — replacing an earlier version built on a 60-day window — the strategy improved sharply. The old matrix had included AI Platform DIVERGENCE (42% hit rate, −1.3% avg) and several Cloud states below the threshold. Removing them and restricting to cells with n ≥ 5, hit rate ≥ 55%, and avg excess ≥ 0% at 10 days fixed the two worst sources of drag.
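Those thresholds translate directly into a filter over event-study cells. A minimal sketch, with hypothetical cell statistics (only the AI Platform numbers come from the text; the rest are illustrative):

```python
def eligible(cells):
    """Keep only (state, sector) cells meeting the note's 10-day criteria:
    n >= 5 observations, hit rate >= 55%, average excess >= 0%."""
    return {
        key for key, stats in cells.items()
        if stats["n"] >= 5
        and stats["hit_rate"] >= 0.55
        and stats["avg_excess_10d"] >= 0.0
    }

cells = {
    ("DIVERGENCE", "semis"):       {"n": 18, "hit_rate": 0.61, "avg_excess_10d": 0.021},
    ("DIVERGENCE", "ai_platform"): {"n": 12, "hit_rate": 0.42, "avg_excess_10d": -0.013},
    ("DIVERGENCE", "growth_sw"):   {"n": 9,  "hit_rate": 0.40, "avg_excess_10d": -0.020},
}
print(eligible(cells))  # only the semis cell survives the filter
```

Under this filter, the AI Platform DIVERGENCE cell (42% hit rate, negative average excess) drops out exactly as described above.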

02
The eligibility matrix is the most important structural decision
Replacing the 60-day matrix with one re-derived on the full 250-day corpus — removing AI Platform DIVERGENCE (42% hit rate, −1.3% avg) and restricting Cloud to two supported cells — added roughly 37 percentage points of cumulative excess and tightened max drawdown to −16.0%. The lesson is not that more cells are better. It is that poorly supported cells destroy more value than they add.

The strongest standalone sleeve from that process was S4: pure NDS ranking inside the re-derived sector-aware eligibility matrix, with a reversal overlay for semiconductors in seasoned negative-narrative states.

03
The reversal overlay added real value in semis
NEG_CONFIRMATION and DISAGREEMENT in semiconductors, after day 5 in state, showed 63% and 64% hit rates at 10 days with +4–6% average excess. Adding those positions to the S4 eligible pool improved the Sharpe and return in the second half of the test window, when the semi reversal setups were most active.
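The overlay's trigger condition is simple to express. A sketch, assuming an illustrative sector label:

```python
REVERSAL_STATES = {"NEG_CONFIRMATION", "DISAGREEMENT"}

def reversal_candidate(sector: str, state: str, days_in_state: int) -> bool:
    """Mean-reversion setup: semiconductors in a seasoned bearish-narrative
    state, i.e. at least 5 days into NEG_CONFIRMATION or DISAGREEMENT."""
    return (
        sector == "semis"
        and state in REVERSAL_STATES
        and days_in_state >= 5
    )

print(reversal_candidate("semis", "NEG_CONFIRMATION", 7))   # True
print(reversal_candidate("semis", "NEG_CONFIRMATION", 3))   # False: not seasoned yet
print(reversal_candidate("growth_sw", "DISAGREEMENT", 7))   # False: wrong sector
```

Positions passing this check are added to the S4 eligible pool, which is where the second-half improvement described above comes from.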

Just as important, S4 remained interpretable. We could explain every position: which eligibility cell it satisfied, why the NDS ranked it into the top five, and what the reversal overlay was doing. That is not a minor point — it is part of what makes the strategy trustworthy enough to shadow-track forward.

03

What didn’t work

Two things failed cleanly.

Union portfolio blending

S4 and the broader NDS-only sleeve (B4) are genuinely complementary — they have a return correlation of ρ = 0.52 and draw down at different times. But combining them as a union portfolio diluted the stronger sleeve with lower-quality exposures. The blend held an average of 7.6 positions, roughly 2.8 of which were B4-only names from sectors S4 had deliberately excluded (growth software, AI platform). That produced +64.7% cumulative excess at Sharpe 2.19 — worse than either standalone on return, and only marginally better on risk.

The diversification thesis was right. The implementation was wrong. Capital-weighted blending — averaging daily excess returns at 50/50 weight without pooling positions — fixed this by capturing the correlation benefit without diluting either sleeve.
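The mechanics of capital-weighted blending are easy to see on simulated sleeve returns (all numbers here are synthetic, not the backtest's):

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical daily excess-return series for two partially correlated sleeves
s4 = rng.normal(0.0020, 0.010, 250)
b4 = 0.5 * s4 + rng.normal(0.0015, 0.010, 250)

# Capital-weighted blending: average the daily excess returns at 50/50,
# without pooling the underlying positions across sleeves
blend = 0.5 * s4 + 0.5 * b4

# With correlation below 1, the blend's volatility sits below the average
# of the two sleeves' volatilities -- the diversification benefit
print(np.corrcoef(s4, b4)[0, 1] < 1.0)                 # True
print(blend.std() < 0.5 * (s4.std() + b4.std()))       # True
```

Because each sleeve's selection logic is left intact, no B4-only names from excluded sectors leak into the blend, which is exactly what union pooling got wrong.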

04
Union blending is the wrong implementation of two-sleeve diversification
Union blending adds positions from the weaker sleeve into the combined portfolio. Capital-weighted blending keeps each sleeve intact and captures the diversification benefit from their return correlation. The two approaches produced Sharpe 2.19 vs 2.44 on the same pair of strategies.

ML ranking overlay

We tested a GradientBoosting model trained to rank the S4 eligible pool by 10-day forward excess return. Features included NDS, days in state, event counts, sector, and direction. Walk-forward training with an expanding window. The result was clear: OOS Spearman ρ between predicted and actual 10-day returns was +0.035. That is indistinguishable from zero ranking ability.
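The reported +0.035 is a Spearman rank correlation between predicted and realized returns. A self-contained sketch of that metric, with synthetic data standing in for the model's predictions:

```python
import numpy as np

def spearman(pred, actual):
    """Spearman rho: Pearson correlation of the two rank vectors
    (no ties expected with continuous data)."""
    rank_p = np.argsort(np.argsort(pred))
    rank_a = np.argsort(np.argsort(actual))
    return np.corrcoef(rank_p, rank_a)[0, 1]

rng = np.random.default_rng(1)
actual = rng.normal(0.0, 0.05, 2400)   # noisy 10-day forward excess returns
pred = rng.normal(0.0, 1.0, 2400)      # predictions with no real ranking skill

# A skill-free ranker lands near zero, in the same neighborhood
# as the +0.035 reported for the GradientBoosting overlay
print(abs(spearman(pred, actual)) < 0.1)   # True
```

At ~2,400 observations, a rho of +0.035 is within sampling noise of a ranker with no skill at all, which is the point of the result.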

We also tested events_7d — the only genuinely new feature in the ML model beyond what S4 already encodes — as a handcrafted secondary score. In both cases (the learned ranker and the handcrafted score), the overlay hurt performance on the days it changed the portfolio.

05
ML ranking does not add value at this sample size
250 days × 32 actors × ~30% eligibility rate ≈ 2,400 training observations for a 40-feature model predicting noisy 10-day returns. The ranking track is closed. Reopen only with 500+ days of live production data, when the training corpus becomes viable.
analyst note
The negative ML result is not a failure of the approach. It is a data constraint. The same features that did not support learned ranking at 250 days may support it at 500. The important thing is not to tune toward noise in the interim.
04

What the backtest suggests

The backtest changed how we think about the model.

The ranking core is simpler than we expected. Raw narrative-price dislocation — NDS — does most of the work. The rest of the framework does not add a better universal ranking function. What it adds is discipline: where the signal is valid, where the same-looking state behaves differently by sector, where constructive setups are real, and where price has already outrun the story.

That turns out to matter more than another layer of model complexity.

04 · Portfolio construction
Capital-weighted blend of S4 and a broader NDS sleeve. Preserves each sleeve's concentration. Captures diversification benefit from low return correlation (ρ = 0.52).
03 · Reversal overlay
Semiconductors in seasoned bearish-narrative states (NEG_CONFIRMATION or DISAGREEMENT, days ≥ 5). Identifies mean-reversion setups specific to semi-sector dynamics.
02 · State × sector eligibility
Each state is tested against each sector at 5D, 10D, 20D horizons in the event study corpus. Only cells with n ≥ 5, hit rate ≥ 55%, and avg excess ≥ 0% at 10D are eligible.
01 · NDS ranking core
Narrative-price dislocation score: direction × (narr − 50) − rel × 5. Measures the gap between narrative pressure and current price behavior. The strongest raw ranking signal in the backtest.

Each layer narrows the eligible pool and improves the signal-to-noise ratio. The ranking core does the heaviest work; the eligibility layer prevents it from firing in unfavorable sector-state combinations.
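The layered flow can be sketched as a small selection function (field names and sector labels are illustrative):

```python
def select_top5(universe, eligible_cells):
    """Layered selection: eligibility filter first, NDS ranking second,
    equal-weight top-5 last."""
    pool = [a for a in universe if (a["state"], a["sector"]) in eligible_cells]
    return sorted(pool, key=lambda a: a["nds"], reverse=True)[:5]

eligible_cells = {("DIVERGENCE", "semis"), ("DIVERGENCE", "ai_infra")}
universe = [
    {"name": "A", "state": "DIVERGENCE", "sector": "semis",     "nds": 31.0},
    {"name": "B", "state": "DIVERGENCE", "sector": "growth_sw", "nds": 45.0},  # filtered out
    {"name": "C", "state": "DIVERGENCE", "sector": "ai_infra",  "nds": 12.0},
]
print([a["name"] for a in select_top5(universe, eligible_cells)])  # ['A', 'C']
```

Note that the highest-NDS name never makes the portfolio: its sector-state cell is ineligible, so the ranking function never sees it. That ordering of layers is the whole design.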

This layered architecture also explains why the ML overlay failed. The model was trying to learn a ranking function inside a pool that had already been filtered to the most promising sector-state combinations. Within that constrained pool, the residual cross-sectional variation in 10-day returns was close to noise. There was not enough signal left for the ML model to extract — because the eligibility layer had already done the work.

State and sector eligibility is not a supplement to the ranking signal. It is the primary quality filter. Getting that matrix right matters more than improving the ranking function within it.

05

Portfolio result

The best risk-adjusted implementation in the test was a capital-weighted blend of S4 and B4. S4 and B4 are genuinely complementary — they draw down at different times, and their return correlation (ρ = 0.52) is low enough that blending them improves Sharpe without requiring either sleeve to compromise its selection logic.

Strategy | Cum. excess | Sharpe | Max DD | Construction note
S4 (rule-based sleeve) | +80.3% | 2.10 | −16.0% | NDS ranking · state/sector eligibility · reversal overlay
Cap S4+B4 (portfolio benchmark) | +78.0% | 2.44 | −17.6% | 50/50 capital-weighted blend · preserves each sleeve's concentration
B4 (raw NDS baseline) | +72.2% | 2.15 | −22.2% | NDS > 0 top-5 · no state or sector logic
B3 (price momentum) | +77.0% | 1.86 | −29.8% | Top-5 by trailing relative return · no narrative signal

250 trading days, May 2025–April 2026. Cumulative excess and max drawdown vs QQQ. Sharpe computed on daily excess returns. All strategies equal-weight, daily rebalance, 5-name max.
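For reference, a sketch of how these metrics are conventionally computed from daily excess returns; the 252-day annualization convention is our assumption, not stated in the note:

```python
import numpy as np

def sharpe(daily_excess, periods_per_year=252):
    """Annualized Sharpe on daily excess returns (assumed convention)."""
    return daily_excess.mean() / daily_excess.std() * np.sqrt(periods_per_year)

def max_drawdown(daily_excess):
    """Largest peak-to-trough drop of the compounded excess-return curve."""
    curve = np.cumprod(1.0 + daily_excess)
    peak = np.maximum.accumulate(curve)
    return (curve / peak - 1.0).min()

r = np.array([0.10, -0.10, 0.05])   # toy series, not backtest data
print(round(max_drawdown(r), 4))    # -0.1: the -10% day off the first peak
```

Max drawdown here is measured on the excess-return curve vs QQQ, not on absolute NAV, matching how the table reports it.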

The capital-weighted blend did not maximize raw return — S4 alone produced the highest cumulative excess. But it produced the best Sharpe in the study (2.44 vs 2.10 for S4 alone), and it behaved more consistently across subwindows. In the harder second half of the test period (November 2025 – April 2026), S4 earned +5.1% cumulative excess. The capital-weighted blend earned +10.3%, because B4 carried that period more effectively than S4 did.

That is probably the right way to think about topicspace going forward: not as a single universal signal, but as a signal architecture with a ranking core, an eligibility layer, and a portfolio construction layer — where each layer serves a distinct purpose and is sized accordingly.

what this is not
These results reflect a single 250-day window, one market environment, and one universe of 32 AI-ecosystem names. The strategies are not live-traded. Cumulative excess vs a passive benchmark is a useful yardstick for signal quality research, but it is not a performance record. Forward validation is the appropriate next step.
06

What comes next

The ranking track is closed. At this sample size, additional model complexity did not earn its way into the stack — neither the ML overlay nor the event-volume tiebreaker. The current conclusion is more modest, but more useful.

The next step is forward validation: running S4 and Cap S4+B4 live in shadow mode, extending the event history, and building the longer dataset that would justify reopening learned ranking later. The conditions for that reopen are specific: 500+ days of production data, roughly 4,000+ eligible observations, and enough variation across states and sectors to support cross-sectional return prediction.

What the backtest established is a clear baseline — not a finished system. Narrative-pressure signal is real. Sector validity is conditional. State logic helps make the signal investable. And the strongest version of the system, at least so far, is still the interpretable one.

Narrative-price dislocation is a real signal. Sector context is what makes it investable. And the test result that mattered most was negative: at 250 days, the rule-based system still outperforms anything we could learn from the data.

related reading
Modeling narrative and price · How the narrative-price gap is measured and what it has historically meant
Board · Live narrative-price state across the AI ecosystem
Briefing · Daily synthesis and system-level read on the AI ecosystem
Glossary · Definitions for NDS, rel, state types, and setup labels