Backtesting topicspace: what worked, what didn’t
We tested whether narrative-price state could improve stock selection in the AI ecosystem. The result was more specific than expected: raw dislocation was the ranking core, while state and sector context made the signal more disciplined and investable.
Following a fast-moving sector like AI is not mainly a headline problem. It is a timing and interpretation problem.
The same broad story can show up in very different ways. One name may have a strong narrative with no price follow-through. Another may be in clean confirmation. A third may already be stretched because price ran ahead of the story.
That is the distinction topicspace is built around. The question behind this backtest was simple: does that framework actually help?
We tested a narrative-price model across a 250-day window using a reconstructed historical signal built from event activity and relative price return. The goal was not to prove that every label in the system predicts returns. It was to see whether narrative-price state, sector context, and portfolio construction could produce a better selection process than simpler alternatives. The answer was yes — but in a more specific way than we expected.
What we tested
The ranking core of the model is NDS: a measure of the gap between narrative pressure and price behavior. When the narrative is strong but price has not followed, NDS is positive. When price has run ahead of the story, NDS is negative. The direction and magnitude of that gap are the raw signal.
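The exact NDS construction is not spelled out in this writeup, so the following is a minimal sketch, assuming narrative pressure and price follow-through are both expressed as cross-sectional z-scores over the same universe. The function names and the simple subtraction form are illustrative assumptions, not the production formula.

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Cross-sectional z-score over the current universe."""
    return (x - x.mean()) / x.std()

def nds(narrative_pressure: np.ndarray, relative_return: np.ndarray) -> np.ndarray:
    """Hypothetical NDS: narrative pressure minus price follow-through.

    Positive when the story is running ahead of price, negative when
    price has run ahead of the story. The production formula may weight
    or smooth these inputs differently.
    """
    return zscore(narrative_pressure) - zscore(relative_return)

# Five names: raw event activity vs 20-day return relative to the sector.
events = np.array([3.1, 0.4, 1.8, 0.2, 2.5])         # narrative pressure proxy
rel_ret = np.array([0.01, 0.06, 0.02, -0.01, 0.09])  # price follow-through
print(nds(events, rel_ret))  # most positive = strongest dislocation
```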
Around that, we tested four layers of structure: a state classification system, a sector eligibility matrix derived from event studies, a reversal overlay for semiconductors, and several portfolio construction approaches.
We also tested whether additional complexity at the ranking layer could improve on pure NDS. That test had a clean result.
What worked
The simplest thing worked best at the ranking layer: names with the strongest positive narrative-price gap tended to perform best. That held across most of the test window and across most sectors.
But raw ranking alone was not the whole story. State and sector context mattered because the same signal did not behave the same way everywhere. The same constructive-looking state (say, DIVERGENCE, where the narrative is building ahead of price) performed well in semiconductors and AI infrastructure but failed in growth software, which was negative in every state and at every horizon tested. That made the useful unit of analysis more specific than state or sector in isolation. It was closer to state × sector × horizon.
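The practical consequence is that performance has to be tabulated per cell rather than per state. A minimal pandas sketch of that tabulation, with the table layout and all column names assumed:

```python
import pandas as pd

# Hypothetical observation table: one row per (ticker, date), with the
# state label, sector, and forward excess returns at several horizons.
obs = pd.read_parquet("observations.parquet")  # column names assumed

cells = (
    obs.melt(
        id_vars=["state", "sector"],
        value_vars=["excess_5d", "excess_10d", "excess_20d"],
        var_name="horizon",
        value_name="excess",
    )
    .groupby(["state", "sector", "horizon"])["excess"]
    .agg(n="count", hit_rate=lambda s: (s > 0).mean(), avg_excess="mean")
    .reset_index()
)
# e.g. (DIVERGENCE, Semiconductors, excess_10d) vs
#      (DIVERGENCE, Growth Software, excess_10d)
print(cells.sort_values("avg_excess", ascending=False).head())
```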
Once we re-derived the eligibility matrix on the full 250-day corpus — replacing an earlier version built on a 60-day window — the strategy improved sharply. The old matrix had included AI Platform DIVERGENCE (42% hit rate, −1.3% avg) and several Cloud states below the threshold. Removing them and restricting to cells with n ≥ 5, hit rate ≥ 55%, and avg excess ≥ 0% at 10 days fixed the two worst sources of drag.
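Mechanically, the re-derivation is just a filter over that per-cell table. Continuing the sketch above, with the thresholds taken directly from the text:

```python
# Eligibility at the 10-day horizon, using the thresholds from the
# re-derivation: n >= 5, hit rate >= 55%, average excess >= 0%.
h10 = cells[cells["horizon"] == "excess_10d"]
keep = (h10["n"] >= 5) & (h10["hit_rate"] >= 0.55) & (h10["avg_excess"] >= 0.0)
eligible_cells = set(zip(h10.loc[keep, "state"], h10.loc[keep, "sector"]))
# A cell like ("DIVERGENCE", "AI Platform") at a 42% hit rate fails the
# hit-rate test and drops out, which is exactly what fixed the drag.
```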
The strongest standalone sleeve from that process was S4: pure NDS ranking inside the re-derived sector-aware eligibility matrix, with a reversal overlay for semiconductors in seasoned negative-narrative states.
Just as important, S4 remained interpretable. We could explain every position: which eligibility cell it satisfied, why its NDS ranked it into the top five, and what the reversal overlay was doing. That is not a minor point; it is part of what makes the strategy trustworthy enough to shadow-track forward.
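For concreteness, a single S4 rebalance looks roughly like the sketch below. The eligibility filter and pure NDS ranking follow the description above; the reversal trigger (the state label convention and the seasoning threshold) is an assumption, not the production rule.

```python
import pandas as pd

def s4_select(day: pd.DataFrame, eligible_cells: set, top_n: int = 5) -> pd.DataFrame:
    """One S4 rebalance: sector-state eligibility filter, then pure NDS ranking.

    `day` holds one row per ticker with columns state, sector, nds, and
    days_in_state (names assumed). The reversal overlay shown is a sketch:
    in semiconductors, a seasoned negative-narrative state is traded
    against the sign of the raw signal.
    """
    in_cell = day.apply(lambda r: (r["state"], r["sector"]) in eligible_cells, axis=1)
    pool = day[in_cell].copy()
    pool["score"] = pool["nds"]
    seasoned_neg = (
        (pool["sector"] == "Semiconductors")
        & (pool["state"].str.startswith("NEG"))  # label convention assumed
        & (pool["days_in_state"] >= 10)          # seasoning threshold assumed
    )
    pool.loc[seasoned_neg, "score"] *= -1        # reversal overlay
    return pool.nlargest(top_n, "score")
```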
What didn’t work
Two things failed cleanly.
Union portfolio blending
S4 and the broader NDS-only sleeve (B4) are genuinely complementary — they have a return correlation of ρ = 0.52 and draw down at different times. But combining them as a union portfolio diluted the stronger sleeve with lower-quality exposures. The blend held an average of 7.6 positions, roughly 2.8 of which were B4-only names from sectors S4 had deliberately excluded (growth software, AI platform). That produced +64.7% cumulative excess at Sharpe 2.19 — worse than either standalone on return, and only marginally better on risk.
The diversification thesis was right. The implementation was wrong. Capital-weighted blending — averaging daily excess returns at 50/50 weight without pooling positions — fixed this by capturing the correlation benefit without diluting either sleeve.
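The distinction is easy to state in code: blend the return streams, not the holdings. A sketch, assuming each sleeve's daily excess-return series is already computed on a shared date index:

```python
import pandas as pd

def capital_weighted_blend(s4_excess: pd.Series, b4_excess: pd.Series,
                           w: float = 0.5) -> pd.Series:
    """50/50 capital split: each sleeve runs its own book, and the daily
    excess returns are averaged. No positions are pooled, so S4's sector
    exclusions stay excluded inside its half of the capital."""
    return w * s4_excess + (1.0 - w) * b4_excess

# A union portfolio, by contrast, merges the two holdings lists and
# equal-weights the combined names, which lets B4-only exposures
# (growth software, AI platform) dilute S4's selection.
```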
ML ranking overlay
We tested a GradientBoosting model trained to rank the S4 eligible pool by 10-day forward excess return. Features included NDS, days in state, event counts, sector, and direction; training was walk-forward on an expanding window. The result was clear: out-of-sample Spearman ρ between predicted and actual 10-day returns was +0.035. That is indistinguishable from zero ranking ability.
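For readers who want to reproduce the setup, the evaluation looks roughly like the sketch below, using scikit-learn's GradientBoostingRegressor as a stand-in for the model we ran. The feature encodings and the minimum training window are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

FEATURES = ["nds", "days_in_state", "events_7d", "sector_id", "direction"]  # encodings assumed

def walk_forward_spearman(df: pd.DataFrame, min_train_days: int = 60) -> float:
    """Expanding-window walk-forward: train on all days before t, predict
    day t's eligible pool, score with cross-sectional Spearman rho against
    realized 10-day forward excess returns. A production version would
    also embargo the last 10 training days to avoid forward-return overlap.
    """
    days = sorted(df["date"].unique())
    rhos = []
    for t in days[min_train_days:]:
        train, test = df[df["date"] < t], df[df["date"] == t]
        if len(test) < 3:
            continue  # rank correlation is meaningless on a tiny cross-section
        model = GradientBoostingRegressor(random_state=0)
        model.fit(train[FEATURES], train["excess_10d"])
        rho, _ = spearmanr(model.predict(test[FEATURES]), test["excess_10d"])
        rhos.append(rho)
    return float(np.nanmean(rhos))  # our analogue of this came out near +0.035
```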
We also tested events_7d, the only genuinely new feature in the ML model beyond what S4 already encodes, as a handcrafted secondary score. In both cases, learned and handcrafted, the overlay hurt performance on the days it changed the portfolio.
What the backtest suggests
The backtest changed how we think about the model.
The ranking core is simpler than we expected. Raw narrative-price dislocation — NDS — does most of the work. The rest of the framework does not add a better universal ranking function. What it adds is discipline: where the signal is valid, where the same-looking state behaves differently by sector, where constructive setups are real, and where price has already outrun the story.
That turns out to matter more than another layer of model complexity.
Each layer narrows the eligible pool and improves the signal-to-noise ratio. The ranking core does the heaviest work; the eligibility layer prevents it from firing in unfavorable sector-state combinations.
This layered architecture also explains why the ML overlay failed. The model was trying to learn a ranking function inside a pool that had already been filtered to the most promising sector-state combinations. Within that constrained pool, the residual cross-sectional variation in 10-day returns was close to noise. There was not enough signal left for the ML model to extract — because the eligibility layer had already done the work.
State and sector eligibility is not a supplement to the ranking signal. It is the primary quality filter. Getting that matrix right matters more than improving the ranking function within it.
Portfolio result
The best risk-adjusted implementation in the test was a capital-weighted blend of S4 and B4. As noted above, the two sleeves draw down at different times, and their return correlation (ρ = 0.52) is low enough that blending improves Sharpe without requiring either sleeve to compromise its selection logic.
Test setup: 250 trading days, May 2025 to April 2026. Cumulative excess and max drawdown are measured against QQQ; Sharpe is computed on daily excess returns. All strategies are equal-weight with daily rebalancing and a five-name maximum.
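For reference, the headline metrics follow standard conventions on daily excess returns. The compounding and annualization choices below are assumptions, not taken from the production pipeline:

```python
import numpy as np
import pandas as pd

def summarize(excess: pd.Series, periods_per_year: int = 252) -> dict:
    """Cumulative excess, Sharpe on daily excess returns, and max drawdown
    of the compounded excess curve. Conventions assumed, not taken from
    the original pipeline."""
    curve = (1.0 + excess).cumprod()
    sharpe = excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year)
    drawdown = curve / curve.cummax() - 1.0
    return {
        "cum_excess": float(curve.iloc[-1] - 1.0),
        "sharpe": float(sharpe),
        "max_drawdown": float(drawdown.min()),
    }
```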
The capital-weighted blend did not maximize raw return — S4 alone produced the highest cumulative excess. But it produced the best Sharpe in the study (2.44 vs 2.10 for S4 alone), and it behaved more consistently across subwindows. In the harder second half of the test period (November 2025 – April 2026), S4 earned +5.1% cumulative excess. The capital-weighted blend earned +10.3%, because B4 carried that period more effectively than S4 did.
That is probably the right way to think about topicspace going forward: not as a single universal signal, but as a signal architecture with a ranking core, an eligibility layer, and a portfolio construction layer — where each layer serves a distinct purpose and is sized accordingly.
What comes next
The ranking track is closed. At this sample size, additional model complexity did not earn its way into the stack — neither the ML overlay nor the event-volume tiebreaker. The current conclusion is more modest, but more useful.
The next step is forward validation: running S4 and the capital-weighted blend (Cap S4+B4) live in shadow mode, extending the event history, and building the longer dataset that would justify reopening learned ranking later. The conditions for that reopen are specific: 500+ days of production data, roughly 4,000+ eligible observations, and enough variation across states and sectors to support cross-sectional return prediction.
What the backtest established is a clear baseline, not a finished system. Narrative-price dislocation is a real signal; sector context tells you where it is valid, and state logic is what makes it investable. And the test result that mattered most was negative: at 250 days, the interpretable, rule-based system still outperforms anything we could learn from the data.