Region-based evaluation & guardrail calibration
Most AI evaluation collapses behavior into a single pass/fail score. A more useful question is where a model is reliable, where it's unstable, where risk priors are wrong, and where guardrails will over-fire or under-fire. A first attempt at region-based calibration, on real conversation logs.
Most AI evaluation and guardrail systems collapse model behavior into aggregate scores: pass rate, hallucination rate, refusal rate, policy violation rate. Those are useful but incomplete. They tell you whether the model passed a benchmark; they don’t tell you where the model is reliable and where it isn’t.
The more useful question is regional: where is the model predictable, where is it unstable, where are risk priors wrong, and where will guardrails over-fire or under-fire? A single aggregate score cannot answer any of these.
This piece walks through a first attempt at region-based evaluation on a real substrate — an AI assistant’s own conversation logs — using the same belief-revision architecture topicspace uses on markets. The architecture is documented at /governance; the live walkthrough with full per-region tables is at /governance/example. The result is honest and partly uncomfortable: the aggregate did not beat baseline. That is the headline. But the regional structure exposed exactly where evaluation and guardrail calibration should focus — which is the part that matters.
The problem with aggregate eval scores
A model is not reliable or unreliable in the aggregate. It is reliable in some regions, unstable in others, and miscalibrated in ways we can measure. Aggregate scoreboards obscure this by averaging across very different contexts.
A model that passes 75% of an eval might pass 95% on simple factual questions and 30% on decision-framed financial prompts. The 75% number is true but uninformative. The regional 95% / 30% split is what tells you where to ship and where to add controls. The same logic applies to guardrails: a guardrail isn’t simply “good” or “bad.” It catches certain regions cleanly, mis-fires in others, and is systematically wrong in a few.
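A small sketch of the arithmetic behind that example. The traffic shares are hypothetical (the text gives only the 75%, 95%, and 30% figures); they are solved for so that the two regional pass rates average to the aggregate, which shows how much the single number depends on a mix it never reports.

```python
def aggregate_pass_rate(regions):
    """Volume-weighted pass rate across regions.

    regions: list of (share_of_traffic, pass_rate) pairs; shares must sum to 1.
    """
    total_share = sum(share for share, _ in regions)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("region shares must sum to 1")
    return sum(share * rate for share, rate in regions)


# Hypothetical mix: solve for the share of simple factual questions that
# makes 95% and 30% average out to a 75% aggregate.
factual_share = (0.75 - 0.30) / (0.95 - 0.30)
mix = [(factual_share, 0.95), (1 - factual_share, 0.30)]

print(f"factual share: {factual_share:.3f}")
print(f"aggregate pass rate: {aggregate_pass_rate(mix):.2f}")
```

Under this assumed mix, roughly 69% of prompts are the easy kind. Shift the mix toward decision-framed prompts and the aggregate falls even though neither regional rate changed.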
The architecture below treats the question regionally: it learns where the regions are, predicts within each, and measures whether the predictions held up out-of-sample.
The architecture, in one map
Five layers, ported directly from the topicspace markets implementation. The canonical definition lives at /governance:
- L0 — Evidence. What happened? Every interaction is a timestamped, append-only event.
- L1 — Regions. What kind of situation is this? User turns embedded into vector space and clustered into learned regions.
- L2a — Behavior prediction. Given a region, what is the model likely to do next?
- L2b — Risk priors. Given a region, what risks are likely to fire?
- L3 — Lifecycle. How is each region’s pattern evolving over time? Born, strengthened, contradicted, inverted.
- L4 — Calibration. Walk-forward measurement: did the L2 predictions hold up on held-out data, region by region?
The key shape is that L2 carries two parallel prediction targets over the same regional substrate. Performance evaluation reads L2a calibration: how often did the model do what we expected? Guardrail calibration reads L2b: how often did each risk class actually fire compared to the prior we held? Same regions, same walk-forward discipline.
The experiment
The full per-region calibration tables and miscalibrated-cell views live at /governance/example. Method:
Substrate: 2,120 paired observations of (user turn → next assistant action) drawn from an AI software-engineering assistant’s conversation logs over roughly two months. User turns were embedded with text-embedding-3-small and clustered into k = 12 learned regions via K-means. Two prediction targets:
- L2a behavior: per-region distribution over the first tool the assistant calls next (Bash, Edit, Read, Write, … or text-only).
- L2b risk priors: per-region probability that each of four risk classes fires next: destructive operation (Tier 1, regex), over-engineering (Tier 2, LLM judge), over-claiming (Tier 2, LLM judge), unverified claim (Tier 3, LLM judge).
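The two targets can be sketched as frequency tables built from the same region-labelled training observations. This is a minimal illustration, not the implementation behind the experiment: the observation keys (`region`, `first_tool`, `risks`) are hypothetical, and of the risk identifiers only `over_engineering` and `unverified_claim` appear in the piece; the other two are named by analogy.

```python
from collections import Counter, defaultdict

# Assumed identifiers for the four risk classes described above.
RISK_CLASSES = (
    "destructive_operation",
    "over_engineering",
    "over_claiming",
    "unverified_claim",
)


def fit_region_tables(observations):
    """Build per-region L2a and L2b tables from training observations.

    Each observation is a dict with hypothetical keys:
      "region":     region id assigned by clustering
      "first_tool": "Bash", "Edit", ... or "text_only"
      "risks":      set of risk-class names that fired on the next turn
    """
    tool_counts = defaultdict(Counter)
    risk_counts = defaultdict(Counter)
    totals = Counter()
    for obs in observations:
        region = obs["region"]
        totals[region] += 1
        tool_counts[region][obs["first_tool"]] += 1
        for risk in obs["risks"]:
            risk_counts[region][risk] += 1

    # L2a: empirical distribution over the first tool, per region.
    behavior = {
        r: {tool: n / totals[r] for tool, n in tool_counts[r].items()}
        for r in totals
    }
    # L2b: empirical firing rate of each risk class, per region.
    risk_priors = {
        r: {risk: risk_counts[r][risk] / totals[r] for risk in RISK_CLASSES}
        for r in totals
    }
    return behavior, risk_priors


# Toy training data, invented for illustration.
train = [
    {"region": 0, "first_tool": "Bash", "risks": {"over_engineering"}},
    {"region": 0, "first_tool": "Bash", "risks": set()},
    {"region": 0, "first_tool": "text_only", "risks": set()},
    {"region": 0, "first_tool": "Edit", "risks": {"over_engineering", "unverified_claim"}},
    {"region": 1, "first_tool": "text_only", "risks": set()},
]
behavior, risk_priors = fit_region_tables(train)
print(behavior[0])
print(risk_priors[0])
```

Both tables are keyed by the same region ids, which is what lets performance evaluation and guardrail calibration share one substrate.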
Chronological 80 / 20 train / test split. For each test turn: find the nearest region centroid, look up the region’s training-time predictions, score top-1 / top-3 / Brier on behavior and prior vs actual rate on risks. Walk-forward only — predictions never see the held-out future.
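The scoring loop described above might look roughly like this. It is a sketch under stated assumptions: embeddings are plain lists of floats, distance is Euclidean, the `ts` key is a hypothetical timestamp field, and Brier is taken as the sum of squared errors over classes (so it ranges from 0 to 2). The centroids, distributions, and turns are toy values, not data from the experiment.

```python
import math


def chronological_split(rows, train_frac=0.8):
    """Order rows by timestamp and cut once, so the test set is strictly later."""
    ordered = sorted(rows, key=lambda row: row["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]


def nearest_region(vec, centroids):
    """Return the id of the centroid closest to vec (Euclidean distance)."""
    return min(centroids, key=lambda r: math.dist(vec, centroids[r]))


def score_turn(pred, actual):
    """Score one held-out turn against a region's predicted tool distribution."""
    ranked = sorted(pred, key=pred.get, reverse=True)
    classes = set(pred) | {actual}
    brier = sum(
        (pred.get(c, 0.0) - (1.0 if c == actual else 0.0)) ** 2 for c in classes
    )
    return {"top1": ranked[0] == actual, "top3": actual in ranked[:3], "brier": brier}


def walk_forward(turns, centroids, behavior):
    """Average top-1, top-3, and Brier over held-out (embedding, actual_tool) pairs."""
    scores = [
        score_turn(behavior[nearest_region(vec, centroids)], actual)
        for vec, actual in turns
    ]
    n = len(scores)
    return {key: sum(s[key] for s in scores) / n for key in ("top1", "top3", "brier")}


# Toy inputs: two regions in a 2-d embedding space.
centroids = {0: [0.0, 0.0], 1: [1.0, 1.0]}
behavior = {
    0: {"Bash": 0.6, "Edit": 0.3, "text_only": 0.1},
    1: {"text_only": 0.8, "Read": 0.2},
}
turns = [([0.1, 0.0], "Bash"), ([0.9, 1.0], "Read")]
result = walk_forward(turns, centroids, behavior)
print(result)
```

The region tables are fitted on the training slice only and looked up, never updated, while scoring; that is what keeps the measurement walk-forward.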
The honest result
On held-out behavior prediction:
| metric | value | note |
|---|---|---|
| top-1 accuracy | 57% | region’s most-likely tool matches actual |
| top-3 accuracy | 93% | actual tool in region’s top-3 predicted |
| Brier | 0.692 | multiclass squared error of predicted distribution |
| cohort baseline | 60% | predict the global mode (text_only) for every turn |
| lift vs baseline | −2.8pp | regions did not beat the dumb baseline |
The aggregate did not beat baseline. That is the honest headline. A dumb predictor that ignores regions entirely and just predicts the global mode for every turn does slightly better than the region-conditioned predictor on the held-out set.
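For reference, the baseline comparison can be sketched in a few lines. One assumption is labelled in the code: the mode is taken from the training actions only, which keeps the baseline walk-forward; the piece says "global mode" without specifying. The action lists are toy data.

```python
from collections import Counter


def mode_baseline_accuracy(train_actions, test_actions):
    """Predict the most common training action for every held-out turn.

    Assumption: the mode comes from the training slice, not the full log.
    """
    mode = Counter(train_actions).most_common(1)[0][0]
    accuracy = sum(action == mode for action in test_actions) / len(test_actions)
    return mode, accuracy


def lift_pp(model_accuracy, baseline_accuracy):
    """Lift of the region-conditioned predictor over baseline, in percentage points."""
    return (model_accuracy - baseline_accuracy) * 100


# Toy action logs, invented for illustration.
train_actions = ["text_only"] * 6 + ["Bash"] * 3 + ["Edit"]
test_actions = ["text_only", "Bash", "text_only", "Edit", "text_only"]
mode, baseline = mode_baseline_accuracy(train_actions, test_actions)
print(mode, baseline)
print(f"{lift_pp(0.57, 0.60):.1f}pp")
```

Plugging in the rounded table values gives about −3pp; the reported −2.8pp presumably reflects the unrounded accuracies.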
This is not a failure of the architecture. V1 measurement layers are supposed to reveal whether signal is broad, narrow, noisy, or absent. The job of V1 is to measure honestly without acting. The aggregate’s job is to anchor the question; the regional structure’s job is to answer it.
The finding that matters: regional heterogeneity
The aggregate lands slightly below baseline, but the regional structure is sharp.
Some regions calibrated cleanly — data analysis and reporting at 88% top-1, data validation request at 76%, design modification request at 67%. The L2a predictions in those regions consistently matched what the assistant actually did.
Other regions failed badly. The session continuation / update and task notifications regions both came in at 12% top-1, and the report update request region at 0%. In those regions, the training-time prediction was systematically wrong on the held-out data.
The value is not aggregate lift; the value is knowing where behavior and risk are region-specific. A model is reliable in some regions and not in others. The architecture surfaces both classes by name. That’s the reliability map; the single number isn’t.
The low-calibration regions are also signal of a different kind. When a region’s held-out behavior diverges sharply from its training-time prediction, that often means the region itself is the wrong unit — bimodal, mixing two different patterns the clustering didn’t separate. L4 calibration can improve both L2 risk priors and L1 region definitions. The feedback edge is real.
Guardrail calibration
The L2b layer predicts, per region, the prior probability that each risk class fires in the next assistant turn. On the held-out test set:
| risk class | tier | actual rate (held-out) |
|---|---|---|
| destructive operation | Tier 1 (regex) | 0% |
| over-engineering | Tier 2 (LLM judge) | 44% |
| over-claiming | Tier 2 (LLM judge) | 10% |
| unverified claim | Tier 3 (LLM judge) | 24% |
Tier 1 is the clean safety baseline: zero destructive shell operations across the test window. Tier 2 finds the dominant behavioral risk: over-engineering at 44%, as scored by an LLM judge prompt that flags responses doing substantially more than asked. Tier 3’s unverified-claim rate of 24% maps to a semantic risk class analogous to hallucination: claims about files, code, or external state that weren’t verified in the immediate context.
The aggregate risk numbers are only half of what matters. The other half is whether the per-region prior matched the per-region actual rate. It often didn’t.
Two failure modes show up clearly:
- Over-fire candidates (purple). The training-time prior said the risk would fire often, but on held-out it didn’t. task notifications × unverified_claim: prior 53%, actual 5%. A guardrail trained on that prior would fire on safe inputs, annoying users.
- Under-fire candidates (red). The training-time prior said the risk was rare, but on held-out it dominated. Content Update Request × over_engineering: prior 13%, actual 57%. A guardrail trained on that prior would miss real violations.
Both are exactly what governance teams need to surface before a guardrail ships. A binary classifier that says “this output is risky” / “this output is safe” is not enough. Guardrails need calibration, not just classification.
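Flagging these cells is a short comparison of prior against held-out rate. The sketch below uses the two cells quoted above and the 20-point divergence threshold this piece applies to wrong priors; the function name and the labels `over_fire`, `under_fire`, and `calibrated` are hypothetical.

```python
def classify_cell(prior, actual, threshold_pp=20.0):
    """Label a (region, risk) cell by how its training prior compares to held-out reality."""
    gap_pp = (prior - actual) * 100
    if gap_pp >= threshold_pp:
        return "over_fire"    # prior too high: a guardrail would fire on safe inputs
    if gap_pp <= -threshold_pp:
        return "under_fire"   # prior too low: a guardrail would miss real violations
    return "calibrated"


# (region, risk class, training prior, held-out actual rate) from the examples above.
cells = [
    ("task notifications", "unverified_claim", 0.53, 0.05),
    ("Content Update Request", "over_engineering", 0.13, 0.57),
]
for region, risk, prior, actual in cells:
    label = classify_cell(prior, actual)
    print(f"{region} x {risk}: prior {prior:.0%}, actual {actual:.0%} -> {label}")
```

A fixed threshold is the simplest possible rule. It ignores how many held-out turns sit behind each cell, so small cells should be read with care.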
Why this matters operationally
The architecture answers, region by region, a set of questions an aggregate score cannot:
- Where is behavior predictable? Regions with high top-1 / low Brier.
- Where is performance unstable? Regions with high Brier or mode-shift across training weeks.
- Where are risk priors wrong? Cells where prior diverges from actual by ≥ 20pp.
- Where will guardrails over-fire? Over-predict cells.
- Where will guardrails under-fire? Under-predict cells.
- Where should monitoring or human review focus? Inverted regions and high-volume miscalibrated cells.
- Where should regions be split or re-clustered? Bimodal regions, where one prior fails to fit a heterogeneous reality.
This is the difference between a pass/fail scorecard and an operating system: the architecture doesn’t just report; it identifies what to do next.
The end state isn’t “Here are calibrated regions.” It’s: this region is strengthening, this control is inverted, this risk reopened after the model version bump, this claim should be retired, and here is the evidence. Region-based evaluation is the layer where each of those statements becomes measurable.
Limitations
This is V1, on one substrate. Several limitations matter:
- Substrate is a single user’s logs. Two months of one assistant’s sessions. Not cohort statistics.
- Horizon = 1 turn. The L2 predictions are only for the immediate next assistant action. Multi-turn task sequences would be richer.
- Action vocabulary is structural. Tool name (Bash, Edit, …) + text_only. A semantic vocabulary (LLM-classified into “explain” / “implement” / “ask”) would be more informative but introduces judge bias.
- LLM-judged risk labels are noisy. Only ~40% of the judge calls produced parseable structured output. Production work would tighten the prompt and use structured outputs natively.
- Small region slices. Some regions have fewer than five held-out test turns, making their per-row metrics individually unreliable.
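- k = 12 chosen, not learned. Heuristic. A principled choice would select the number of clusters with a criterion such as BIC, or use a density-based method such as HDBSCAN that does not fix it in advance.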
- Not a production guardrail. The architecture tests guardrails; it does not replace one. Real-time classification is a different system.
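To make the small-slice limitation concrete, here is a sketch that puts a Wilson score interval around a per-region accuracy. The interval is a standard binomial tool, not something the experiment reports, and the "0 of 4" case is a hypothetical slice size chosen for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical slice: a region with four held-out turns and no top-1 hits.
low, high = wilson_interval(0, 4)
print(f"0 of 4 correct: plausible top-1 between {low:.2f} and {high:.2f}")
```

With only four turns, an observed 0% is still compatible with a true top-1 rate approaching one half, so a row like that marks a region to watch rather than a settled failure.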
What this is for
The argument is simple: AI governance should move from aggregate scorecards to region-based reliability maps. The goal is not only to know whether the model passed an eval, but where the model is reliable, where it is unstable, and where controls will fail.
Topicspace has been running this architecture on financial-market narrative data for months. The same shape ports cleanly to AI evaluation: events, learned regions, forward expectations and risk priors per region, lifecycle of those expectations, calibration against held-out reality. The honest V1 finding here is the same honest V1 finding the markets implementation produced: aggregate at chance, regional structure carrying the signal.
V1 measures. V2 acts. The findings here would feed a V2 that re-clusters bimodal regions, sign-flips inverted risk priors, re-weights conviction based on calibration, and joins in human feedback as a stronger truth signal — closing the loop the architecture was built for.
A live walkthrough of the same experiment, with the full per-region calibration tables, is at /governance/example. The architecture is documented at /governance. The markets-side live instance: /architecture.