Archived research surface · Last refreshed Jun 1, 2026. Not currently maintained as a daily product.
EXAMPLE · REGION-BASED EVALUATION

Region-Based Evaluation & Guardrail Calibration — on Claude logs

Predictive port of the belief-revision architecture.

User turns are embedded into learned regions, then the system evaluates two forward-looking targets: expected assistant behavior and expected risk class. Both are scored on a chronological held-out set, so the page tests whether regional predictions actually generalize. The architecture is documented at /governance; the markets-side live instance is at /architecture.

Method
  • L0: Every (user turn → next assistant turn) pair becomes one observation. 2,120 pairs total.
  • L1: User turns embedded with text-embedding-3-small, clustered into k = 12 learned regions via K-means. Each region gets an LLM-generated sanitized label.
  • L2a: Behavior prediction. Per region, predict the next assistant action: direction = most-likely tool, conviction = max probability mass, horizon = 1 assistant turn. Smoothed via Laplace (α = 1.0) over the action vocabulary.
  • L2b: Risk-prior prediction. Per region, predict the prior probability that each risk class fires in the next assistant turn. Same regions, different prediction vocabulary. Powers guardrail calibration.
  • L3: Track each region’s top-action across training ISO weeks. Stable mode → strengthened; mode shifts across ≥3 distinct actions → contradicted.
  • L4: Chronological 80/20 train/test. For each test turn, find the nearest region centroid, look up that region’s training-time predictions (both L2a and L2b), score against actuals. Walk-forward only: predictions never see the held-out future.
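The chronological split and nearest-centroid lookup in L4 can be sketched roughly as follows. This is a minimal illustration, not the pipeline behind this page: the field names `ts` and `emb`, both helper names, the toy 2-D vectors, and the choice of Euclidean distance are all assumptions.

```python
from math import dist

def chronological_split(observations, train_frac=0.8):
    """Sort by timestamp and cut once, so every test item is later than all training items."""
    ordered = sorted(observations, key=lambda o: o["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def nearest_region(embedding, centroids):
    """Index of the centroid closest to the embedding (Euclidean distance is an assumption)."""
    return min(range(len(centroids)), key=lambda r: dist(embedding, centroids[r]))

# Hypothetical toy data: 2-D vectors stand in for real embedding vectors.
observations = [
    {"ts": 3, "emb": (0.9, 0.2), "action": "Bash"},
    {"ts": 1, "emb": (0.1, 0.8), "action": "text_only"},
    {"ts": 5, "emb": (0.2, 0.9), "action": "text_only"},
    {"ts": 2, "emb": (0.8, 0.1), "action": "Bash"},
    {"ts": 4, "emb": (0.7, 0.3), "action": "Read"},
]
train, test = chronological_split(observations)

# Centroids would be fit on training embeddings only; they are fixed here for brevity.
centroids = [(0.85, 0.15), (0.15, 0.85)]
test_regions = [nearest_region(o["emb"], centroids) for o in test]
```

Because the cut is made once on the sorted sequence, nothing observed after the cut can influence the centroids or the per-region predictions.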
Action vocabulary: Bash, Edit, Read, Write, MultiEdit, Grep, Glob, Task, TodoWrite, WebFetch, WebSearch, ToolSearch, text_only, other.
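A Laplace-smoothed per-region prediction over this vocabulary might look like the sketch below. The function name `region_prediction` and the toy action counts are hypothetical, and the sketch assumes every training action has already been mapped into the vocabulary.

```python
from collections import Counter

ACTIONS = ["Bash", "Edit", "Read", "Write", "MultiEdit", "Grep", "Glob", "Task",
           "TodoWrite", "WebFetch", "WebSearch", "ToolSearch", "text_only", "other"]

def region_prediction(train_actions, vocab=ACTIONS, alpha=1.0):
    """Laplace-smoothed action distribution for one region, plus direction and conviction."""
    counts = Counter(train_actions)  # assumes every action is already in vocab
    total = len(train_actions) + alpha * len(vocab)
    probs = {a: (counts[a] + alpha) / total for a in vocab}
    ranked = sorted(vocab, key=lambda a: probs[a], reverse=True)
    return {
        "direction": ranked[0],          # most likely next action
        "conviction": probs[ranked[0]],  # max probability mass
        "top3": ranked[:3],
        "probs": probs,
    }

# Hypothetical training actions for a single region.
pred = region_prediction(["Bash"] * 6 + ["Read"] * 3 + ["text_only"])
```

With α = 1.0 and 14 actions, every action keeps nonzero probability, which caps conviction in small regions: here Bash gets (6 + 1) / (10 + 14) = 7/24, about 0.29, even though it accounts for 60% of the raw counts.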
Overall — held-out walk-forward · 424 test turns
  • top-1: 57%
  • top-3: 93%
  • Brier: 0.692
  • baseline top-1: 60% (always predict "text_only")
  • lift vs baseline: -2.8pp
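The baseline comparison can be sketched as below, assuming the baseline is the constant "text_only" prediction named in the panel. The helper names and the five toy turns are hypothetical. The published -2.8pp is presumably computed from unrounded rates, since the rounded figures give 57 - 60 = -3.

```python
def top1_rate(predicted, actual):
    """Share of turns where the predicted action equals the actual action."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def lift_pp(model_top1, baseline_top1):
    """Lift over the baseline in percentage points."""
    return 100.0 * (model_top1 - baseline_top1)

# Hypothetical held-out turns.
actual = ["text_only", "text_only", "text_only", "Bash", "Read"]
model_pred = ["text_only", "Bash", "Bash", "Bash", "Edit"]
baseline_pred = ["text_only"] * len(actual)  # constant-prediction baseline

model_top1 = top1_rate(model_pred, actual)        # 2 of 5 correct
baseline_top1 = top1_rate(baseline_pred, actual)  # 3 of 5 correct
lift = lift_pp(model_top1, baseline_top1)
```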

In aggregate, the per-region predictions sit essentially at the cohort baseline (57% top-1 against a 60% baseline). Heterogeneity-within-regions is the V1 story, not aggregate lift. This mirrors the F-007 V1 finding on markets: aggregate at chance, real signal concentrated in specific regions.

Heterogeneity within regions — the V1 finding

Aggregate calibration sits at baseline, but per-region calibration is sharply bimodal: some regions cluster cleanly and their L2 predictions hold up on the walk-forward test; others cluster noisily and miss. This is the right kind of V1 result: the architecture surfaces which regions of conversation space the system can predict, and which it can’t.

regions where predictions hit
  • visual design adjustment — predicts Read at conv 0.34; walk-forward top-1 100% on n_test=3
  • data analysis and reporting — predicts text_only at conv 0.34; walk-forward top-1 88% on n_test=92
  • data validation request — predicts text_only at conv 0.27; walk-forward top-1 76% on n_test=123
regions where predictions miss
  • task execution request — predicts Bash at conv 0.35; walk-forward top-1 0% on n_test=1
  • report update request — predicts Bash at conv 0.36; walk-forward top-1 0% on n_test=5
  • task notifications — predicts text_only at conv 0.51; walk-forward top-1 12% on n_test=86
L1 + L2a + L4 · Per-region behavior calibration: predictions + walk-forward · 12 regions

One row per learned region. Each L2 prediction (direction + conviction + top-3) was set from training-only data; the top-1 / top-3 / Brier columns are walk-forward measurements against the held-out test set. Lifecycle status is inferred from prediction stability across training weeks.
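One possible reading of the lifecycle rule (stable weekly mode → strengthened; mode shifts across ≥3 distinct actions → contradicted) is sketched below. The function name, the input shape, and the treatment of every remaining case as ACTIVE are assumptions; the page does not define how ACTIVE is assigned.

```python
from collections import Counter, defaultdict
from datetime import date

def lifecycle(observations):
    """observations: iterable of (date, action) pairs for one region's training turns."""
    by_week = defaultdict(Counter)
    for day, action in observations:
        iso = day.isocalendar()
        by_week[(iso[0], iso[1])][action] += 1  # key: (ISO year, ISO week)
    # Top action per ISO week, in chronological order.
    weekly_modes = [by_week[w].most_common(1)[0][0] for w in sorted(by_week)]
    distinct = len(set(weekly_modes))
    if distinct >= 3:
        return "CONTRADICTED"
    if distinct == 1 and len(weekly_modes) >= 2:
        return "STRENGTHENED"
    return "ACTIVE"  # assumption: fallback for everything in between

# Hypothetical weekly histories.
stable = [(date(2026, 1, 5), "Bash"), (date(2026, 1, 6), "Bash"), (date(2026, 1, 13), "Bash")]
shifting = [(date(2026, 1, 5), "Bash"), (date(2026, 1, 12), "Read"), (date(2026, 1, 19), "text_only")]
mixed = [(date(2026, 1, 5), "Bash"), (date(2026, 1, 12), "Read")]
```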

region | label | n train | n test | direction | conv | top-1 | top-3 | brier | lifecycle
R10 | data validation request | 255 | 123 | text_only | 0.27 | 76% | 90% | 0.71 | CONTRADICTED
R9 | data analysis and reporting | 394 | 92 | text_only | 0.34 | 88% | 96% | 0.58 | CONTRADICTED
R2 | task notifications | 189 | 86 | text_only | 0.51 | 12% | 97% | 0.80 | ACTIVE
R1 | technical assistance request | 80 | 43 | text_only | 0.38 | 67% | 91% | 0.61 | ACTIVE
R8 | data processing and visualization | 99 | 18 | Bash | 0.38 | 28% | 100% | 0.69 | ACTIVE
R7 | Market analysis and insights | 134 | 17 | text_only | 0.43 | 41% | 82% | 0.70 | CONTRADICTED
R0 | session continuation and update request | 125 | 17 | Bash | 0.39 | 12% | 77% | 0.92 | CONTRADICTED
R11 | progress inquiry | 25 | 10 | text_only | 0.64 | 60% | 100% | 0.63 | STRENGTHENED
R3 | design modification request | 65 | 9 | Bash | 0.61 | 67% | 89% | 0.55 | STRENGTHENED
R4 | report update request | 94 | 5 | Bash | 0.36 | 0% | 100% | 0.75 | ACTIVE
R5 | visual design adjustment | 68 | 3 | Read | 0.34 | 100% | 100% | 0.52 | CONTRADICTED
R6 | task execution request | 168 | 1 | Bash | 0.35 | 0% | 100% | 0.69 | CONTRADICTED
Top-1: predicted direction matched the actual action. Top-3: actual action was among the three highest-probability predicted actions. Brier: multiclass squared error of the predicted distribution vs the one-hot actual (lower = better). Conviction: maximum probability mass of the predicted distribution. Rows with top-1 below 20% (R2, R0, R4, R6) are predicting the wrong action in those regions. Rows with tiny test slices (n < 5) are individually unreliable.
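The three walk-forward metrics defined above can be computed along these lines. The function name `score`, the four-action toy vocabulary, and the example distribution are hypothetical. The sketch assumes each test turn has already been paired with its region's predicted distribution.

```python
def score(predictions, actuals, vocab):
    """predictions: list of {action: prob} dicts; actuals: list of observed action names."""
    n = len(actuals)
    top1 = top3 = brier = 0.0
    for probs, actual in zip(predictions, actuals):
        ranked = sorted(vocab, key=lambda a: probs[a], reverse=True)
        top1 += ranked[0] == actual
        top3 += actual in ranked[:3]
        # Multiclass Brier: squared error against the one-hot actual, summed over actions.
        brier += sum((probs[a] - (1.0 if a == actual else 0.0)) ** 2 for a in vocab)
    return {"top1": top1 / n, "top3": top3 / n, "brier": brier / n}

# Hypothetical region distribution, reused for three test turns.
vocab = ["Bash", "Read", "text_only", "other"]
pred = {"Bash": 0.5, "Read": 0.3, "text_only": 0.15, "other": 0.05}
metrics = score([pred, pred, pred], ["Bash", "text_only", "other"], vocab)
```

In this form the Brier score ranges from 0 (all mass on the correct action) to 2 (all mass on a single wrong action).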
One regional layer, two prediction targets. The same regional calibration layer supports both performance evaluation and guardrail testing. For performance, L2a predicts the next behavior / action (table above). For guardrails, L2b predicts which risk classes are likely to fire in each region (tables below). Both are scored walk-forward against held-out data.
L2b · Risk Priors: Guardrail Calibration Substrate · 4 risks · LLM judge n = 1200

Same L0/L1/L2 stack, different prediction target. Instead of forecasting which tool the assistant will use next, L2 forecasts the probability that each risk class fires in the next assistant turn — one prior per (region, risk) cell from training data, walked forward against held-out test turns. This is exactly the surface a guardrail would consume: given the user input, how likely is each known risk pattern, and where is the guardrail likely to mis-fire?
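A per-(region, risk) prior of this kind could be computed as a plain training-set frequency, as sketched below. Whether the page's priors are smoothed is not stated, so the unsmoothed rate is an assumption, as are the function name, the input shape, and the toy rows.

```python
from collections import defaultdict

RISKS = ["destructive_op", "over_engineering", "over_claiming", "unverified_claim"]

def risk_priors(train_rows, risks=RISKS):
    """train_rows: iterable of (region_id, set of risk classes that fired on the next turn)."""
    n = defaultdict(int)
    hits = defaultdict(int)
    for region, fired in train_rows:
        n[region] += 1
        for risk in fired:
            hits[(region, risk)] += 1
    # One prior per (region, risk) cell: share of the region's training turns where it fired.
    return {(region, risk): hits[(region, risk)] / n[region] for region in n for risk in risks}

# Hypothetical training rows.
rows = [
    ("R2", {"unverified_claim"}),
    ("R2", set()),
    ("R2", {"unverified_claim", "over_claiming"}),
    ("R4", set()),
]
priors = risk_priors(rows)
```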

risk class | id | tier | n test | actual rate | actual hits
Destructive operation | destructive_op | Tier 1 · regex | 425 | 0% | 0
Over-engineering | over_engineering | Tier 2 · LLM | 229 | 44% | 100
Over-claiming | over_claiming | Tier 2 · LLM | 229 | 10% | 22
Unverified claim | unverified_claim | Tier 3 · LLM | 229 | 24% | 55
Findings on this substrate:
  • destructive_op rate is 0% across 425 test turns: a real Tier 1 safety baseline (no destructive shell ops issued in the substrate).
  • over_engineering fires at 44% per LLM judge: the dominant behavioral risk on this substrate.
  • unverified_claim fires at 24%: the Tier 3 semantic risk that maps to the user’s “trust but verify” saved rule.
Where the L2 risk prior misses reality — guardrail-mis-fire candidates

Per (region, risk) cells where the training-time L2 prior is ≥ 20pp away from the held-out actual rate. Over-predicted regions are where a guardrail trained on this prior would over-fire (false positives — annoying users on safe inputs). Under-predicted regions are where the guardrail would under-fire (false negatives — missing real violations). Both surface failure modes the architecture is built to detect before the guardrail ships.

region | label | risk | kind | prior | actual | Δ | n test
R4 | [generic interaction pattern] | over_engineering | under-fire | 8% | 83% | -76pp | 24
R2 | task notifications | unverified_claim | over-fire | 53% | 5% | +48pp | 41
R3 | Content Update Request | over_engineering | under-fire | 13% | 57% | -44pp | 49
R1 | status inquiry | over_engineering | under-fire | 17% | 55% | -38pp | 71
R11 | market intelligence request | unverified_claim | over-fire | 31% | 0% | +31pp | 7
R5 | UI adjustment request | over_engineering | under-fire | 0% | 27% | -27pp | 15
R2 | task notifications | over_claiming | over-fire | 26% | 0% | +26pp | 41
R0 | session continuation and update request | over_claiming | under-fire | 3% | 25% | -22pp | 12
R11 | market intelligence request | over_claiming | over-fire | 36% | 14% | +22pp | 7
Top 9 cells by |Δ|. Tiny test slices (n < 5) excluded. Same calibration discipline as the action-prediction L4 above — same regions, same walk-forward, different prediction vocabulary.
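The selection rule for this table (|prior - actual| ≥ 20pp, n < 5 excluded, ranked by |Δ|) can be sketched as follows. The function name and dict keys are hypothetical. The first two cells reuse rounded values from the table, so their deltas can differ by a point from the published Δ. The RX and RY cells are invented to exercise the filters.

```python
def misfire_candidates(cells, min_gap_pp=20.0, min_n_test=5):
    """Flag (region, risk) cells whose training prior is far from the held-out actual rate."""
    flagged = []
    for cell in cells:
        if cell["n_test"] < min_n_test:
            continue  # tiny test slices are excluded
        delta_pp = 100.0 * (cell["prior"] - cell["actual"])
        if abs(delta_pp) >= min_gap_pp:
            # Prior above actual: a guardrail built on it would over-fire; below: under-fire.
            kind = "over-fire" if delta_pp > 0 else "under-fire"
            flagged.append({**cell, "delta_pp": delta_pp, "kind": kind})
    return sorted(flagged, key=lambda c: abs(c["delta_pp"]), reverse=True)

cells = [
    {"region": "R2", "risk": "unverified_claim", "prior": 0.53, "actual": 0.05, "n_test": 41},
    {"region": "R4", "risk": "over_engineering", "prior": 0.08, "actual": 0.83, "n_test": 24},
    {"region": "RX", "risk": "over_claiming", "prior": 0.10, "actual": 0.15, "n_test": 30},  # gap too small
    {"region": "RY", "risk": "over_claiming", "prior": 0.90, "actual": 0.00, "n_test": 3},   # slice too small
]
candidates = misfire_candidates(cells)
```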
Operational implications for evaluation and guardrails
  • Performance evaluation: regions with low top-1 / high Brier identify behavior areas where the system is unreliable. Per-region calibration is a reliability map, not just an aggregate score.
  • Guardrail calibration: over-predict cells indicate likely false positives (guardrail fires on safe inputs); under-predict cells indicate likely false negatives (guardrail misses real violations). Both are visible before the guardrail ships.
  • Monitoring: miscalibrated regions become candidates for human review, re-clustering, or targeted test generation — focus monitoring effort where the architecture says calibration is weakest.
  • Feedback: L4 calibration can improve both L2b risk priors (conviction re-weighting, sign-flip on inverted regions) and L1 region definitions (re-cluster regions where calibration is bimodal). Same closed loop the architecture is built for.
What this demonstrates

The architecture is not only a guardrail tester. It is a region-based reliability map:

  • where behavior is predictable,
  • where performance is unstable,
  • where risk priors are wrong,
  • where guardrails will over-fire or under-fire,
  • and where monitoring should focus next.

Guardrail testing is one application. Performance evaluation is another. Both run through the same L0 → L4 loop on the same regional substrate.

Honest read

This is V1, on one user’s conversation logs. The architecture works end-to-end: embed → cluster → predict → walk-forward measure, with two L2 prediction targets demonstrated (next-tool action and next-turn risk class). The aggregate action-prediction sits at baseline; the risk priors carry real signal but are miscalibrated in specific regions — same F-007 V1 shape (aggregate at chance, signal in heterogeneity).

  • Horizon = 1 turn. Predicting the immediate next assistant turn only. Multi-turn sequences would be richer.
  • Action vocabulary is structural (tool name + text_only). A semantic vocabulary would be richer but introduces classifier bias.
  • Risk vocabulary spans Tier 1–3. Tier 1 (regex) is cheap; Tier 2/3 (LLM judge) inherit the judge’s biases. The judge call succeeded on only ~40% of this batch (JSON-parse fragility); a production version would tighten the prompt and validate with structured outputs.
  • Small test slices in some regions. A region with n_test < 5 produces unreliable per-row calibration. Aggregate is built on enough data; individual region rows vary in trust.
  • k = 12 chosen, not learned. Heuristic; HDBSCAN or BIC-selected k would be more principled.
  • Self-referential substrate. One user’s sessions. Not cohort statistics.
This page ports the architecture documented at /governance to a real substrate, using the same forward-looking L2 discipline as the markets-side live instance at /architecture. Per-region sample text, embeddings, and conversation content are kept private; this page publishes only counts, predicted distributions, and walk-forward metrics.