EXAMPLE · REGION-BASED EVALUATION

Region-Based Evaluation & Guardrail Calibration — on Claude logs

Predictive port of the belief-revision architecture.

User turns are embedded into learned regions, then the system evaluates two forward-looking targets: expected assistant behavior and expected risk class. Both are scored on a chronological held-out set, so the page tests whether regional predictions actually generalize. The architecture is documented at /governance; the markets-side live instance is at /architecture.

Method

L0 · Every (user turn → next assistant turn) pair becomes one observation. 2,120 pairs total.
L1 · User turns embedded with text-embedding-3-small, clustered into k = 12 learned regions via K-means. Each region gets an LLM-generated sanitized label.
L2a · Behavior prediction. Per region, predict the next assistant action: direction = most-likely action, conviction = max probability mass, horizon = 1 assistant turn. Smoothed via Laplace (α = 1.0) over the action vocabulary.
L2b · Risk-prior prediction. Per region, predict the prior probability that each risk class fires in the next assistant turn: same regions, different prediction vocabulary. Powers guardrail calibration.
L3 · Track each region’s top action across training ISO weeks. Stable mode → strengthened; mode shifts across ≥3 distinct actions → contradicted.
L4 · Chronological 80/20 train/test. For each test turn, find the nearest region centroid, look up that region’s training-time predictions (both L2a and L2b), score against actuals. Walk-forward only: predictions never see the held-out future.
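The L4 step can be sketched in a few lines. This is a minimal illustration, not the page's actual pipeline: the observation shape (a dict with a `ts` timestamp), the function names, and the use of Euclidean distance for the centroid lookup are all assumptions.

```python
import math

def chronological_split(observations, train_frac=0.8):
    """Sort by timestamp and cut once, so every test turn is later than every training turn.

    Assumes each observation is a dict with a sortable "ts" field (hypothetical shape).
    """
    ordered = sorted(observations, key=lambda o: o["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def nearest_region(embedding, centroids):
    """Return the index of the centroid closest to the embedding.

    Euclidean distance is assumed here because it is the distance K-means minimizes.
    """
    return min(range(len(centroids)), key=lambda r: math.dist(embedding, centroids[r]))
```

With the 2,120 pairs reported here, this cut yields 1,696 training and 424 test observations, which matches the 424 test turns reported on this page. Centroids would be fit on the training slice only, so a test turn is assigned to a region without any held-out data influencing the region definitions.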

Action vocabulary: Bash, Edit, Read, Write, MultiEdit, Grep, Glob, Task, TodoWrite, WebFetch, WebSearch, ToolSearch, text_only, other.
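A hedged sketch of the L2a prediction for a single region, using Laplace smoothing with α = 1.0 over the 14-action vocabulary. The function name, the return shape, and the choice to fold unknown tool names into `other` are assumptions made for illustration.

```python
from collections import Counter

ACTIONS = ["Bash", "Edit", "Read", "Write", "MultiEdit", "Grep", "Glob", "Task",
           "TodoWrite", "WebFetch", "WebSearch", "ToolSearch", "text_only", "other"]

def predict_region(train_actions, alpha=1.0, vocab=ACTIONS):
    """Laplace-smoothed next-action distribution for one region's training turns."""
    # Assumption: anything outside the vocabulary is counted as "other".
    counts = Counter(a if a in vocab else "other" for a in train_actions)
    total = len(train_actions) + alpha * len(vocab)
    dist = {a: (counts[a] + alpha) / total for a in vocab}
    ranked = sorted(vocab, key=lambda a: dist[a], reverse=True)
    return {
        "direction": ranked[0],          # most-likely action
        "conviction": dist[ranked[0]],   # max probability mass
        "top3": ranked[:3],
        "dist": dist,
    }
```

Smoothing gives every action a nonzero probability even if the region never produced it in training, which keeps the Brier score finite and well-behaved on rare actions. It also pulls conviction toward 1/14 in regions with few training turns.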

Overall — held-out walk-forward · 424 test turns

top-1: 57%
top-3: 93%
brier: 0.692
baseline top-1: 60% (predict "text_only")
lift vs baseline: -2.8pp

In aggregate, the per-region predictions land slightly below the always-text_only baseline (57% vs 60% top-1, a lift of -2.8pp). Heterogeneity within regions is the V1 story, not aggregate lift. This mirrors the F-007 V1 finding on markets: aggregate at chance, real signal concentrated in specific regions.
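For reference, the baseline comparison can be sketched as follows. This assumes the baseline is "always predict the most frequent training action" (which the page reports as text_only); the function name and argument shapes are hypothetical.

```python
from collections import Counter

def majority_baseline_lift(train_actions, test_actions, model_hits):
    """Compare model top-1 against an always-predict-the-majority baseline.

    model_hits is the number of test turns where the regional prediction was correct.
    Returns (majority action, baseline top-1, model top-1, lift in percentage points).
    """
    # Assumption: the majority class is chosen on training data only.
    majority = Counter(train_actions).most_common(1)[0][0]
    n = len(test_actions)
    baseline_top1 = sum(a == majority for a in test_actions) / n
    model_top1 = model_hits / n
    return majority, baseline_top1, model_top1, (model_top1 - baseline_top1) * 100
```

A negative lift means the regional predictions were correct less often than the constant guess, which is the situation reported here.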

Heterogeneity within regions — the V1 finding

Aggregate calibration sits at baseline. But per-region calibration is sharply bimodal: some regions cluster cleanly and their L2 predictions hit walk-forward; others cluster noisily and miss. This is the right kind of V1 result — the architecture surfaces which regions of conversation space the system can predict, and which it can’t.

regions where predictions hit

visual design adjustment — predicts Read at conv 0.34; walk-forward top-1 100% on n_test=3
data analysis and reporting — predicts text_only at conv 0.34; walk-forward top-1 88% on n_test=92
data validation request — predicts text_only at conv 0.27; walk-forward top-1 76% on n_test=123

regions where predictions miss

task execution request — predicts Bash at conv 0.35; walk-forward top-1 0% on n_test=1
report update request — predicts Bash at conv 0.36; walk-forward top-1 0% on n_test=5
task notifications — predicts text_only at conv 0.51; walk-forward top-1 12% on n_test=86

L1 + L2a + L4 · Per-region behavior calibration: predictions + walk-forward · 12 regions

One row per learned region. Each L2 prediction (direction + conviction + top-3) was set from training-only data; the top-1 / top-3 / Brier columns are walk-forward measurements against the held-out test set. Lifecycle status is inferred from prediction stability across training weeks.

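| region | label | n train | n test | direction | conv | top-1 | top-3 | brier | lifecycle |
|---|---|---|---|---|---|---|---|---|---|
| R10 | data validation request | 255 | 123 | text_only | 0.27 | 76% | 90% | 0.71 | CONTRADICTED |
| R9 | data analysis and reporting | 394 | 92 | text_only | 0.34 | 88% | 96% | 0.58 | CONTRADICTED |
| R2 | task notifications | 189 | 86 | text_only | 0.51 | 12% | 97% | 0.80 | ACTIVE |
| R1 | technical assistance request | 80 | 43 | text_only | 0.38 | 67% | 91% | 0.61 | ACTIVE |
| R8 | data processing and visualization | 99 | 18 | Bash | 0.38 | 28% | 100% | 0.69 | ACTIVE |
| R7 | Market analysis and insights | 134 | 17 | text_only | 0.43 | 41% | 82% | 0.70 | CONTRADICTED |
| R0 | session continuation and update request | 125 | 17 | Bash | 0.39 | 12% | 77% | 0.92 | CONTRADICTED |
| R11 | progress inquiry | 25 | 10 | text_only | 0.64 | 60% | 100% | 0.63 | STRENGTHENED |
| R3 | design modification request | 65 | 9 | Bash | 0.61 | 67% | 89% | 0.55 | STRENGTHENED |
| R4 | report update request | 94 | 5 | Bash | 0.36 | 0% | 100% | 0.75 | ACTIVE |
| R5 | visual design adjustment | 68 | 3 | Read | 0.34 | 100% | 100% | 0.52 | CONTRADICTED |
| R6 | task execution request | 168 | 1 | Bash | 0.35 | 0% | 100% | 0.69 | CONTRADICTED |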

Top-1: predicted direction matched the actual action. Top-3: actual action was among the three most probable predicted actions. Brier: multiclass squared error of the predicted distribution against the one-hot actual (lower is better). Conviction: max-probability mass of the predicted distribution. Rows with top-1 below 20% (tinted red on the page) are regions where the predicted direction is usually wrong. Rows with tiny test slices (n < 5) are individually unreliable.
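The three metrics defined above can be computed per region roughly like this. It is a sketch under assumptions: `dist` is a dict mapping every action in the vocabulary to its predicted probability, and the function names are illustrative.

```python
def score_turn(dist, actual, k=3):
    """Score one held-out turn against a region's predicted distribution.

    Assumes `actual` is a key of `dist` (the distribution covers the full vocabulary).
    """
    ranked = sorted(dist, key=dist.get, reverse=True)
    top1 = ranked[0] == actual
    topk = actual in ranked[:k]
    # Multiclass Brier: squared error against the one-hot actual, summed over classes.
    brier = sum((p - (1.0 if a == actual else 0.0)) ** 2 for a, p in dist.items())
    return top1, topk, brier

def score_region(dist, actuals):
    """Average top-1, top-3 and Brier over a region's held-out turns."""
    rows = [score_turn(dist, a) for a in actuals]
    n = len(rows)
    return {
        "top1": sum(r[0] for r in rows) / n,
        "top3": sum(r[1] for r in rows) / n,
        "brier": sum(r[2] for r in rows) / n,
    }
```

In this summed form the multiclass Brier score ranges from 0 (all mass on the correct action) to 2 (all mass on a wrong action), so values around 0.5 to 0.9 correspond to fairly diffuse distributions.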

One regional layer, two prediction targets. The same regional calibration layer supports both performance evaluation and guardrail testing. For performance, L2a predicts the next behavior / action (table above). For guardrails, L2b predicts which risk classes are likely to fire in each region (tables below). Both are scored walk-forward against held-out data.

L2b · Risk Priors: Guardrail Calibration Substrate · 4 risks · LLM judge n = 1200

Same L0/L1/L2 stack, different prediction target. Instead of forecasting which tool the assistant will use next, L2 forecasts the probability that each risk class fires in the next assistant turn — one prior per (region, risk) cell from training data, walked forward against held-out test turns. This is exactly the surface a guardrail would consume: given the user input, how likely is each known risk pattern, and where is the guardrail likely to mis-fire?

| risk class | id | tier | n test | actual rate | actual hits |
|---|---|---|---|---|---|
| Destructive operation | destructive_op | Tier 1 · regex | 425 | 0% | 0 |
| Over-engineering | over_engineering | Tier 2 · LLM | 229 | 44% | 100 |
| Over-claiming | over_claiming | Tier 2 · LLM | 229 | 10% | 22 |
| Unverified claim | unverified_claim | Tier 3 · LLM | 229 | 24% | 55 |

Findings on this substrate: destructive_op rate is 0% across 425 test turns — a real Tier 1 safety baseline (no destructive shell ops issued in the substrate). over_engineering fires at 44% per LLM judge — the dominant behavioral risk on this substrate. unverified_claim fires at 24% — the Tier 3 semantic risk that maps to the user’s “trust but verify” saved rule.
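A minimal sketch of how one prior per (region, risk) cell could be estimated from training turns. The page does not say whether L2b priors are smoothed, so this uses the raw firing rate; that choice, the input shape, and the function name are assumptions.

```python
from collections import defaultdict

def risk_priors(train_rows, risks):
    """Estimate P(risk fires | region) from training turns.

    train_rows: iterable of (region_id, set of risk ids that fired on that turn).
    Assumption: unsmoothed rate; a production version might add a smoothing term.
    """
    n = defaultdict(int)
    hits = defaultdict(int)
    for region, fired in train_rows:
        n[region] += 1
        for risk in fired:
            hits[(region, risk)] += 1
    return {(region, risk): hits[(region, risk)] / n[region]
            for region in n for risk in risks}
```

Each cell is then compared against the same region's held-out firing rate, exactly as the action predictions are compared against held-out actions.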

Where the L2 risk prior misses reality — guardrail-mis-fire candidates

Per (region, risk) cells where the training-time L2 prior is ≥ 20pp away from the held-out actual rate. Over-predicted regions are where a guardrail trained on this prior would over-fire (false positives — annoying users on safe inputs). Under-predicted regions are where the guardrail would under-fire (false negatives — missing real violations). Both surface failure modes the architecture is built to detect before the guardrail ships.

| region | label | risk | kind | prior | actual | Δ | n test |
|---|---|---|---|---|---|---|---|
| R4 | [generic interaction pattern] | over_engineering | under-fire | 8% | 83% | -76pp | 24 |
| R2 | task notifications | unverified_claim | over-fire | 53% | 5% | +48pp | 41 |
| R3 | Content Update Request | over_engineering | under-fire | 13% | 57% | -44pp | 49 |
| R1 | status inquiry | over_engineering | under-fire | 17% | 55% | -38pp | 71 |
| R11 | market intelligence request | unverified_claim | over-fire | 31% | 0% | +31pp | 7 |
| R5 | UI adjustment request | over_engineering | under-fire | 0% | 27% | -27pp | 15 |
| R2 | task notifications | over_claiming | over-fire | 26% | 0% | +26pp | 41 |
| R0 | session continuation and update request | over_claiming | under-fire | 3% | 25% | -22pp | 12 |
| R11 | market intelligence request | over_claiming | over-fire | 36% | 14% | +22pp | 7 |

Top 9 cells by |Δ|. Tiny test slices (n < 5) excluded. Same calibration discipline as the action-prediction L4 above — same regions, same walk-forward, different prediction vocabulary.
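The selection rule for this table (gap of at least 20pp, n ≥ 5, sorted by |Δ|) can be expressed as a short filter. The cell shape and function name are illustrative assumptions; priors and actual rates are taken as fractions in [0, 1].

```python
def misfire_candidates(cells, min_gap_pp=20.0, min_n=5):
    """Flag (region, risk) cells whose training prior is far from the held-out rate.

    cells: iterable of dicts with keys region, risk, prior, actual, n_test.
    """
    flagged = []
    for c in cells:
        gap_pp = (c["prior"] - c["actual"]) * 100
        if c["n_test"] < min_n or abs(gap_pp) < min_gap_pp:
            continue  # too little held-out data, or prior close enough to reality
        # Prior above actual: a guardrail built on it fires too often.
        kind = "over-fire" if gap_pp > 0 else "under-fire"
        flagged.append({**c, "kind": kind, "gap_pp": gap_pp})
    return sorted(flagged, key=lambda c: abs(c["gap_pp"]), reverse=True)
```

The sign convention follows the table: Δ = prior minus actual, so positive gaps are over-fire candidates and negative gaps are under-fire candidates.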

Operational implications for evaluation and guardrails

Performance evaluation: regions with low top-1 / high Brier identify behavior areas where the system is unreliable. Per-region calibration is a reliability map, not just an aggregate score.
Guardrail calibration: over-predict cells indicate likely false positives (guardrail fires on safe inputs); under-predict cells indicate likely false negatives (guardrail misses real violations). Both are visible before the guardrail ships.
Monitoring: miscalibrated regions become candidates for human review, re-clustering, or targeted test generation — focus monitoring effort where the architecture says calibration is weakest.
Feedback: L4 calibration can improve both L2b risk priors (conviction re-weighting, sign-flip on inverted regions) and L1 region definitions (re-cluster regions where calibration is bimodal). Same closed loop the architecture is built for.

What this demonstrates

The architecture is not only a guardrail tester. It is a region-based reliability map:

where behavior is predictable,
where performance is unstable,
where risk priors are wrong,
where guardrails will over-fire or under-fire,
and where monitoring should focus next.

Guardrail testing is one application. Performance evaluation is another. Both run through the same L0 → L4 loop on the same regional substrate.

Honest read

This is V1, on one user’s conversation logs. The architecture works end-to-end: embed → cluster → predict → walk-forward measure, with two L2 prediction targets demonstrated (next-tool action and next-turn risk class). The aggregate action-prediction sits at baseline; the risk priors carry real signal but are miscalibrated in specific regions — same F-007 V1 shape (aggregate at chance, signal in heterogeneity).

Horizon = 1 turn. Predicting the immediate next assistant turn only. Multi-turn sequences would be richer.
Action vocabulary is structural (tool name + text_only). A semantic vocabulary would be richer but introduces classifier bias.
Risk vocabulary spans Tier 1–3. Tier 1 (regex) is cheap; Tier 2/3 (LLM judge) inherit the judge’s biases. The judge returned a usable verdict on only ~40% of this batch (JSON-parse fragility); a production version would tighten the prompt and validate with structured outputs.
Small test slices in some regions. A region with n_test < 5 produces unreliable per-row calibration. Aggregate is built on enough data; individual region rows vary in trust.
k = 12 chosen, not learned. Heuristic; HDBSCAN or BIC-selected k would be more principled.
Self-referential substrate. One user’s sessions. Not cohort statistics.

This page ports the architecture documented at /governance to a real substrate, using the same forward-looking L2 discipline as the markets-side live instance at /architecture. Per-region sample text, embeddings, and conversation content are kept private; this page publishes only counts, predicted distributions, and walk-forward metrics.