Region-Based Evaluation & Guardrail Calibration — on Claude logs
Predictive port of the belief-revision architecture.
User turns are embedded into learned regions, then the system evaluates two forward-looking targets: expected assistant behavior and expected risk class. Both are scored on a chronological held-out set, so the page tests whether regional predictions actually generalize. The architecture is documented at /governance; the markets-side live instance is at /architecture.
- L0/L1: User turns are embedded with text-embedding-3-small and clustered into k = 12 learned regions via K-means. Each region gets an LLM-generated sanitized label.
- L2a: Behavior prediction. Per region, predict the next assistant action: direction = most-likely tool, conviction = max probability mass, horizon = 1 assistant turn. Smoothed via Laplace (α = 1.0) over the action vocabulary.
- L2b: Risk-prior prediction. Per region, predict the prior probability that each risk class fires in the next assistant turn. Same regions, different prediction vocabulary. Powers guardrail calibration.
- L3: Track each region’s top action across training ISO weeks. Stable mode → strengthened; mode shifts across ≥3 distinct actions → contradicted.
- L4: Chronological 80/20 train/test split. For each test turn, find the nearest region centroid, look up that region’s training-time predictions (both L2a and L2b), and score them against actuals. Walk-forward only: predictions never see the held-out future.
Action vocabulary: Bash, Edit, Read, Write, MultiEdit, Grep, Glob, Task, TodoWrite, WebFetch, WebSearch, ToolSearch, text_only, other.
The per-region predictions sit essentially at the cohort baseline in aggregate. Heterogeneity-within-regions is the V1 story, not aggregate lift. This mirrors the F-007 V1 finding on markets: aggregate at chance, real signal concentrated in specific regions.
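The L2a, L3, and L4 steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the page's actual implementation: the function names, the squared-Euclidean centroid distance, the mapping of unknown tools to `other`, and the `undetermined` status for regions with exactly two weekly modes are all hypothetical choices that the page does not specify.

```python
from collections import Counter
from datetime import date

# Action vocabulary as listed on the page.
ACTIONS = ["Bash", "Edit", "Read", "Write", "MultiEdit", "Grep", "Glob", "Task",
           "TodoWrite", "WebFetch", "WebSearch", "ToolSearch", "text_only", "other"]


def smoothed_distribution(observed_actions, vocab=ACTIONS, alpha=1.0):
    """Laplace-smoothed action probabilities for one region's training turns."""
    # Assumption: actions outside the vocabulary are folded into "other".
    counts = Counter(a if a in vocab else "other" for a in observed_actions)
    total = len(observed_actions) + alpha * len(vocab)
    return {a: (counts[a] + alpha) / total for a in vocab}


def l2a_prediction(observed_actions):
    """Direction = most likely action, conviction = its probability mass."""
    dist = smoothed_distribution(observed_actions)
    ranked = sorted(dist, key=dist.get, reverse=True)
    return {"direction": ranked[0], "conviction": dist[ranked[0]],
            "top3": ranked[:3], "dist": dist}


def nearest_region(embedding, centroids):
    """Index of the closest centroid (squared Euclidean distance, an assumption)."""
    def sq_dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(embedding, centroid))
    return min(range(len(centroids)), key=lambda r: sq_dist(centroids[r]))


def lifecycle_status(dated_actions):
    """dated_actions: (date, action) pairs from one region's training turns."""
    by_week = {}
    for day, action in dated_actions:
        iso = day.isocalendar()
        by_week.setdefault((iso[0], iso[1]), []).append(action)
    # Top action (mode) per ISO week, then count distinct modes across weeks.
    weekly_modes = {Counter(acts).most_common(1)[0][0] for acts in by_week.values()}
    if len(weekly_modes) == 1:
        return "strengthened"
    if len(weekly_modes) >= 3:
        return "contradicted"
    return "undetermined"  # hypothetical label: the page does not name this case
```

With α = 1.0 and a 14-action vocabulary, a region with few training turns keeps a conviction close to the uniform 1/14, which is one reason low-conviction predictions are common in sparse regions.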
Aggregate calibration sits at baseline, but per-region calibration is sharply bimodal: some regions cluster cleanly and their L2 predictions hold up walk-forward; others cluster noisily and miss. That is a useful V1 result: the architecture surfaces which regions of conversation space the system can predict and which it can’t.
- visual design adjustment: predicts Read at conv 0.34; walk-forward top-1 100% on n_test=3
- data analysis and reporting: predicts text_only at conv 0.34; walk-forward top-1 88% on n_test=92
- data validation request: predicts text_only at conv 0.27; walk-forward top-1 76% on n_test=123

- task execution request: predicts Bash at conv 0.35; walk-forward top-1 0% on n_test=1
- report update request: predicts Bash at conv 0.36; walk-forward top-1 0% on n_test=5
- task notifications: predicts text_only at conv 0.51; walk-forward top-1 12% on n_test=86
One row per learned region. Each L2 prediction (direction + conviction + top-3) was set from training-only data; the top-1 / top-3 / Brier columns are walk-forward measurements against the held-out test set. Lifecycle status is inferred from prediction stability across training weeks.
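As a rough illustration of how the top-1, top-3, and Brier columns could be computed for one region, here is a minimal sketch. The function name `score_region` is hypothetical, and the Brier score shown is the multi-class form (squared error summed over all classes, averaged over test turns); the page does not say which Brier variant it uses, so treat that as an assumption.

```python
def score_region(dist, actuals):
    """Score one region's training-time distribution against held-out actions.

    dist:    {action: probability} fixed from training data only.
    actuals: list of actions actually observed on the region's test turns.
    """
    if not actuals:
        raise ValueError("region has no test turns to score")
    ranked = sorted(dist, key=dist.get, reverse=True)
    n = len(actuals)
    top1 = sum(a == ranked[0] for a in actuals) / n
    top3 = sum(a in ranked[:3] for a in actuals) / n
    # Multi-class Brier: per turn, sum of squared gaps between the predicted
    # probability and the one-hot outcome, then averaged over turns.
    brier = sum(
        sum((p - (1.0 if a == cls else 0.0)) ** 2 for cls, p in dist.items())
        for a in actuals
    ) / n
    return {"top1": top1, "top3": top3, "brier": brier, "n_test": n}
```

Because `dist` is frozen before any test turn is seen, the resulting numbers are walk-forward measurements in the sense used above.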
Same L0/L1/L2 stack, different prediction target. Instead of forecasting which tool the assistant will use next, L2 forecasts the probability that each risk class fires in the next assistant turn — one prior per (region, risk) cell from training data, walked forward against held-out test turns. This is exactly the surface a guardrail would consume: given the user input, how likely is each known risk pattern, and where is the guardrail likely to mis-fire?
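One way to picture "one prior per (region, risk) cell" is the sketch below. It assumes the prior is simply the share of a region's training turns on which the risk fired; the page does not state whether any smoothing is applied to risk priors, and the function name and input shape are hypothetical. The three risk class names are the ones reported on this page.

```python
from collections import defaultdict

# Risk classes named on the page; the full vocabulary may be larger.
RISK_CLASSES = ["destructive_op", "over_engineering", "unverified_claim"]


def risk_priors(train_turns, risk_classes=RISK_CLASSES):
    """train_turns: iterable of (region_id, set_of_risks_that_fired) pairs.

    Returns {(region_id, risk): prior}, using the raw training firing rate
    (an assumption: no smoothing).
    """
    totals = defaultdict(int)
    fired = defaultdict(int)
    for region, risks in train_turns:
        totals[region] += 1
        for risk in risks:
            fired[(region, risk)] += 1
    return {
        (region, risk): fired[(region, risk)] / totals[region]
        for region in totals
        for risk in risk_classes
    }
```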
- destructive_op rate is 0% across 425 test turns: a real Tier 1 safety baseline (no destructive shell ops issued in the substrate).
- over_engineering fires at 44% per LLM judge: the dominant behavioral risk on this substrate.
- unverified_claim fires at 24%: the Tier 3 semantic risk that maps to the user’s “trust but verify” saved rule.
Miscalibrated cells are the (region, risk) cells where the training-time L2 prior is ≥ 20pp away from the held-out actual rate. Over-predicted cells are where a guardrail trained on this prior would over-fire (false positives that annoy users on safe inputs). Under-predicted cells are where the guardrail would under-fire (false negatives that miss real violations). Both surface failure modes the architecture is built to detect before the guardrail ships.
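The ≥ 20pp rule can be sketched as follows. The function name, the input dictionaries, and the example figures in the usage note are hypothetical and not taken from the page's data; only the 20-percentage-point threshold and the over/under-predicted distinction come from the text above.

```python
def flag_miscalibrated(priors, actual_rates, threshold_pp=20.0):
    """Flag (region, risk) cells whose training prior misses the held-out rate.

    priors, actual_rates: {(region, risk): rate in [0, 1]}.
    Returns a list of (cell, kind, gap_pp), where gap_pp = prior - actual
    in percentage points.
    """
    flagged = []
    for cell, prior in priors.items():
        if cell not in actual_rates:
            continue  # no held-out turns for this cell, nothing to compare
        # Round to avoid float noise right at the threshold boundary.
        gap_pp = round((prior - actual_rates[cell]) * 100, 6)
        if abs(gap_pp) >= threshold_pp:
            kind = "over_predicted" if gap_pp > 0 else "under_predicted"
            flagged.append((cell, kind, gap_pp))
    return flagged
```

For example, a hypothetical cell with a training prior of 0.70 and a held-out rate of 0.40 would be flagged as over-predicted by 30pp, which is where a guardrail consuming that prior would be expected to over-fire.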
- Performance evaluation: regions with low top-1 / high Brier identify behavior areas where the system is unreliable. Per-region calibration is a reliability map, not just an aggregate score.
- Guardrail calibration: over-predict cells indicate likely false positives (guardrail fires on safe inputs); under-predict cells indicate likely false negatives (guardrail misses real violations). Both are visible before the guardrail ships.
- Monitoring: miscalibrated regions become candidates for human review, re-clustering, or targeted test generation — focus monitoring effort where the architecture says calibration is weakest.
- Feedback: L4 calibration can improve both L2b risk priors (conviction re-weighting, sign-flip on inverted regions) and L1 region definitions (re-cluster regions where calibration is bimodal). Same closed loop the architecture is built for.
The architecture is not only a guardrail tester. It is a region-based reliability map:
- where behavior is predictable,
- where performance is unstable,
- where risk priors are wrong,
- where guardrails will over-fire or under-fire,
- and where monitoring should focus next.
Guardrail testing is one application. Performance evaluation is another. Both run through the same L0 → L4 loop on the same regional substrate.
This is V1, on one user’s conversation logs. The architecture works end-to-end: embed → cluster → predict → walk-forward measure, with two L2 prediction targets demonstrated (next-tool action and next-turn risk class). The aggregate action-prediction sits at baseline; the risk priors carry real signal but are miscalibrated in specific regions — same F-007 V1 shape (aggregate at chance, signal in heterogeneity).
- Horizon = 1 turn. Predicting the immediate next assistant turn only. Multi-turn sequences would be richer.
- Action vocabulary is structural (tool name + text_only). A semantic vocabulary would be richer but introduces classifier bias.
- Risk vocabulary spans Tier 1–3. Tier 1 (regex) is cheap; Tier 2/3 (LLM judge) inherit the judge’s biases. The judge succeeded on only ~40% of calls in this batch (JSON-parse fragility); a production version would tighten the prompt and validate with structured outputs.
- Small test slices in some regions. A region with n_test < 5 produces unreliable per-row calibration. Aggregate is built on enough data; individual region rows vary in trust.
- k = 12 chosen, not learned. Heuristic; HDBSCAN or BIC-selected k would be more principled.
- Self-referential substrate. One user’s sessions. Not cohort statistics.
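To make the small-slice limitation concrete, a confidence interval on a region's top-1 rate shows how little a handful of test turns pins down. The sketch below uses the standard Wilson score interval; applying it here is an illustration added for this point, not something the page reports, and the function name is hypothetical.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

A 100% top-1 on n_test=3 has a lower bound of roughly 0.44 under this interval, so it is consistent with a much weaker underlying hit rate, while a rate near 88% on n_test=92 yields a far narrower interval.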