EXP-032B: From Fail-Closed Blocking to Reproducible PASS/BLOCK Separation
2026-02-24
admin
A root-cause repair and re-measurement study (observer-shadow scope) with anti-leakage checks, replay drift comparison, and artifact-first reporting.
## How To Read This

If our internal module names are unfamiliar, read the system as five roles: structure validation stack, reasoning agents, arbitration layer, clinical governance gate, and audit trail. I will mention internal names only after each role is clear.

## Why EXP-032B Exists (and How It Connects to Earlier Posts)

This article reports a scoped experimental result: EXP-032B (RCA/Fix + observer-shadow validation). This work follows three earlier threads:

- Chaos Engineering for AI: Validating a Fail-Closed Pipeline with Fake Data and Math, where we showed the pipeline can safely fail (BLOCK) under synthetic garbage inputs.
- From 97% Model Accuracy to 74% Clinical Reliability: Building RSN-NNSL-GATE-001, where we framed the governance problem as an end-to-end reliability problem, not a single-model accuracy problem.
- Trinity Protocol Part 2: When Adding Chai-1 and Boltz-2 Exposed Hidden Model Disagreement, where we showed that model disagreement is often signal, not noise.

Those experiments answered whether the system can fail safely and whether it can detect disagreement. They did not answer the next practical question: can the pipeline separate pass-worthy vs block-worthy cases reproducibly? That is the core question of EXP-032B.

## The Core Experimental Question

Fail-closed behavior is necessary, but not sufficient. A pipeline that blocks everything may be safe in one sense, but it is not usable. Concretely: can we block BLOCK_EXPECTED samples, pass PASS_ELIGIBLE samples, and show the result with reproducible artifacts (not just a narrative)? We used a labeled control setup and executed arm_a / arm_b / arm_c to test cross-arm consistency.

## What Changed in EXP-032B (High Level)

This was not a threshold-tuning exercise. It was a root-cause repair plus re-measurement experiment. Instead of lowering thresholds until a PASS appeared, we identified why PASS rows were being blocked, patched the specific cause, re-ran and re-measured, and repeated until pre-defined checks were satisfied for this scope. This distinction matters because it keeps failure attribution specific and the audit trail reconstructable.

## Minimal Architecture (Role First, Internal Names Second)

Figure 1: Role-first architecture diagram

## 1) Structure Validation Stack

Multiple structure/model outputs are treated as cross-checking hypotheses, not a single source of truth.

## 2) Reasoning Agents (3 independent channels)

Three independent biomedical reasoning agents run in parallel.

## 3) Arbitration Layer (internal: LawBinder)

This layer monitors disagreement and either synthesizes or escalates.

## 4) Clinical Governance Gate (internal: CCGE)

This is the formal governance module based on the RSN-NNSL-GATE-001 line of work, originally designed by Claire Hast (Founder, H3R.Tech). It evaluates component floors, end-to-end reliability (p_e2e), and governance conditions.

## 5) Structural Skepticism Lens (internal: Sydney Lens)

An observer lens used to preserve expert-style skepticism around disagreement and uncertainty. It is not a ground-truth oracle.
The framing of this lens was inspired by the scientific rigor and domain skepticism of Sydney Gordon (Principal Scientist, Antibody & ADC Sciences, Immunome).

## What We Actually Repaired (Root-Cause Sequence)

Table 1: Root-cause patch map (columns: Layer | Failure symptom | Patch action | Observed effect in RCA loop | Remaining risk)

We found and patched multiple real causes of false blocking.

## A. Upstream evidence/provenance mismatches

PASS evaluations could be assessed with poorly aligned evidence artifacts. Measured impact in the RCA loop: the patch removed a major source of artificial false blocking in PASS rows and made downstream governance failures interpretable as real component/gate issues instead of pairing noise.

## B. Missing-link candidate generation/ranking bottlenecks

An upstream inference path was collapsing to zero candidates under realistic settings. Measured impact in the RCA loop: the patch moved the upstream candidate/rank path from ranked=0 collapse to non-zero ranked outputs in probe runs and enabled real evidence injection into downstream observer-shadow validation.

## C. Bio-domain signal path suitability (NNSL path)

A bio-domain signal path was effectively acting like a toy mapping and behaved poorly for protein-sequence inputs. Measured impact in the RCA loop: the patch eliminated pathological signal collapse behavior in the bio-domain path and restored reproducible PASS/BLOCK separation without relying on ad-hoc CLI overrides.

## D. Governance/arbitration observability

Some signals were over-compressed or difficult to inspect downstream. Measured impact in the RCA loop: the patch made disagreement routing inspectable (soft-discord vs standard escalation), enabled non-binding shadow validation with leakage checks and replay drift comparisons, and made escalation categories operationally meaningful in observer mode.

## Causal Bridge: How These Repairs Relate to the Final PASS/BLOCK Result

The final control-set PASS/BLOCK separation should not be read as the effect of any single patch. In this experiment, the repairs played different roles: A removed artificial blocking noise and made downstream failures attributable; B restored usable upstream evidence flow for observer-shadow evaluation; C removed unstable signal behavior and eliminated dependence on ad-hoc overrides; D made routing, leakage, and drift observable enough to trust the measured result. So the final metric outcome (balanced_accuracy = 1.0 on this control set) is best read as a stack-level repaired behavior, not a single-component win.

## The Main Result (Measured, Scoped)

## Reproducible PASS/BLOCK Separation on the Labeled Control Set (A/B/C)

Under EXP-032B observer-shadow conditions, we reproduced PASS/BLOCK separation across arm_a, arm_b, and arm_c.

Control-set size (important context): n=2 labeled samples (1 PASS_ELIGIBLE, 1 BLOCK_EXPECTED), giving 6 arm-level observations total. This is a control-set validation result, not a generalization estimate.

Cross-arm consistency (A/B/C): the PASS sample remained PASS and the BLOCK sample remained BLOCK in all three arms, with no arm-specific flip in this measured set.

Measured results (scoped setup): dangerous_pass_rate = 0.0, false_reject_rate = 0.0, balanced_accuracy = 1.0.

Labeling / evaluation context (important for interpreting the perfect score): labels were pre-registered before reruns, and the control-set manifest was used as the evaluation source of truth (artifact-first workflow). This is a control-set reproducibility result, not a train/test generalization claim.

This is the central result of EXP-032B.
It is a scoped validation result under observer-shadow conditions.

## What We Learned About 3-Agent Disagreement (Important Correction)

A simple reading might say: AATS and HRPO-X both look relatively high while IRF is the stricter (lower-scoring) signal, therefore IRF appears to be the main dissenter. Our measurements did not support that simplification. What we observed on the measured set: LawBinder escalated all rows as discord-only (soft-discord); HRPO-X was not the outlier; it was often close to the AATS/IRF score geometry mean, yet it still received top fallback weight under conflict handling. In score-gap terms, diff_aats_irf was the largest gap, while hrpo_vs_aats_irf_mean remained small on this control set.

That shifted the interpretation: the main issue is not "HRPO-X is rogue"; the issue is how disagreement (discord) is computed and consumed.

## What the Data Showed About HRPO-X (Observer Mode)

Based on the measured score geometry and arbitration outputs in this control set, HRPO-X is better modeled as a structured critic/adversarial signal in observer mode than as a simple outlier vote. This lets disagreement remain visible without forcing every soft-discord case into the same interpretation. We added a non-binding observer shadow layer with two shadow verdicts: SHADOW_SOFT_ESCALATE_BOUNDED and SHADOW_STANDARD_ESCALATE. In the measured EXP-032B set, PASS rows mapped to bounded soft escalation (observer-only hint), BLOCK rows mapped to standard escalation, and there was no false bounded-escalation on block rows. This is an observer interpretation layer, not a production policy switch.

Figure 2: Critic-channel shadow routing

## Why This Result Is Inspectable (Not Just a Metric)

EXP-032B was designed to make the result explainable, not only reducible to headline metrics.

## 1) Non-binding invariant checks

We verify that shadow outputs do not overwrite operational verdict fields.

## 2) Dual-record disagreement metrics

We record two discord paths side by side: a normalized path and a rawtext-stable comparison path. This makes disagreement metric drift observable.

## 3) Replay drift comparisons

We compare runs across versions to detect behavioral and metric drift after patches.

## 4) Legacy carry-over contract

We preserved key evidence/reporting fields from earlier chaos experiments and checked them explicitly. This ensures improved results do not come at the cost of reduced transparency.

## Educational Code (Sanitized, IP-Safe)

Below are simplified educational snippets that reflect the validation patterns used in EXP-032B. These are not production implementations; they are included to make the logic auditable and easier to review. Intentionally simplified, they still show the central idea of EXP-032B: separate operational verdicts from observer shadow logic, make drift measurable, and keep the validation logic inspectable. The next section shows a sanitized payload example so readers can map these educational snippets to the actual field structure used in the experiment artifacts.
In particular: check_non_binding_invariant() maps to governance_status.* and critic_channel_shadow_assessment.*, and compare_disagreement_snapshots() maps to the disagreement fields exposed under lawbinder_signal_snapshot.*.

Table 2: Educational code vs production role mapping

## Sanitized Real Artifact Example (Output, Not Pseudocode)

To reduce ambiguity around internal names, here is a sanitized example of the kind of payload fields we actually inspect in EXP-032B (values shown are representative of the measured control-set runs). What this block is intended to show: the internal component names are actual payload fields, not presentation labels; operational verdicts and observer shadow hints are explicitly separated; and disagreement drift observability (dual-record) is recorded alongside the decision output.

## Where CCGE Fit in This Experiment (Practical Use Case)

In a previous post, I described the governance gate conceptually (RSN-NNSL-GATE-001). In EXP-032B, the formal module implementation (CCGE, CareChainGovernanceEngine) was used in a real observer-shadow workflow: component floors, p_e2e structure, blocker tracing during RCA iterations, and pass/block explanation support. This was useful because it separated reasoning disagreement from governance-level reliability failures. That separation made the patches more precise.

## Sydney Lens in Practice (Why It Matters Here)

As introduced in the architecture section, Sydney Lens is the observer lens we used to keep expert-style skepticism visible while repairing and re-measuring PASS/BLOCK behavior. In this experiment, its practical role was simple: do not treat disagreement as noise just because one score looks high, and preserve bounded-validation routing context while avoiding premature confidence.

## What This Article Does Not Claim

This article reports completion of EXP-032B (RCA/Fix + observer-shadow validation), not final production closure of EXP-032. Still unresolved / deferred: final frozen-track closure (EXP-032A), strict L3 grounding requirements, production arbitration alignment (LawBinder still escalates in these rows), and canonical disagreement metric selection (we are in a dual-record observation period).

## GitHub Release (Artifacts + Sanitized Code)

Within a couple of days of publication, we will publish the measured JSON artifacts (selected and organized), a sanitized, IP-safe educational code subset, and reproducibility-oriented reporting/check scripts. The release will preserve reproducibility and decision traces while excluding IP-sensitive implementation details and environment-specific secrets. Publication discipline: if the release slips past the target window, we will update this post with a dated status note rather than leaving the timeline ambiguous.

## Why This Matters

The most important result of EXP-032B is not only the PASS/BLOCK split. It is that we can now show, with artifacts, what changed, why it changed, what remained unchanged, and what is still unresolved. That is a stronger foundation than a single headline metric. As in the earlier Trinity work, model disagreement remained signal, not noise; the difference in EXP-032B is that we could route and audit that signal without collapsing the entire result into a single opaque escalation story.

## Next: EXP-033

EXP-033 starts from this locked carryover baseline and focuses on arbitration alignment: soft vs hard escalation separation, critic-channel routing rules, disagreement metric hardening/comparison, and maintaining the same anti-leakage and replay-drift discipline. We are treating EXP-032B as a validation milestone, not a finish line.
Figure 3: EXP-033 plan ladder

If you build scientific AI systems, I would be interested in your view on this: how do you handle disagreement in multi-agent scientific reasoning without collapsing into either blind averaging or perpetual escalation?
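Before the snippets, one small illustration of the "score geometry" argument from the disagreement section: with three channel scores, the largest pairwise gap can sit between AATS and IRF while HRPO-X stays near their mean. The numbers below are invented for illustration, not measured EXP-032B values.

```python
# Invented scores chosen to illustrate the geometry described in the
# disagreement section; these are NOT measured EXP-032B values.
scores = {"aats": 0.92, "irf": 0.68, "hrpo_x": 0.79}

# Largest pairwise gap: AATS vs IRF
diff_aats_irf = abs(scores["aats"] - scores["irf"])
# HRPO-X distance from the AATS/IRF mean stays small
hrpo_vs_aats_irf_mean = abs(scores["hrpo_x"] - (scores["aats"] + scores["irf"]) / 2.0)

print(round(diff_aats_irf, 3))          # 0.24
print(round(hrpo_vs_aats_irf_mean, 3))  # 0.01
```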
## 1) Labeled PASS/BLOCK Benchmark (Control-Set Evaluation)

```python
from dataclasses import dataclass


@dataclass
class Row:
    expected_verdict: str        # PASS_ELIGIBLE | BLOCK_EXPECTED
    actual_clinical_status: str  # PASS | BLOCK


def binary_verdict(row: Row) -> str:
    return "PASS_ELIGIBLE" if row.actual_clinical_status == "PASS" else "BLOCK_EXPECTED"


def evaluate_rows(rows: list[Row]) -> dict:
    n_pass = sum(r.expected_verdict == "PASS_ELIGIBLE" for r in rows)
    n_block = sum(r.expected_verdict == "BLOCK_EXPECTED" for r in rows)

    fp_dangerous_pass = sum(
        r.expected_verdict == "BLOCK_EXPECTED" and binary_verdict(r) == "PASS_ELIGIBLE"
        for r in rows
    )
    fn_false_reject = sum(
        r.expected_verdict == "PASS_ELIGIBLE" and binary_verdict(r) == "BLOCK_EXPECTED"
        for r in rows
    )
    tp_pass = sum(
        r.expected_verdict == "PASS_ELIGIBLE" and binary_verdict(r) == "PASS_ELIGIBLE"
        for r in rows
    )
    tn_block = sum(
        r.expected_verdict == "BLOCK_EXPECTED" and binary_verdict(r) == "BLOCK_EXPECTED"
        for r in rows
    )

    dangerous_pass_rate = fp_dangerous_pass / n_block if n_block else None
    false_reject_rate = fn_false_reject / n_pass if n_pass else None
    pass_recall = tp_pass / n_pass if n_pass else None
    block_recall = tn_block / n_block if n_block else None
    balanced_accuracy = None
    if pass_recall is not None and block_recall is not None:
        balanced_accuracy = (pass_recall + block_recall) / 2.0

    return {
        "dangerous_pass_rate": dangerous_pass_rate,
        "false_reject_rate": false_reject_rate,
        "balanced_accuracy": balanced_accuracy,
    }
```
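As a usage sketch, here is the same metric logic reduced to its core and run on six illustrative arm-level rows shaped like the A/B/C control observations described above (these are stand-in rows using the same label scheme, not the real manifest data):

```python
from dataclasses import dataclass


@dataclass
class Row:
    expected_verdict: str        # PASS_ELIGIBLE | BLOCK_EXPECTED
    actual_clinical_status: str  # PASS | BLOCK


def binary_verdict(row: Row) -> str:
    # Same mapping as the benchmark above: operational PASS -> PASS_ELIGIBLE
    return "PASS_ELIGIBLE" if row.actual_clinical_status == "PASS" else "BLOCK_EXPECTED"


# Illustrative stand-ins for the 6 arm-level rows (arm_a/b/c per sample).
rows = [Row("PASS_ELIGIBLE", "PASS") for _ in range(3)] + \
       [Row("BLOCK_EXPECTED", "BLOCK") for _ in range(3)]

n_pass = sum(r.expected_verdict == "PASS_ELIGIBLE" for r in rows)
n_block = len(rows) - n_pass
pass_recall = sum(binary_verdict(r) == "PASS_ELIGIBLE"
                  for r in rows if r.expected_verdict == "PASS_ELIGIBLE") / n_pass
block_recall = sum(binary_verdict(r) == "BLOCK_EXPECTED"
                   for r in rows if r.expected_verdict == "BLOCK_EXPECTED") / n_block
balanced_accuracy = (pass_recall + block_recall) / 2.0
print(balanced_accuracy)  # 1.0
```

With both recalls at 1.0 the balanced accuracy is 1.0, which is the shape of the reported control-set outcome.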
## 2) Non-Binding Invariant Check (Shadow Must Not Override Operational Verdicts)

```python
def check_non_binding_invariant(payload: dict) -> list[str]:
    errors = []
    gov = payload.get("governance_status", {})
    shadow = payload.get("critic_channel_shadow_assessment", {})
    shadow_hint = gov.get("lawbinder_shadow_hint", {})

    # Shadow exists only as observer guidance
    if shadow and shadow.get("non_binding") is not True:
        errors.append("critic_channel_shadow_assessment.non_binding must be true")
    if shadow_hint and shadow_hint.get("non_binding") is not True:
        errors.append("governance_status.lawbinder_shadow_hint.non_binding must be true")

    # Operational fields remain the source of record
    if gov.get("clinical_status") not in {"PASS", "BLOCK", "CONDITIONAL"}:
        errors.append("invalid operational clinical_status")
    if gov.get("lawbinder_decision") not in {"PASS", "ESCALATE", "INHIBIT"}:
        errors.append("invalid operational lawbinder_decision")

    return errors
```
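A quick usage sketch of this check. The payloads below are toy examples (not measured artifacts), and the function is reproduced so the sketch runs standalone:

```python
def check_non_binding_invariant(payload: dict) -> list[str]:
    # Reproduced from the educational snippet above so this example is self-contained.
    errors = []
    gov = payload.get("governance_status", {})
    shadow = payload.get("critic_channel_shadow_assessment", {})
    shadow_hint = gov.get("lawbinder_shadow_hint", {})
    if shadow and shadow.get("non_binding") is not True:
        errors.append("critic_channel_shadow_assessment.non_binding must be true")
    if shadow_hint and shadow_hint.get("non_binding") is not True:
        errors.append("governance_status.lawbinder_shadow_hint.non_binding must be true")
    if gov.get("clinical_status") not in {"PASS", "BLOCK", "CONDITIONAL"}:
        errors.append("invalid operational clinical_status")
    if gov.get("lawbinder_decision") not in {"PASS", "ESCALATE", "INHIBIT"}:
        errors.append("invalid operational lawbinder_decision")
    return errors


# Toy payloads: one clean, one where the shadow layer tries to bind.
clean = {
    "governance_status": {
        "clinical_status": "PASS",
        "lawbinder_decision": "ESCALATE",
        "lawbinder_shadow_hint": {"non_binding": True},
    },
    "critic_channel_shadow_assessment": {"non_binding": True},
}
leaky = {
    "governance_status": {"clinical_status": "PASS", "lawbinder_decision": "ESCALATE"},
    "critic_channel_shadow_assessment": {"non_binding": False},
}

print(check_non_binding_invariant(clean))       # []
print(len(check_non_binding_invariant(leaky)))  # 1
```

The clean payload produces no violations; the leaky payload is flagged because its shadow block is no longer observer-only.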
## 3) Critic-Channel Shadow Rule (Observer-Only Routing Hint)

```python
def critic_channel_shadow_assessment(
    lawbinder_escalate_type: str,
    aats_irf_gap: float,
    hrpo_vs_aats_irf_mean: float,
    aats_psi_resonance: float,
    evidence_validation_passed: bool,
) -> dict:
    soft_discord_only = lawbinder_escalate_type == "ESCALATE_SOFT_DISCORD"

    # Example observer policy lock (EXP-032B B-track)
    bounded_ok = (
        soft_discord_only
        and evidence_validation_passed
        and aats_irf_gap <= 0.25
        and hrpo_vs_aats_irf_mean >= 0.015
        and aats_psi_resonance >= 0.90
    )

    if bounded_ok:
        return {
            "shadow_verdict": "SHADOW_SOFT_ESCALATE_BOUNDED",
            "observer_operational_hint": "proceed_with_bounded_validation_plan",
            "non_binding": True,
        }
    return {
        "shadow_verdict": "SHADOW_STANDARD_ESCALATE",
        "observer_operational_hint": "hold_review_only",
        "non_binding": True,
    }
```
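One way to exercise this rule. The signal values below are made-up illustrations (not measured EXP-032B values), and the function is reproduced with a condensed return payload so the example runs standalone:

```python
def critic_channel_shadow_assessment(
    lawbinder_escalate_type: str,
    aats_irf_gap: float,
    hrpo_vs_aats_irf_mean: float,
    aats_psi_resonance: float,
    evidence_validation_passed: bool,
) -> dict:
    # Reproduced (return payload condensed) from the snippet above.
    soft_discord_only = lawbinder_escalate_type == "ESCALATE_SOFT_DISCORD"
    bounded_ok = (
        soft_discord_only
        and evidence_validation_passed
        and aats_irf_gap <= 0.25
        and hrpo_vs_aats_irf_mean >= 0.015
        and aats_psi_resonance >= 0.90
    )
    verdict = "SHADOW_SOFT_ESCALATE_BOUNDED" if bounded_ok else "SHADOW_STANDARD_ESCALATE"
    return {"shadow_verdict": verdict, "non_binding": True}


# Soft discord inside all bounds -> bounded, observer-only verdict
bounded = critic_channel_shadow_assessment("ESCALATE_SOFT_DISCORD", 0.20, 0.02, 0.95, True)
# Gap too wide -> falls back to standard escalation
standard = critic_channel_shadow_assessment("ESCALATE_SOFT_DISCORD", 0.40, 0.02, 0.95, True)

print(bounded["shadow_verdict"])   # SHADOW_SOFT_ESCALATE_BOUNDED
print(standard["shadow_verdict"])  # SHADOW_STANDARD_ESCALATE
```

Note that both branches stay non-binding; the rule only changes the observer hint, never the operational verdict.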
## 4) Replay Drift Compare (Metric Drift Without Hiding Behavior)

```python
def compare_disagreement_snapshots(baseline: dict, candidate: dict) -> dict:
    b = baseline["summary"]
    c = candidate["summary"]

    def _mean(stats: dict | None) -> float | None:
        if not stats:
            return None
        return stats.get("mean")

    return {
        "baseline_rows": b.get("n_rows"),
        "candidate_rows": c.get("n_rows"),
        "discord_mean_delta": (_mean(c.get("discord_score_stats")) or 0.0)
        - (_mean(b.get("discord_score_stats")) or 0.0),
        "discord_norm_mean_delta": (_mean(c.get("discord_score_normalized_stats")) or 0.0)
        - (_mean(b.get("discord_score_normalized_stats")) or 0.0),
        "discord_raw_mean_delta": (_mean(c.get("discord_score_rawtext_stable_stats")) or 0.0)
        - (_mean(b.get("discord_score_rawtext_stable_stats")) or 0.0),
        "decision_counts_changed": c.get("lawbinder_decision_counts")
        != b.get("lawbinder_decision_counts"),
        "taxonomy_counts_changed": c.get("lawbinder_escalate_type_counts")
        != b.get("lawbinder_escalate_type_counts"),
    }
```
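A usage sketch with two hand-made snapshot summaries. The shapes follow the function above, but the numbers are invented and the comparison is condensed to two fields for brevity:

```python
def compare_disagreement_snapshots(baseline: dict, candidate: dict) -> dict:
    # Condensed slice of the drift comparison above, so this example is self-contained.
    b = baseline["summary"]
    c = candidate["summary"]

    def _mean(stats):
        return stats.get("mean") if stats else None

    return {
        "discord_mean_delta": (_mean(c.get("discord_score_stats")) or 0.0)
        - (_mean(b.get("discord_score_stats")) or 0.0),
        "decision_counts_changed": c.get("lawbinder_decision_counts")
        != b.get("lawbinder_decision_counts"),
    }


# Invented snapshot summaries: discord mean drifts, decisions do not.
baseline = {"summary": {"discord_score_stats": {"mean": 0.95},
                        "lawbinder_decision_counts": {"ESCALATE": 6}}}
candidate = {"summary": {"discord_score_stats": {"mean": 1.01},
                         "lawbinder_decision_counts": {"ESCALATE": 6}}}

drift = compare_disagreement_snapshots(baseline, candidate)
print(round(drift["discord_mean_delta"], 4))  # 0.06
print(drift["decision_counts_changed"])       # False
```

This is the dual-record idea in miniature: metric drift is surfaced even when the decision counts are unchanged.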
```json
{
  "governance_status": {
    "clinical_status": "PASS",
    "lawbinder_decision": "ESCALATE",
    "lawbinder_escalate_type": "ESCALATE_SOFT_DISCORD",
    "lawbinder_shadow_hint": {
      "non_binding": true,
      "shadow_verdict": "SHADOW_SOFT_ESCALATE_BOUNDED",
      "observer_operational_hint": "proceed_with_bounded_validation_plan"
    }
  },
  "critic_channel_shadow_assessment": {
    "enabled": true,
    "non_binding": true,
    "recommendation": "bounded_escalation_critic_channel",
    "bounded_escalation_eligible": true
  },
  "lawbinder_signal_snapshot": {
    "discord_score": 1.0089,
    "discord_score_normalized": 1.0089,
    "discord_score_rawtext_stable": 0.9583,
    "discord_score_delta": 0.0506,
    "top_weight_engine": "weight_hrpo_x",
    "aats_psi_resonance": 1.0
  }
}
```
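To connect this artifact shape to the non-binding discipline described earlier, here is a minimal standalone sketch that asserts the observer/operational separation on a payload of this shape (trimmed to the relevant fields; not the production validator):

```python
# Trimmed copy of the artifact shape shown above (observer vs operational fields).
payload = {
    "governance_status": {
        "clinical_status": "PASS",
        "lawbinder_decision": "ESCALATE",
        "lawbinder_shadow_hint": {
            "non_binding": True,
            "shadow_verdict": "SHADOW_SOFT_ESCALATE_BOUNDED",
        },
    },
    "critic_channel_shadow_assessment": {"enabled": True, "non_binding": True},
}

gov = payload["governance_status"]
shadow = payload["critic_channel_shadow_assessment"]

# Shadow layers must stay observer-only; operational fields stay authoritative.
assert shadow["non_binding"] is True
assert gov["lawbinder_shadow_hint"]["non_binding"] is True
assert gov["clinical_status"] in {"PASS", "BLOCK", "CONDITIONAL"}
assert gov["lawbinder_decision"] in {"PASS", "ESCALATE", "INHIBIT"}
print("non-binding invariant holds")
```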
- structure validation stack
- reasoning agents
- arbitration layer
- clinical governance gate
- audit trail
- Why EXP-032B Exists
- The Core Experimental Question
- What Changed in EXP-032B
- Minimal Architecture
- What We Actually Repaired (RCA)
- The Main Result (Measured, Scoped)
- 3-Agent Disagreement: What the Data Showed
- Why This Result Is Inspectable
- Educational Code (Sanitized)
- Sanitized Real Artifact Example
- CCGE in Practice
- Sydney Lens in Practice
- What This Article Does Not Claim
- GitHub Release (Artifacts + Sanitized Code)
- Why This Matters
- Next: EXP-033
- Appendix: Internal Name Map
- Chaos Engineering for AI: Validating a Fail-Closed Pipeline with Fake Data and Math: we showed the pipeline can safely fail (BLOCK) under synthetic garbage inputs.
- From 97% Model Accuracy to 74% Clinical Reliability: Building RSN-NNSL-GATE-001: we framed the governance problem as an end-to-end reliability problem, not a single-model accuracy problem.
- Trinity Protocol Part 2: When Adding Chai-1 and Boltz-2 Exposed Hidden Model Disagreement: we showed that model disagreement is often signal, not noise.
- Can the system fail safely?
- Can it detect disagreement?
- Can we block BLOCK_EXPECTED samples?
- Can we pass PASS_ELIGIBLE samples?
- Can we show the result with reproducible artifacts (not just a narrative)?
- identified why PASS rows were being blocked
- patched the specific cause
- re-ran and re-measured
- repeated until pre-defined checks were satisfied for this scope (RCA movement was explainable, invariants passed, replay-drift artifacts were generated, and PASS/BLOCK labels were stable across A/B/C on the control set)
- provenance checks
- real-evidence pair prechecks
- tighter sample/evidence pairing rules
- removed a major source of artificial false blocking in PASS rows
- made downstream governance failures interpretable as real component/gate issues instead of pairing noise
- evidence span granularity improvements
- query-anchored candidate text
- formatting-noise reduction in candidate construction
- moved upstream candidate/rank path from ranked=0 collapse to non-zero ranked outputs in probe runs
- enabled real evidence injection into downstream observer-shadow validation
- bio-domain path patch for protein-sequence usage
- corrected signal propagation
- YAML-based calibration in place of ad-hoc overrides
- eliminated pathological signal collapse behavior in the bio-domain path
- restored reproducible PASS/BLOCK separation without relying on ad-hoc CLI overrides
- richer bridge signal snapshots
- escalation taxonomy (soft-discord vs harder conditions)
- explicit observer-shadow observability
- made disagreement routing inspectable (soft-discord vs standard escalation)
- enabled non-binding shadow validation with leakage checks and replay drift comparisons
- made escalation categories operationally meaningful in observer mode (bounded validation candidate vs hold-for-review)
- A (evidence/provenance pairing) removed artificial blocking noise and made downstream failures attributable
- B (missing-link candidate/rank path) restored usable upstream evidence flow for observer-shadow evaluation
- C (bio-domain signal path) removed unstable signal behavior and eliminated dependence on ad-hoc overrides
- D (governance/arbitration observability) made routing, leakage, and drift observable enough to trust the measured result
- n=2 labeled samples (1 PASS_ELIGIBLE, 1 BLOCK_EXPECTED)
- 6 arm-level observations total (A/B/C for each sample)
- this is a control-set validation result, not a generalization estimate
- the purpose of this control set is behavioral reproducibility across pipeline versions (and across arms), not statistical generalization
- for that scope (binary control behavior + cross-arm stability), n=2 is sufficient to test whether a version preserves or breaks the intended PASS/BLOCK routing behavior
- PASS sample remained PASS in all three arms
- BLOCK sample remained BLOCK in all three arms
- no arm-specific flip was observed in this measured set
- arm-level benchmark metrics matched the sample-level classification outcomes on this control set (balanced_accuracy = 1.0 across the 6 arm-level rows)
- dangerous_pass_rate = 0.0
- false_reject_rate = 0.0
- balanced_accuracy = 1.0
- labels were pre-registered before reruns (PASS_ELIGIBLE, BLOCK_EXPECTED)
- control-set manifest was used as the evaluation source of truth (artifact-first workflow)
- labels were recorded as expert control labels in the manifest (expert_structural_label, rationale, confidence metadata)
- this is a control-set reproducibility result, not a train/test generalization claim
- AATS and HRPO-X both look relatively high, while IRF is the stricter (lower-scoring) signal
- therefore IRF appears to be the main dissenter
- LawBinder escalated all rows as discord-only (soft-discord)
- HRPO-X was not the outlier
- HRPO-X was often close to the AATS/IRF score geometry mean
- yet it still received top fallback weight under conflict handling
- in score-gap terms, diff_aats_irf was the largest gap, while hrpo_vs_aats_irf_mean remained small on this control set
- pairwise distances also show HRPO-X sits between AATS and IRF (not as a clean two-model bloc with either side)
- the main issue is not "HRPO-X is rogue"
- the issue is how disagreement (discord) is computed and consumed
- SHADOW_SOFT_ESCALATE_BOUNDED
- SHADOW_STANDARD_ESCALATE
- PASS rows mapped to bounded soft escalation (observer-only hint)
- BLOCK rows mapped to standard escalation
- no false bounded-escalation on block rows in the measured set
- normalized path
- rawtext-stable comparison path
- separate operational verdicts from observer shadow logic
- make drift measurable
- keep the validation logic inspectable
- check_non_binding_invariant() maps to governance_status.* and critic_channel_shadow_assessment.*
- compare_disagreement_snapshots() maps to disagreement fields exposed under lawbinder_signal_snapshot.*
- the internal component names are actual payload fields, not presentation labels
- operational verdicts and observer shadow hints are explicitly separated
- disagreement drift observability (dual-record) is recorded alongside the decision output
- component floors
- p_e2e structure
- blocker tracing during RCA iterations
- pass/block explanation support
- reasoning disagreement
- from governance-level reliability failures
- do not treat disagreement as noise just because one score looks high
- preserve bounded-validation routing context while avoiding premature confidence
- final frozen-track closure (EXP-032A)
- strict L3 grounding requirements
- production arbitration alignment (LawBinder still escalates in these rows)
- canonical disagreement metric selection (we are in a dual-record observation period)
- the measured JSON artifacts (selected and organized)
- a sanitized, IP-safe educational code subset
- reproducibility-oriented reporting/check scripts
- reproducibility review
- methodology inspection
- decision-trace transparency
- If the GitHub release slips past the target window, we will update this post with a dated status note rather than leaving the timeline ambiguous.
- what changed
- why it changed
- what remained unchanged
- what is still unresolved
- soft vs hard escalation separation
- critic-channel routing rules
- disagreement metric hardening/comparison
- maintaining the same anti-leakage and replay-drift discipline
- arbitration layer -> LawBinder
- clinical governance gate -> CCGE
- structural skepticism lens -> Sydney Lens
- internal reliability/drift-adjacent signals -> SR9, DI2
- observer shadow routing -> critic_channel_shadow_assessment