Rule-Based AI Needs Policy Grounding, Not Label Agreement
Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.
Agreement-based evaluation of rule-governed AI systems mislabels valid decisions as errors; policy-grounded correctness with defensibility signals fixes this.
- Agreement metrics penalize logically valid decisions when multiple rule-consistent outcomes exist.
- Defensibility Index measures whether a decision follows from stated policy rules.
- Ambiguity Index quantifies the rule-specificity gaps that drive disagreement.
- Probabilistic Defensibility Signal derives reasoning stability from LLM token probabilities without extra audits (see the sketch after this list).
- A Reddit moderation test found a 33–46.6 percentage-point gap between agreement and policy-grounded scores.
- 79.8–80.6% of flagged false negatives were actually policy-consistent decisions.
- Governance Gate automation achieved 78.6% coverage with 64.9% risk reduction.
- Rule clarity directly reduces measured ambiguity, while defensibility remains stable.
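The Probabilistic Defensibility Signal is derived from token probabilities the LLM already emits, so no separate audit pass is required. The paper's exact formulation is not reproduced here; the sketch below shows one plausible construction, treating the geometric-mean token probability of the model's reasoning chain as a stability proxy. The function name and the mapping to a 0–1 score are illustrative assumptions.

```python
import math

def defensibility_signal(token_logprobs: list[float]) -> float:
    """Illustrative stability proxy derived from token log-probabilities.

    Assumption: a reasoning chain generated with uniformly high token
    probabilities is treated as more stable (defensible) than one
    containing low-probability tokens. A sketch, not the paper's
    exact formulation.
    """
    if not token_logprobs:
        return 0.0
    # Mean log-probability of the chain; closer to 0 means more stable.
    mean_lp = sum(token_logprobs) / len(token_logprobs)
    # exp(mean log-prob) is the geometric-mean token probability,
    # i.e. the inverse of per-token perplexity, bounded in (0, 1].
    return math.exp(mean_lp)

# Example: log-probs returned alongside a moderation rationale.
chain = [-0.05, -0.12, -0.30, -0.08, -0.41]
print(f"defensibility signal: {defensibility_signal(chain):.3f}")
```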
Astrobobo tool mapping
- Knowledge Capture Document your organization's policy hierarchy in a structured format (decision tree, rule precedence matrix). Identify gaps where multiple outcomes are rule-consistent.
- Focus Brief Create a one-page audit checklist: for each flagged decision, verify whether the model's reasoning chain follows stated rules, independent of human label agreement.
- Daily Log Track ambiguous decisions daily. Tag by rule specificity level (clear, partial, conflicting). Over 2–4 weeks, identify which rule tiers drive most disagreement; the sketch after this list shows one way to tally the results.
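A minimal sketch tying the Knowledge Capture and Daily Log steps together: each logged decision is tagged with a rule-specificity tier, and a per-tier disagreement rate shows where rule ambiguity concentrates. All names here (Specificity, LoggedDecision, disagreement_by_tier) are hypothetical, chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Specificity(Enum):
    CLEAR = "clear"              # exactly one rule-consistent outcome
    PARTIAL = "partial"          # rule applies but leaves judgment calls
    CONFLICTING = "conflicting"  # multiple rules support different outcomes

@dataclass
class LoggedDecision:
    post_id: str
    rule_id: str
    specificity: Specificity
    disagreed: bool  # did reviewers disagree on this decision?

def disagreement_by_tier(log: list[LoggedDecision]) -> dict[str, float]:
    """Share of decisions with reviewer disagreement, per specificity tier."""
    totals, disagreements = Counter(), Counter()
    for d in log:
        totals[d.specificity.value] += 1
        if d.disagreed:
            disagreements[d.specificity.value] += 1
    return {tier: disagreements[tier] / n for tier, n in totals.items()}

log = [
    LoggedDecision("p1", "R1", Specificity.CLEAR, False),
    LoggedDecision("p2", "R4", Specificity.CONFLICTING, True),
    LoggedDecision("p3", "R4", Specificity.CONFLICTING, True),
    LoggedDecision("p4", "R2", Specificity.PARTIAL, False),
]
print(disagreement_by_tier(log))
```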
Frequently asked
- Why do agreement metrics penalize valid decisions? When multiple decisions logically satisfy the same policy, agreement metrics treat valid alternatives as errors. A post might violate Rule A but not Rule B; both interpretations are defensible. Agreement-based evaluation penalizes this ambiguity as model failure when it actually reflects rule ambiguity. Policy-grounded evaluation asks whether the decision follows from stated rules, not whether it matches a historical label; the toy comparison below makes the distinction concrete.
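A minimal sketch of that comparison, assuming each case carries the full set of outcomes the stated rules permit. Agreement scoring checks the historical label; policy-grounded scoring checks membership in that set. The Case structure and the outcome strings are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Case:
    decision: str              # the model's decision, e.g. "remove"
    human_label: str           # the historical moderator label
    rule_consistent: set[str]  # every outcome the stated rules permit

cases = [
    # Rules genuinely permit either outcome: one rule says remove,
    # another carves out an exception. The model chose "keep",
    # the historical moderator chose "remove".
    Case("keep", "remove", {"keep", "remove"}),
    Case("remove", "remove", {"remove"}),
]

agreement = sum(c.decision == c.human_label for c in cases) / len(cases)
grounded = sum(c.decision in c.rule_consistent for c in cases) / len(cases)

print(f"agreement score:       {agreement:.0%}")  # 50%: valid choice penalized
print(f"policy-grounded score: {grounded:.0%}")   # 100%: both decisions defensible
```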
Cite
Michael O'Herlihy, Rosa Català. (2026, April 26). Rule-Based AI Needs Policy Grounding, Not Label Agreement. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/rule-based-ai-needs-policy-grounding-not-label-agreement-a6a5d9
Michael O'Herlihy, Rosa Català. "Rule-Based AI Needs Policy Grounding, Not Label Agreement." Astrobobo Content Engine, 26 Apr 2026, https://astrobobo-content-engine.vercel.app/article/rule-based-ai-needs-policy-grounding-not-label-agreement-a6a5d9. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.20972.
@misc{astrobobo_rule-based-ai-needs-policy-grounding-not-label-agreement-a6a5d9_2026,
  author = {Michael O'Herlihy and Rosa Catal\`{a}},
  title = {Rule-Based AI Needs Policy Grounding, Not Label Agreement},
  year = {2026},
  url = {https://astrobobo-content-engine.vercel.app/article/rule-based-ai-needs-policy-grounding-not-label-agreement-a6a5d9},
  note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.20972},
}