Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.
Gold labels in financial NLP benchmarks are not objective; rubric wording and metric choice materially shift model rankings and require explicit governance.
- Rubric wording changes alter model-assigned labels by 17–30 percentage points, especially near decision boundaries.
- Not all metrics remain informative under real class distributions; within-one accuracy and worst-class accuracy are unreliable.
- Exact accuracy, macro-F1, and weighted kappa are defensible metrics for the Japanese Financial Implicit-Commitment Recognition benchmark (see the metric sketch after this list).
- Ranking disagreement emerges when all five metrics are used but vanishes when evaluation is restricted to the three defensible metrics.
- Measurement risk arises from confounded rubric variants that mix semantics, examples, and verbosity without isolating causes.
- Supervised financial benchmarks need explicit reporting discipline on rubric governance and metric selection, not just new leaderboards.
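All five metrics named above are computable with standard tooling. Below is a minimal sketch in Python with scikit-learn, assuming a hypothetical ordinal 0–3 label scheme and invented predictions; the benchmark's real label set and data are not reproduced here:

```python
# Minimal sketch of the five metrics discussed above. The ordinal 0-3
# label scheme and all data are invented for illustration.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, recall_score

gold   = np.array([0, 1, 1, 2, 2, 2, 3, 3, 0, 1])  # hypothetical gold labels
pred_a = np.array([0, 1, 2, 2, 2, 1, 3, 2, 0, 1])  # hypothetical model A
pred_b = np.array([0, 1, 1, 3, 2, 2, 2, 3, 1, 1])  # hypothetical model B

def report(name: str, gold: np.ndarray, pred: np.ndarray) -> None:
    exact      = accuracy_score(gold, pred)                          # exact accuracy
    macro_f1   = f1_score(gold, pred, average="macro")               # macro-F1
    kappa      = cohen_kappa_score(gold, pred, weights="quadratic")  # weighted kappa
    within_one = np.mean(np.abs(gold - pred) <= 1)                   # within-one accuracy
    worst      = recall_score(gold, pred, average=None).min()        # worst-class accuracy
    print(f"{name}: exact={exact:.2f} macro_f1={macro_f1:.2f} "
          f"kappa={kappa:.2f} within_one={within_one:.2f} worst={worst:.2f}")

report("model A", gold, pred_a)
report("model B", gold, pred_b)
# Every error in this toy data is off by exactly one class, so both models
# score within_one = 1.00: the metric saturates and stops discriminating,
# which is the sense in which it becomes uninformative.
```

Worst-class accuracy has a related weakness: it reduces to the recall of a single class, often the rarest one, which makes it noisy under the skewed distributions typical of financial labels, while exact accuracy, macro-F1, and weighted kappa keep separating the models.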
Astrobobo tool mapping
- Knowledge Capture: Document your benchmark's rubric, metric definitions, and class distribution in a structured template. Include the rationale for each choice and flag confounded variables (a template sketch follows this list).
- Focus Brief: Summarize the three findings (rubric sensitivity, metric informativeness, ranking defensibility) and map them to your own benchmark's design. Identify gaps.
- Daily Log: Track rubric changes, metric audits, and ranking disagreements as you iterate on your benchmark. Use this log to build a governance narrative for stakeholders.
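As a concrete starting point for the Knowledge Capture step, one possible shape for such a template in Python follows; every field name is hypothetical, not a prescribed schema, and should be adapted to your own benchmark:

```python
# Hypothetical Knowledge Capture template for rubric/metric governance.
# Field names are illustrative assumptions, not a standard.
from dataclasses import dataclass, field

@dataclass
class MetricChoice:
    name: str         # e.g. "macro-F1"
    rationale: str    # why this metric stays informative for your classes
    defensible: bool  # does it survive an audit under the real class skew?

@dataclass
class RubricRecord:
    rubric_text: str                      # exact wording shown to annotators/models
    class_distribution: dict[str, float]  # empirical label frequencies
    confounds: list[str] = field(default_factory=list)  # flag mixed-up variables
    metrics: list[MetricChoice] = field(default_factory=list)

record = RubricRecord(
    rubric_text="Label the sentence as an implicit commitment if ...",
    class_distribution={"commitment": 0.2, "non-commitment": 0.8},
    confounds=["variant 2 changes both definition wording and verbosity"],
    metrics=[MetricChoice("macro-F1", "robust to the 80/20 class skew", True)],
)
```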
Frequently asked
- Why do rubric wording changes shift labels so much? Because wording alters how annotators and models interpret boundary cases near decision thresholds (e.g., whether a statement is an implicit commitment or not). The study found agreement between rubric variants ranged from 70% to 83%, meaning 17–30% of labels shifted. This is not random noise; it reflects genuine ambiguity in the rubric itself rather than a limitation of the model.
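To make those agreement figures concrete, here is a minimal sketch of how label shift between two rubric variants can be quantified; the label arrays below are invented, and only the 70–83% range comes from the study:

```python
# Sketch: measuring how many labels flip between two rubric variants.
# The label arrays are invented for illustration.
import numpy as np
from sklearn.metrics import cohen_kappa_score

labels_v1 = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])  # labels under rubric variant 1
labels_v2 = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # labels under rubric variant 2

agreement = np.mean(labels_v1 == labels_v2)      # raw percentage agreement
kappa = cohen_kappa_score(labels_v1, labels_v2)  # chance-corrected agreement

print(f"agreement={agreement:.0%}, shifted={1 - agreement:.0%}, kappa={kappa:.2f}")
# Here agreement is 70%, i.e. 30% of labels shifted purely from rubric wording,
# matching the low end of the range reported in the study.
```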