Search
3 results for "evaluation"
- ai · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, and why gold-label assumptions need governance.
May 2, 2026
- engineering · arxiv/cs.LG · 4 min
Graph Neural Networks Cut QAOA Query Cost by 87%
A trust-region method using GNNs to predict QAOA parameter distributions reduces circuit evaluations while preserving solution quality on small graphs.
Apr 29, 2026
- ai · arxiv/cs.LG · 8 min
LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring
A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.
Apr 17, 2026