Tag: #evaluation
5 insights
- ai · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How changes in the wording of evaluation criteria and in metric selection alter model rankings on financial text benchmarks, and why gold-label assumptions need governance.
May 2, 2026
- ai · arxiv/cs.AI · 8 min
LATTICE: Measuring Crypto Agent Quality Beyond Accuracy
New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.
Apr 30, 2026
- ai · arxiv/cs.LG · 8 min
Web Agents Plateau on Short Tasks; Odysseys Benchmark Tests Realistic Multi-Hour Workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Apr 29, 2026
- ai · arxiv/cs.AI · 8 min
Rule-Based AI Needs Policy Grounding, Not Label Agreement
Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.
Apr 26, 2026
- ai · arxiv/cs.LG · 8 min
LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring
A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.
Apr 17, 2026