Tag: #evaluation
5 insights
- ai · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How changes in the wording of evaluation criteria and in metric selection alter model rankings on financial text benchmarks, and why gold-label assumptions need governance.
May 2, 2026
- ai · arxiv/cs.AI · 8 min
LATTICE: Measuring Crypto Agent Quality Beyond Accuracy
New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.
Apr 30, 2026
- ai · arxiv/cs.LG · 8 min
Web Agents Plateau on Short Tasks; Odysseys Benchmark Tests Realistic Multi-Hour Workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Apr 29, 2026
- ai · arxiv/cs.AI · 8 min
Rule-Based AI Needs Policy Grounding, Not Label Agreement
Content moderation systems fail when evaluated by human agreement alone. A new framework measures whether decisions logically follow stated rules instead.
Apr 26, 2026
- ai · arxiv/cs.LG · 8 min
LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring
A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.
Apr 17, 2026