Astrobobo · Content Engine

3 results for "evaluation"

ai · arxiv/cs.AI · 8 min

Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks

How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, and why gold-label assumptions need governance.

May 2, 2026

engineering · arxiv/cs.LG · 4 min

Graph Neural Networks Cut QAOA Query Cost by 87%

A trust-region method using GNNs to predict QAOA parameter distributions reduces circuit evaluations while preserving solution quality on small graphs.

Apr 29, 2026

ai · arxiv/cs.LG · 8 min

LLM Panels Match Expert Clinicians in Medical Diagnosis Scoring

A study of three frontier AI models scoring real hospital cases shows calibrated LLM juries can reliably replace human expert panels for medical AI evaluation.

Apr 17, 2026