Search
12 results for "benchmark"
- ai · arxiv/cs.AI · 8 min
Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks
How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.
May 2, 2026
- ai · arxiv/cs.AI · 8 min
LLMs Withhold Help When They Misread Intent, Not Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
May 1, 2026
- ai · arxiv/cs.AI · 8 min
LATTICE: Measuring Crypto Agent Quality Beyond Accuracy
New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.
Apr 30, 2026
- ai · arxiv/cs.LG · 8 min
Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Apr 29, 2026
- ai · arxiv/cs.LG · 4 min
Hyperbolic neural networks outperform Euclidean models in quantum simulations
Researchers demonstrate that Poincaré and Lorentz recurrent architectures consistently beat standard neural quantum states on many-body physics benchmarks.
Apr 28, 2026
- ai · arxiv/cs.AI · 8 min
Trust-weighted SSL improves aerial image learning under corruption
Additive-residual trust weights boost self-supervised learning robustness when aerial images degrade, outperforming standard contrastive methods on benchmark datasets.
Apr 24, 2026
- ai · arxiv/cs.LG · 8 min
Simple graph models match deep learning for molecular prediction
Classical topological indices enhanced with regularization and ensemble methods outperform neural networks on molecular property benchmarks without GPU requirements.
Apr 23, 2026
- ai · arxiv/cs.AI · 6 min
AD-Copilot: Vision-Language Model Trained for Factory Defect Detection
Researchers built a specialized multimodal AI that compares paired industrial images to spot subtle manufacturing flaws, outperforming general-purpose models and human inspectors on benchmark tasks.
Apr 22, 2026
- ai · arxiv/cs.AI · 4 min
MERRIN: Benchmark for Multimodal Search in Noisy Web Data
New benchmark reveals AI agents struggle with real-world web search, achieving only 22% accuracy when retrieving and reasoning across mixed media sources.
Apr 17, 2026
- ai · arxiv/cs.AI · 8 min
LLMs hit formal reasoning ceiling; Chomsky Hierarchy reveals efficiency gap
New benchmark shows large language models struggle with structured complexity tasks and require prohibitive compute to achieve reliability in formal reasoning.
Apr 17, 2026
- ai · arxiv/cs.AI · 8 min
Vision-Language Models Fail on Dense Visual Grids
A new benchmark reveals VLMs collapse sharply on simple grid-reading tasks, exposing a gap between visual encoding and language output called Digital Agnosia.
Apr 17, 2026
- ai · arxiv/cs.LG · 6 min
Speech Models Fail Safety Tests That Text Models Pass
A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.
Apr 17, 2026