Astrobobo · Content Engine

12 results for "benchmark"

ai · arxiv/cs.AI · 8 min

Benchmark Rubrics Shift LLM Scores in Financial NLP Tasks

How wording changes in evaluation criteria and metric selection alter model rankings on financial text benchmarks, requiring governance over gold-label assumptions.

May 2, 2026
ai · arxiv/cs.AI · 8 min

LLMs Withhold Help When They Misread Intent, Not Lack Knowledge

A new benchmark reveals that language models often refuse benign requests because they misinterpret user intent, and that their ability to recover utility through clarification varies widely.

May 1, 2026
ai · arxiv/cs.AI · 8 min

LATTICE: Measuring Crypto Agent Quality Beyond Accuracy

New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.

Apr 30, 2026
ai · arxiv/cs.LG · 8 min

Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows

New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.

Apr 29, 2026
ai · arxiv/cs.LG · 4 min

Hyperbolic neural networks outperform Euclidean models in quantum simulations

Researchers demonstrate that Poincaré and Lorentz recurrent architectures consistently beat standard neural quantum states on many-body physics benchmarks.

Apr 28, 2026
ai · arxiv/cs.AI · 8 min

Trust-weighted SSL improves aerial image learning under corruption

Additive-residual trust weights boost self-supervised learning robustness when aerial images degrade, outperforming standard contrastive methods on benchmark datasets.

Apr 24, 2026
ai · arxiv/cs.LG · 8 min

Simple graph models match deep learning for molecular prediction

Classical topological indices enhanced with regularization and ensemble methods outperform neural networks on molecular property benchmarks without GPU requirements.

Apr 23, 2026
ai · arxiv/cs.AI · 6 min

AD-Copilot: Vision-Language Model Trained for Factory Defect Detection

Researchers built a specialized multimodal AI that compares paired industrial images to spot subtle manufacturing flaws, outperforming general-purpose models and human inspectors on benchmark tasks.

Apr 22, 2026
ai · arxiv/cs.AI · 4 min

MERRIN: Benchmark for Multimodal Search in Noisy Web Data

New benchmark reveals AI agents struggle with real-world web search, achieving only 22% accuracy when retrieving and reasoning across mixed media sources.

Apr 17, 2026
ai · arxiv/cs.AI · 8 min

LLMs hit formal reasoning ceiling; Chomsky Hierarchy reveals efficiency gap

New benchmark shows large language models struggle with structured complexity tasks and require prohibitive compute to achieve reliability in formal reasoning.

Apr 17, 2026
ai · arxiv/cs.AI · 8 min

Vision-Language Models Fail on Dense Visual Grids

A new benchmark reveals that VLMs collapse sharply on simple grid-reading tasks, exposing a gap, called Digital Agnosia, between visual encoding and language output.

Apr 17, 2026
ai · arxiv/cs.LG · 6 min

Speech Models Fail Safety Tests That Text Models Pass

A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.

Apr 17, 2026