Tag

#benchmark

10 insights

ai · arxiv/cs.AI · 8 min

LLMs Withhold Help When They Misread Intent, Not Lack Knowledge

A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.

May 1, 2026

ai · arxiv/cs.AI · 8 min

LATTICE: Measuring Crypto Agent Quality Beyond Accuracy

New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.

Apr 30, 2026

ai · arxiv/cs.LG · 8 min

Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows

New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.

Apr 29, 2026

ai · arxiv/cs.AI · 4 min

KuaiLive: First Real-Time Live Streaming Recommendation Dataset

Researchers release a 21-day interaction log from Kuaishou covering 23,772 users and 452,621 streamers to enable dynamic recommendation research.

Apr 27, 2026

ai · arxiv/cs.AI · 4 min

MERRIN: Benchmark for Multimodal Search in Noisy Web Data

New benchmark reveals AI agents struggle with real-world web search, achieving only 22% accuracy when retrieving and reasoning across mixed media sources.

Apr 17, 2026

ai · arxiv/cs.AI · 8 min

LLMs hit formal reasoning ceiling; Chomsky Hierarchy reveals efficiency gap

New benchmark shows large language models struggle with structured complexity tasks and require prohibitive compute to achieve reliability in formal reasoning.

Apr 17, 2026

ai · arxiv/cs.AI · 8 min

Vision-Language Models Fail on Dense Visual Grids

A new benchmark reveals that VLMs collapse sharply on simple grid-reading tasks, exposing a gap, termed Digital Agnosia, between visual encoding and language output.

Apr 17, 2026

ai · arxiv/cs.LG · 6 min

Speech Models Fail Safety Tests That Text Passes

VoxSafeBench reveals speech language models recognize social norms in text but ignore them when cues arrive through voice, speaker identity, or environment.

Apr 17, 2026

ai · arxiv/cs.LG · 6 min

Speech Models Fail Safety Tests That Text Models Pass

A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.

Apr 17, 2026

ai · arxiv/cs.LG · 4 min

Retrieval-Augmented Set Completion for Clinical Code Authoring

A two-stage approach retrieves similar clinical value sets then classifies candidates, outperforming direct LLM generation on standardized medical vocabularies.

Apr 17, 2026
