Tag
#benchmark
10 insights
- ai · arxiv/cs.AI · 8 min
LLMs Withhold Help When They Misread Intent, Not When They Lack Knowledge
A new benchmark reveals that language models often refuse benign requests due to misinterpreting user intent, and their ability to recover utility through clarification varies widely.
May 1, 2026
- ai · arxiv/cs.AI · 8 min
LATTICE: Measuring Crypto Agent Quality Beyond Accuracy
New benchmark evaluates how well AI agents support user decisions in crypto, not just whether they get answers right.
Apr 30, 2026
- ai · arxiv/cs.LG · 8 min
Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Apr 29, 2026
- ai · arxiv/cs.AI · 4 min
KuaiLive: First Real-Time Live Streaming Recommendation Dataset
Researchers release a 21-day interaction log from Kuaishou covering 23,772 users and 452,621 streamers to enable dynamic recommendation research.
Apr 27, 2026
- ai · arxiv/cs.AI · 4 min
MERRIN: Benchmark for Multimodal Search in Noisy Web Data
New benchmark reveals AI agents struggle with real-world web search, achieving only 22% accuracy when retrieving and reasoning across mixed media sources.
Apr 17, 2026
- ai · arxiv/cs.AI · 8 min
LLMs hit formal reasoning ceiling; Chomsky Hierarchy reveals efficiency gap
New benchmark shows large language models struggle with structured complexity tasks and require prohibitive compute to achieve reliability in formal reasoning.
Apr 17, 2026
- ai · arxiv/cs.AI · 8 min
Vision-Language Models Fail on Dense Visual Grids
A new benchmark reveals that VLMs collapse sharply on simple grid-reading tasks, exposing a gap, called Digital Agnosia, between visual encoding and language output.
Apr 17, 2026
- ai · arxiv/cs.LG · 6 min
Speech Models Fail Safety Tests That Text Passes
VoxSafeBench reveals speech language models recognize social norms in text but ignore them when cues arrive through voice, speaker identity, or environment.
Apr 17, 2026
- ai · arxiv/cs.LG · 6 min
Speech Models Fail Safety Tests That Text Models Pass
A new benchmark reveals that speech language models drop safety, fairness, and privacy protections when cues arrive as audio rather than text.
Apr 17, 2026
- ai · arxiv/cs.LG · 4 min
Retrieval-Augmented Set Completion for Clinical Code Authoring
A two-stage approach retrieves similar clinical value sets then classifies candidates, outperforming direct LLM generation on standardized medical vocabularies.
Apr 17, 2026