MERRIN: Benchmark for Multimodal Search in Noisy Web Data
New benchmark reveals AI agents struggle with real-world web search, averaging only 22% accuracy when retrieving and reasoning across mixed-media sources.
MERRIN benchmark tests AI agents on retrieving and reasoning over multimodal, conflicting web evidence without explicit modality hints.
- Benchmark uses natural language queries without telling agents which modalities to prioritize.
- Incorporates video and audio alongside text, modalities often overlooked in prior benchmarks.
- Tests three search modes: no search, native search, and agentic search with tool use (a minimal sketch follows this list).
- Best agent achieves 40% accuracy; average across all agents is 22%.
- Strong models like Gemini Deep Research over-explore, wasting resources on conflicting sources.
- Agents rely too heavily on text and select sources inefficiently compared to human performance.
- Benchmark reflects real-world web search: underspecified queries, heterogeneous results, conflicting claims.
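For readers who want to poke at the setup, here is a minimal, hypothetical sketch of how the three search conditions could be scored side by side. Nothing below comes from the MERRIN release: the `Item` fields, the exact-match scoring, and the toy retriever are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical harness: field names, scoring, and the toy retriever are illustrative,
# not taken from the MERRIN paper or its released code.

@dataclass
class Item:
    query: str    # underspecified natural-language question (no modality hint)
    answer: str   # gold answer, scored here by simple exact match

def accuracy(predict: Callable[[Item], str], items: list[Item]) -> float:
    """Fraction of items whose prediction exactly matches the gold answer."""
    hits = sum(predict(it).strip().lower() == it.answer.strip().lower() for it in items)
    return hits / len(items)

def toy_retrieve(query: str) -> list[str]:
    """Stand-in retriever; a real one would return mixed text/video/audio evidence."""
    corpus = {"capital of france": "paris"}
    return [v for k, v in corpus.items() if k in query.lower()]

def no_search(item: Item) -> str:
    """Condition 1: answer from parametric knowledge only (placeholder model call)."""
    return "unknown"

def native_search(item: Item) -> str:
    """Condition 2: one retrieval pass over mixed-media results, then answer."""
    results = toy_retrieve(item.query)
    return results[0] if results else "unknown"

def agentic_search(item: Item, max_steps: int = 3) -> str:
    """Condition 3: iterative tool use, reformulating the query between calls."""
    query = item.query
    for _ in range(max_steps):
        results = toy_retrieve(query)
        if results:
            return results[0]
        query = query + " details"   # naive reformulation step
    return "unknown"

if __name__ == "__main__":
    items = [Item("What is the capital of France?", "Paris")]
    for name, fn in [("no search", no_search),
                     ("native search", native_search),
                     ("agentic search", agentic_search)]:
        print(f"{name}: {accuracy(fn, items):.0%}")
```

Swapping the placeholder functions for real model and retriever calls is enough to reproduce the shape of the comparison, even though the actual benchmark's scoring is richer than exact match.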
Astrobobo tool mapping
- Reading Queue: Add the MERRIN paper and related multimodal benchmarks (e.g., WebVLM, Flamingo) to your queue to understand the landscape of multimodal evaluation.
- Knowledge Capture: Document your system's current modality-selection logic (explicit rules, learned weights, or heuristics) and note gaps where it defaults to text (a minimal sketch follows this list).
- Focus Brief: Create a brief on "multimodal confidence and source weighting" for your team, covering which signals indicate a source is reliable across modalities.
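As a starting point for that documentation, here is one way a modality-selection heuristic with a text-leaning default could look. Every name, weight, and prior below is made up for illustration; the point is only to make the bias explicit enough to measure and revisit.

```python
from dataclasses import dataclass

# Illustrative only: hypothetical fields and weights, not MERRIN's implementation.

@dataclass
class SourceHit:
    modality: str      # "text", "video", or "audio"
    relevance: float   # retriever score in [0, 1]
    agreement: float   # fraction of other retrieved sources making the same claim

def weight(hit: SourceHit, modality_prior: dict[str, float]) -> float:
    """Combine retrieval score, cross-source agreement, and a per-modality prior.

    A prior skewed toward text (e.g. {"text": 1.0, "video": 0.4, "audio": 0.3})
    is exactly the kind of default worth writing down and revisiting.
    """
    return modality_prior.get(hit.modality, 0.0) * (0.6 * hit.relevance + 0.4 * hit.agreement)

def select(hits: list[SourceHit], prior: dict[str, float], k: int = 3) -> list[SourceHit]:
    """Keep the k highest-weighted sources across all modalities."""
    return sorted(hits, key=lambda h: weight(h, prior), reverse=True)[:k]

if __name__ == "__main__":
    hits = [
        SourceHit("text", 0.9, 0.3),    # highly ranked but contradicted by other sources
        SourceHit("video", 0.7, 0.9),   # slightly lower score, strong agreement
        SourceHit("audio", 0.5, 0.8),
    ]
    text_biased = {"text": 1.0, "video": 0.4, "audio": 0.3}
    for h in select(hits, text_biased, k=2):
        print(h.modality, round(weight(h, text_biased), 2))
```

Running it with the text-biased prior keeps the contradicted text hit ahead of the well-corroborated video hit, which is exactly the over-reliance on text that the benchmark flags.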
Frequently asked
- What is MERRIN and why does it matter? MERRIN is a benchmark that tests AI agents on retrieving and reasoning over multimodal evidence (text, video, audio) from noisy web sources. It matters because real-world search queries are ambiguous and web results often conflict. MERRIN measures whether agents can decide which modalities are relevant and integrate contradictory information, tasks current agents perform poorly at, with best-in-class models reaching only 40% accuracy.
Cite
Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal. (2026, April 17). MERRIN: Benchmark for Multimodal Search in Noisy Web Data. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f
Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal. "MERRIN: Benchmark for Multimodal Search in Noisy Web Data." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.13418.
@misc{astrobobo_merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f_2026,
author = {Han Wang and David Wan and Hyunji Lee and Thinh Pham and Mikaela Cankosyan and Weiyuan Chen and Elias Stengel-Eskin and Tu Vu and Mohit Bansal},
title = {MERRIN: Benchmark for Multimodal Search in Noisy Web Data},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.13418},
}