MERRIN: Benchmark for Multimodal Search in Noisy Web Data
New benchmark reveals AI agents struggle with real-world web search, averaging only 22% accuracy when retrieving and reasoning across mixed-media sources.
MERRIN benchmark tests AI agents on retrieving and reasoning over multimodal, conflicting web evidence without explicit modality hints.
- Benchmark uses natural language queries without telling agents which modalities to prioritize.
- Incorporates video and audio alongside text, modalities often overlooked in prior benchmarks.
- Tests three search modes: no search, native search, and agentic search with tool use (a minimal sketch follows this list).
- Best agent achieves 40% accuracy; average across all agents is 22%.
- Strong models like Gemini Deep Research over-explore, wasting resources on conflicting sources.
- Agents rely too heavily on text and select sources inefficiently compared to human performance.
- Benchmark reflects real-world web search: underspecified queries, heterogeneous results, conflicting claims.
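For readers who want to poke at the setup, here is a minimal, hypothetical sketch of how the three search conditions could be scored side by side. Nothing below comes from the MERRIN release: the `Item` fields, the exact-match scoring, and the toy retriever are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical harness: field names, scoring, and the toy retriever are illustrative,
# not taken from the MERRIN paper or its released code.

@dataclass
class Item:
    query: str    # underspecified natural-language question (no modality hint)
    answer: str   # gold answer, scored here by simple exact match

def accuracy(predict: Callable[[Item], str], items: list[Item]) -> float:
    """Fraction of items whose prediction exactly matches the gold answer."""
    hits = sum(predict(it).strip().lower() == it.answer.strip().lower() for it in items)
    return hits / len(items)

def toy_retrieve(query: str) -> list[str]:
    """Stand-in retriever; a real one would return mixed text/video/audio evidence."""
    corpus = {"capital of france": "paris"}
    return [v for k, v in corpus.items() if k in query.lower()]

def no_search(item: Item) -> str:
    """Condition 1: answer from parametric knowledge only (placeholder model call)."""
    return "unknown"

def native_search(item: Item) -> str:
    """Condition 2: one retrieval pass over mixed-media results, then answer."""
    results = toy_retrieve(item.query)
    return results[0] if results else "unknown"

def agentic_search(item: Item, max_steps: int = 3) -> str:
    """Condition 3: iterative tool use, reformulating the query between calls."""
    query = item.query
    for _ in range(max_steps):
        results = toy_retrieve(query)
        if results:
            return results[0]
        query = query + " details"   # naive reformulation step
    return "unknown"

if __name__ == "__main__":
    items = [Item("What is the capital of France?", "Paris")]
    for name, fn in [("no search", no_search),
                     ("native search", native_search),
                     ("agentic search", agentic_search)]:
        print(f"{name}: {accuracy(fn, items):.0%}")
```

Swapping the placeholder functions for real model and retriever calls is enough to reproduce the shape of the comparison, even though the actual benchmark's scoring is richer than exact match.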
Astrobobo tool mapping
- Reading Queue: Add the MERRIN paper and related multimodal benchmarks (e.g., WebVLM, Flamingo) to your queue to understand the landscape of multimodal evaluation.
- Knowledge Capture: Document your system's current modality-selection logic (explicit rules, learned weights, or heuristics) and note gaps where it defaults to text (a minimal sketch follows this list).
- Focus Brief: Create a brief on "multimodal confidence and source weighting" for your team, covering which signals indicate a source is reliable across modalities.
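As a starting point for that documentation, here is one way a modality-selection heuristic with a text-leaning default could look. Every name, weight, and prior below is made up for illustration; the point is only to make the bias explicit enough to measure and revisit.

```python
from dataclasses import dataclass

# Illustrative only: hypothetical fields and weights, not MERRIN's implementation.

@dataclass
class SourceHit:
    modality: str      # "text", "video", or "audio"
    relevance: float   # retriever score in [0, 1]
    agreement: float   # fraction of other retrieved sources making the same claim

def weight(hit: SourceHit, modality_prior: dict[str, float]) -> float:
    """Combine retrieval score, cross-source agreement, and a per-modality prior.

    A prior skewed toward text (e.g. {"text": 1.0, "video": 0.4, "audio": 0.3})
    is exactly the kind of default worth writing down and revisiting.
    """
    return modality_prior.get(hit.modality, 0.0) * (0.6 * hit.relevance + 0.4 * hit.agreement)

def select(hits: list[SourceHit], prior: dict[str, float], k: int = 3) -> list[SourceHit]:
    """Keep the k highest-weighted sources across all modalities."""
    return sorted(hits, key=lambda h: weight(h, prior), reverse=True)[:k]

if __name__ == "__main__":
    hits = [
        SourceHit("text", 0.9, 0.3),    # highly ranked but contradicted by other sources
        SourceHit("video", 0.7, 0.9),   # slightly lower score, strong agreement
        SourceHit("audio", 0.5, 0.8),
    ]
    text_biased = {"text": 1.0, "video": 0.4, "audio": 0.3}
    for h in select(hits, text_biased, k=2):
        print(h.modality, round(weight(h, text_biased), 2))
```

Running it with the text-biased prior keeps the contradicted text hit ahead of the well-corroborated video hit, which is exactly the over-reliance on text that the benchmark flags.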
Frequently asked
- What is MERRIN and why does it matter? MERRIN is a benchmark that tests AI agents on retrieving and reasoning over multimodal evidence (text, video, audio) from noisy web sources. It matters because real-world search queries are ambiguous and web results often conflict. MERRIN measures whether agents can decide which modalities are relevant and integrate contradictory information, tasks current agents perform poorly at, with best-in-class models reaching only 40% accuracy.
Cite
Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal. (2026, April 17). MERRIN: Benchmark for Multimodal Search in Noisy Web Data. Astrobobo Content Engine (rewrite of arxiv/cs.AI). https://astrobobo-content-engine.vercel.app/article/merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f
Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal. "MERRIN: Benchmark for Multimodal Search in Noisy Web Data." Astrobobo Content Engine, 17 Apr 2026, https://astrobobo-content-engine.vercel.app/article/merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f. Based on "arxiv/cs.AI", https://arxiv.org/abs/2604.13418.
@misc{astrobobo_merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f_2026,
author = {Han Wang and David Wan and Hyunji Lee and Thinh Pham and Mikaela Cankosyan and Weiyuan Chen and Elias Stengel-Eskin and Tu Vu and Mohit Bansal},
title = {MERRIN: Benchmark for Multimodal Search in Noisy Web Data},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/merrin-benchmark-for-multimodal-search-in-noisy-web-data-24836f},
note = {Astrobobo rewrite of arxiv/cs.AI, https://arxiv.org/abs/2604.13418},
}