Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Odysseys benchmark tests web agents on realistic multi-site, multi-hour tasks; frontier models succeed 44.5% of the time with poor efficiency.
- Existing benchmarks use short, single-site tasks on which frontier models are already near saturation.
- Real web work requires sustained context across multiple domains over hours.
- Odysseys contains 200 tasks drawn from actual browsing sessions and evaluated on the live web.
- Binary pass/fail metrics are inadequate; rubric-based evaluation, with 6.1 graders per task, provides a finer-grained signal.
- Frontier models achieve only a 44.5% success rate on long-horizon tasks.
- The Trajectory Efficiency metric shows agents succeed slowly, earning only 1.15% of the rubric score per step.
- Efficiency matters as much as correctness for practical agent deployment.
- The benchmark is released with tasks, evaluation code, and results.
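The paper's exact formulation of Trajectory Efficiency isn't reproduced here; a minimal sketch, assuming the metric is simply the final rubric score divided by the number of agent steps (the function name and example numbers are hypothetical, chosen so the ratio matches the reported 1.15% per step):

```python
def trajectory_efficiency(rubric_score: float, num_steps: int) -> float:
    """Rubric score earned per agent step.

    rubric_score: percentage of rubric criteria satisfied (0-100).
    num_steps: number of actions the agent took on the task.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score / num_steps


# Hypothetical example: an agent that satisfies 46% of the rubric
# in 40 steps earns 1.15% of rubric score per step.
print(trajectory_efficiency(46.0, 40))  # 1.15
```

The point of normalizing by steps is that two agents with identical rubric scores can differ sharply in cost: the one that needed half the steps is twice as efficient to deploy.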
Astrobobo tool mapping
- Reading Queue: Add the Odysseys paper and benchmark website to your queue; schedule 30 minutes to review task categories and failure patterns.
- Knowledge Capture: Document your agent's step count and rubric score on a sample task; create a template for tracking efficiency metrics across future runs.
- Focus Brief: Summarize the gap between your agent's success rate and the 44.5% frontier baseline, and list the top three failure modes you observe on multi-site tasks.
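The Knowledge Capture step above calls for a tracking template; a minimal sketch of one, assuming a simple CSV log keyed by task (all class, field, and file names here are illustrative, not from the benchmark's release):

```python
import csv
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one benchmark task (hypothetical schema)."""
    task_id: str
    steps: int
    rubric_score: float  # percent of rubric criteria satisfied, 0-100

    @property
    def efficiency(self) -> float:
        """Rubric score earned per step; 0.0 for empty trajectories."""
        return self.rubric_score / self.steps if self.steps > 0 else 0.0


def log_runs(records: list[RunRecord], path: str = "agent_runs.csv") -> None:
    """Append-style dump of run records with a derived efficiency column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["task_id", "steps", "rubric_score", "efficiency"])
        for r in records:
            writer.writerow([r.task_id, r.steps, r.rubric_score,
                             round(r.efficiency, 3)])
```

Keeping efficiency as a derived column, rather than a stored one, avoids the two drifting apart as runs are re-scored.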
Frequently asked
- Why is a new benchmark needed? Current benchmarks focus on short, single-site tasks where frontier models are already near saturation. Real web work—comparing products across domains, planning trips, synthesizing research—requires sustained context and cross-site reasoning over hours. Short benchmarks miss these long-horizon challenges and don't reveal the efficiency gaps that matter in production.
Cite
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. (2026, April 29). Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. "Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows." Astrobobo Content Engine, 29 Apr 2026, https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.24964.
@misc{astrobobo_web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218_2026,
author = {Jang, Lawrence Keunho and Koh, Jing Yu and Fried, Daniel and Salakhutdinov, Ruslan},
title = {Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.24964},
}