ai · 8 min read · Apr 29, 2026

Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows

New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.

Source: arxiv/cs.LG · Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Odysseys benchmark tests web agents on realistic multi-site, multi-hour tasks; frontier models succeed 44.5% of the time with poor efficiency.

  • Existing benchmarks use short, single-site tasks on which frontier models are already near saturation.
  • Real web work requires sustained context across multiple domains over hours.
  • Odysseys contains 200 tasks from actual browsing sessions evaluated live.
  • Binary pass/fail metrics are inadequate; rubric-based evaluation, averaging 6.1 graders per task, provides a finer-grained signal.
  • Frontier models achieve 44.5% success rate on long-horizon tasks.
  • The Trajectory Efficiency metric shows agents succeed slowly, earning only 1.15% of rubric score per step.
  • Efficiency matters as much as correctness for practical agent deployment.
  • Benchmark released with tasks, evaluation code, and results.
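To make the two headline numbers concrete, here is a minimal sketch of how they could be computed. This assumes Trajectory Efficiency is simply rubric score earned per agent step, which is our reading of the bullet above, not the paper's exact formula; the example values are hypothetical.

```python
def success_rate(results: list[bool]) -> float:
    """Fraction of tasks whose rubric outcome counts as a pass."""
    return sum(results) / len(results)


def trajectory_efficiency(rubric_score: float, steps: int) -> float:
    """Rubric score (0.0-1.0) earned per step, as a percentage."""
    return 100.0 * rubric_score / steps


# A hypothetical run: full rubric credit after 87 steps works out to
# roughly the 1.15% per step the benchmark reports on average.
print(f"{trajectory_efficiency(1.0, 87):.2f}% per step")
```

Under this reading, an agent can have a respectable success rate while still being impractical to deploy, because each success burns a long trajectory of browser actions.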

Astrobobo tool mapping

  • Reading Queue: Add the Odysseys paper and benchmark website to your queue; schedule 30 min to review task categories and failure patterns.
  • Knowledge Capture: Document your agent's step count and rubric score on a sample task; create a template for tracking efficiency metrics across future runs.
  • Focus Brief: Summarize the gap between your agent's success rate and 44.5%, and list the top 3 failure modes you observe on multi-site tasks.
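The tracking template suggested above could be as simple as one record per run. This is a hypothetical sketch (the record fields and names are ours, not from the paper): log each run's step count and rubric score so per-step efficiency can be compared across agents and tasks.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one benchmark task."""
    task_id: str
    agent: str
    steps: int
    rubric_score: float  # fraction of rubric items satisfied, 0.0-1.0

    @property
    def efficiency_pct(self) -> float:
        """Rubric score earned per step, in percent."""
        return 100.0 * self.rubric_score / self.steps


# Hypothetical runs on the same task: same efficiency, different coverage.
runs = [
    RunRecord("trip-planning-01", "my-agent", steps=120, rubric_score=0.6),
    RunRecord("trip-planning-01", "baseline", steps=80, rubric_score=0.4),
]
for r in runs:
    print(f"{r.agent}: {r.efficiency_pct:.2f}% per step")
```

Keeping both numbers per run, rather than only pass/fail, is what lets you see whether an agent is failing outright or merely succeeding slowly.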

Frequently asked

  • Why is a new benchmark needed? Current benchmarks focus on short, single-site tasks where frontier models are already near saturation. Real web work—comparing products across domains, planning trips, synthesizing research—requires sustained context and cross-site reasoning over hours. Short benchmarks miss these long-horizon challenges and don't reveal efficiency gaps that matter in production.
Cite
APA
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. (2026, April 29). Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218
MLA
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. "Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows." Astrobobo Content Engine, 29 Apr 2026, https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.24964.
BibTeX
@misc{astrobobo_web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218_2026,
  author       = {Jang, Lawrence Keunho and Koh, Jing Yu and Fried, Daniel and Salakhutdinov, Ruslan},
  title        = {Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows},
  year         = {2026},
  url          = {https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218},
  note         = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.24964},
}
