Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows
New benchmark reveals frontier AI models achieve only 44.5% success on long-horizon web tasks spanning multiple sites, exposing efficiency gaps in agent design.
Odysseys benchmark tests web agents on realistic multi-site, multi-hour tasks; frontier models succeed 44.5% of the time with poor efficiency.
- Existing benchmarks use short, single-site tasks on which frontier models are already near saturation.
- Real web work requires sustained context across multiple domains over hours.
- Odysseys contains 200 tasks drawn from actual browsing sessions and evaluated on the live web.
- Binary pass/fail metrics are inadequate; rubric-based evaluation, with 6.1 graders per task, provides a finer-grained signal.
- Frontier models achieve only a 44.5% success rate on long-horizon tasks.
- The Trajectory Efficiency metric shows agents succeed slowly, earning only 1.15% of the rubric score per step.
- Efficiency matters as much as correctness for practical agent deployment.
- The benchmark is released with tasks, evaluation code, and results.
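The paper's exact formulation of Trajectory Efficiency isn't reproduced here; a minimal sketch, assuming the metric is simply the final rubric score divided by the number of agent steps (the function name and example numbers are hypothetical, chosen so the ratio matches the reported 1.15% per step):

```python
def trajectory_efficiency(rubric_score: float, num_steps: int) -> float:
    """Rubric score earned per agent step.

    rubric_score: percentage of rubric criteria satisfied (0-100).
    num_steps: number of actions the agent took on the task.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score / num_steps


# Hypothetical example: an agent that satisfies 46% of the rubric
# in 40 steps earns 1.15% of rubric score per step.
print(trajectory_efficiency(46.0, 40))  # 1.15
```

The point of normalizing by steps is that two agents with identical rubric scores can differ sharply in cost: the one that needed half the steps is twice as efficient to deploy.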
Astrobobo tool mapping
- Reading Queue: Add the Odysseys paper and benchmark website to your queue; schedule 30 minutes to review task categories and failure patterns.
- Knowledge Capture: Document your agent's step count and rubric score on a sample task; create a template for tracking efficiency metrics across future runs.
- Focus Brief: Summarize the gap between your agent's success rate and the 44.5% frontier baseline, and list the top three failure modes you observe on multi-site tasks.
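The Knowledge Capture step above calls for a tracking template; a minimal sketch of one, assuming a simple CSV log keyed by task (all class, field, and file names here are illustrative, not from the benchmark's release):

```python
import csv
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one benchmark task (hypothetical schema)."""
    task_id: str
    steps: int
    rubric_score: float  # percent of rubric criteria satisfied, 0-100

    @property
    def efficiency(self) -> float:
        """Rubric score earned per step; 0.0 for empty trajectories."""
        return self.rubric_score / self.steps if self.steps > 0 else 0.0


def log_runs(records: list[RunRecord], path: str = "agent_runs.csv") -> None:
    """Append-style dump of run records with a derived efficiency column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["task_id", "steps", "rubric_score", "efficiency"])
        for r in records:
            writer.writerow([r.task_id, r.steps, r.rubric_score,
                             round(r.efficiency, 3)])
```

Keeping efficiency as a derived column, rather than a stored one, avoids the two drifting apart as runs are re-scored.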
Frequently asked
- Why is a new benchmark needed? Current benchmarks focus on short, single-site tasks where frontier models are already near saturation. Real web work—comparing products across domains, planning trips, synthesizing research—requires sustained context and cross-site reasoning over hours. Short benchmarks miss these long-horizon challenges and don't reveal the efficiency gaps that matter in production.
Cite
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. (2026, April 29). Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows. Astrobobo Content Engine (rewrite of arxiv/cs.LG). https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218
Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov. "Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows." Astrobobo Content Engine, 29 Apr 2026, https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218. Based on "arxiv/cs.LG", https://arxiv.org/abs/2604.24964.
@misc{astrobobo_web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218_2026,
author = {Jang, Lawrence Keunho and Koh, Jing Yu and Fried, Daniel and Salakhutdinov, Ruslan},
title = {Web agents plateau on short tasks; Odysseys benchmark tests realistic multi-hour workflows},
year = {2026},
url = {https://astrobobo-content-engine.vercel.app/article/web-agents-plateau-on-short-tasks-odysseys-benchmark-tests-realistic-multi-hour--23c218},
note = {Astrobobo rewrite of arxiv/cs.LG, https://arxiv.org/abs/2604.24964},
}