AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? (2407.15711v2)

Published 22 Jul 2024 in cs.CL

Abstract: Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 26 points. While closed-book LMs perform well in terms of accuracy, they exhibit low precision and tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that open web navigation remains a major challenge.

Authors (6)
  1. Ori Yoran (13 papers)
  2. Samuel Joseph Amouyal (5 papers)
  3. Chaitanya Malaviya (24 papers)
  4. Ben Bogin (22 papers)
  5. Ofir Press (21 papers)
  6. Jonathan Berant (107 papers)
Citations (10)