
Tur[k]ingBench: A Challenge Benchmark for Web Agents (2403.11905v3)

Published 18 Mar 2024 in cs.AI, cs.CL, cs.CV, and cs.HC

Abstract: Can advanced multi-modal models effectively tackle complex web-based tasks? Such tasks are often found on crowdsourcing platforms, where crowdworkers engage in challenging micro-tasks within web-based environments. Building on this idea, we present TurkingBench, a benchmark consisting of tasks presented as web pages with textual instructions and multi-modal contexts. Unlike previous approaches that rely on artificially synthesized web pages, our benchmark uses natural HTML pages originally designed for crowdsourcing workers to perform various annotation tasks. Each task's HTML instructions are instantiated with different values derived from crowdsourcing tasks, creating diverse instances. This benchmark includes 32.2K instances spread across 158 tasks. To support the evaluation of TurkingBench, we have developed a framework that links chatbot responses to actions on web pages (e.g., modifying a text box, selecting a radio button). We assess the performance of cutting-edge private and open-source models, including language-only and vision-language models (such as GPT-4 and InternVL), on this benchmark. Our results show that while these models outperform random chance, there is still significant room for improvement. We hope that this benchmark will drive progress in the evaluation and development of web-based agents.

TurkingBench: Evaluating Self-Supervised Models on Web-grounded Interaction Tasks

Introduction

The proliferation of conversational AI and chatbots has ushered in a new era of natural language understanding capabilities. Yet web interaction, which encapsulates a significant portion of human-computer activity, remains largely uncharted territory for many existing models. Herein, we discuss a recent development, the TurkingBench benchmark, designed to assess how well self-supervised models follow instructions grounded in web pages. TurkingBench stands out by utilizing natural HTML pages originally crafted for crowdsourcing, making it a more realistic representation of the tasks humans encounter on the web.

Benchmark Description

TurkingBench comprises 32.2K instances spread across 158 diverse tasks, with each task presenting a blend of multimodal information and requiring various types of interaction with web elements. The benchmark is distinctive for several reasons:

  • Realistic Task Design: Unlike benchmarks that rely on synthesized web pages, TurkingBench employs actual HTML pages from crowdsourcing platforms, offering a genuine slice of the web's complexity.
  • Multimodal Contexts: Tasks within TurkingBench necessitate understanding not just text but also visual cues, as tasks are embedded in web pages that feature images, tables, and styled elements to convey instructions.
  • Interactive Nature: The benchmark facilitates evaluation through an innovative framework that interprets model responses as actions on web pages, such as text input, checkbox selection, and button clicks; a sketch of such a mapping follows this list.
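
To make the interaction model concrete, below is a minimal sketch of how a predicted action could be applied to a live page. It assumes a Playwright-driven browser and a hypothetical action schema with `element_id`, `type`, and `value` fields; the benchmark's own action library may be organized differently.

```python
# Hypothetical sketch: apply one model-predicted action to a task page via Playwright.
# The action schema (element_id/type/value) is illustrative, not the benchmark's API.
from playwright.sync_api import Page


def apply_action(page: Page, action: dict) -> None:
    """Translate a predicted action into a browser operation on the task page."""
    selector = f"#{action['element_id']}"           # assume fields are addressed by HTML id
    if action["type"] == "fill_text":               # modify a text box
        page.fill(selector, action["value"])
    elif action["type"] == "select_radio":          # select a radio button
        page.check(selector)
    elif action["type"] == "toggle_checkbox":       # check or uncheck a box
        page.set_checked(selector, bool(action["value"]))
    elif action["type"] == "click":                 # press a button
        page.click(selector)
    else:
        raise ValueError(f"unsupported action type: {action['type']}")
```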

Evaluation Framework and Metrics

A comprehensive evaluation setup supports interaction with the benchmark, leveraging a Python-based library of "actions" that models can perform on the web pages. This setup offers a realistic environment for testing model capabilities in web navigation and task completion. The evaluation metrics are adapted to the nature of the task responses, ranging from free-form text to selections: ROUGE scores for text fields, and exact-match or set-intersection measures for choice-based inputs, as sketched below.
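
As a rough illustration of these metric families, the following sketch scores a single form field by its type. The field-type names and the use of the `rouge_score` package are assumptions for illustration, not the benchmark's exact evaluation code.

```python
# Illustrative per-field scoring, mirroring the metric families described above.
# Field-type names are hypothetical; the benchmark's own code may differ.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def score_field(field_type: str, prediction, reference) -> float:
    if field_type == "text":                        # free-form text box -> ROUGE
        return _scorer.score(reference, prediction)["rougeL"].fmeasure
    if field_type in ("radio", "select"):           # single choice -> exact match
        return float(prediction == reference)
    if field_type == "checkbox":                    # multiple choice -> set intersection
        pred, ref = set(prediction), set(reference)
        if not pred and not ref:
            return 1.0
        return len(pred & ref) / len(pred | ref)
    raise ValueError(f"unknown field type: {field_type}")
```

An overall instance score could then be obtained by aggregating the per-field scores, for example by averaging them.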

Empirical Insights

Preliminary evaluation of leading self-supervised models, including GPT-4 and its variants, reveals several insights:

  • Performance Gap: While models like GPT-4 improve markedly over random baselines, achieving scores as high as 41.7%, a sizeable gap remains relative to an oracle model, leaving substantial room for improvement.
  • Modality Sensitivity: The evaluation indicates that tasks benefit variably from models capable of processing multimodal inputs, hinting at the intricate role visual elements play in web-based tasks.
  • Prompt Length and Context Sensitivity: The performance variance between models given only the "relevant" HTML versus the "full" page highlights their sensitivity to prompt length and to the granularity of the provided context (see the sketch after this list).
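
To illustrate the "relevant" versus "full" distinction, here is a minimal sketch of one way such prompt variants might be constructed; the `target_ids` argument and the id-based pruning are assumptions for illustration, not the benchmark's own preprocessing.

```python
# Illustrative sketch: build a pruned ("relevant") prompt vs. a full-page prompt.
# target_ids is a hypothetical list of the form-field ids the task asks about.
from bs4 import BeautifulSoup


def relevant_html(page_html: str, target_ids: list[str]) -> str:
    """Keep only the elements the task actually asks about."""
    soup = BeautifulSoup(page_html, "html.parser")
    nodes = (soup.find(id=element_id) for element_id in target_ids)
    return "\n".join(str(node) for node in nodes if node is not None)


def full_html(page_html: str) -> str:
    """Use the entire page, which may exceed a model's context window."""
    return page_html
```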

Implications and Future Directions

The TurkingBench benchmark, with its focus on web-grounded tasks, opens new avenues for understanding and enhancing the capabilities of AI models in human-computer interaction on the web. The notable performance gap exhibited by leading models underscores the complexity of web-based tasks and the need for models that can adeptly navigate and interact with such environments.

Future work could explore the integration of specialized web navigation models, the development of models that can effectively handle longer or more dense inputs, and techniques to better incorporate multimodal information. The ultimate aim is to bridge the gap in performance between AI models and the oracle, thereby pushing the boundaries of what AI can achieve in web-based interaction tasks.

TurkingBench not only offers a challenging testbed for current models but also lays down the groundwork for progressive enhancements in AI's understanding and interaction with the web, paving the way for more intuitive and efficient human-computer interfaces.

Authors (10)
  1. Kevin Xu
  2. Yeganeh Kordi
  3. Kate Sanders
  4. Yizhong Wang
  5. Adam Byerly
  6. Benjamin Van Durme
  7. Daniel Khashabi
  8. Tanay Nayak
  9. Ado Asija
  10. Jingyu Zhang
Citations (2)