TurkingBench: Evaluating Self-Supervised Models on Web-grounded Interaction Tasks
Introduction
The proliferation of conversational AI and chatbots has ushered in a new era of natural language understanding capabilities. Yet web interaction, which accounts for a significant share of human-computer interaction, remains largely uncharted territory for existing models. Here we discuss a recent development, the TurkingBench benchmark, designed to assess how well self-supervised models follow instructions grounded in web pages. TurkingBench stands out by using natural HTML pages originally crafted for crowdsourcing, making it a more realistic representation of the tasks humans encounter on the web.
Benchmark Description
TurkingBench comprises 158 diverse tasks spanning 32.2K instances, with each task presenting a blend of multimodal information and requiring various kinds of interaction with web elements. The benchmark is unique for several reasons:
- Realistic Task Design: Unlike benchmarks that rely on synthesized web pages, TurkingBench employs actual HTML pages from crowdsourcing platforms, offering a genuine slice of the web's complexity.
- Multimodal Contexts: Tasks within TurkingBench necessitate understanding not just text but also visual cues, as tasks are embedded in web pages that feature images, tables, and styled elements to convey instructions.
- Interactive Nature: The benchmark is evaluated through a framework that interprets model responses as actions on web pages, such as entering text, selecting checkboxes, and clicking buttons (a minimal sketch of such actions follows this list).
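To make the interactive setup concrete, the sketch below shows the kind of browser-level actions such a framework might expose. This is not TurkingBench's actual API: the function names, element names, and file path are hypothetical, and Selenium is used here only as a plausible backend.

```python
# A minimal sketch (not the benchmark's actual API) of an action library
# for interacting with a task's HTML page. All names below are illustrative.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("file:///path/to/task_page.html")  # hypothetical local task page

def modify_text(field_name: str, value: str) -> None:
    """Type a free-form answer into a named <input> or <textarea>."""
    field = driver.find_element(By.NAME, field_name)
    field.clear()
    field.send_keys(value)

def modify_checkbox(field_name: str, checked: bool) -> None:
    """Set a named checkbox to the desired state."""
    box = driver.find_element(By.NAME, field_name)
    if box.is_selected() != checked:
        box.click()

def click_button(button_id: str) -> None:
    """Click a button on the page, e.g. to submit the task."""
    driver.find_element(By.ID, button_id).click()
```

A model's response can then be parsed into a sequence of such calls, which is what lets the evaluation replay model outputs directly against the web page.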
Evaluation Framework and Metrics
A comprehensive evaluation setup supports interaction with the benchmark, built around a Python-based library of "actions" that models can perform on the web pages. This setup offers a realistic environment for testing model capabilities in web navigation and task completion. The evaluation metrics are matched to the type of response each field expects: ROUGE scores for free-form text, and exact-match or set-intersection measures for choice-based inputs.
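As a rough illustration of those metric families, the sketch below scores one field per response type. The dispatch and function names are assumptions made for illustration; only the metric families (ROUGE for text, exact match and set intersection for selections) come from the description above, and the rouge_score package is one readily available implementation.

```python
# Illustrative per-field scorers, assuming one scoring function per field type.
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def score_text(prediction: str, reference: str) -> float:
    """ROUGE-L F1 for free-form text fields."""
    return _rouge.score(reference, prediction)["rougeL"].fmeasure

def score_single_choice(prediction: str, reference: str) -> float:
    """Exact match for single-selection fields (e.g. radio buttons)."""
    return float(prediction.strip().lower() == reference.strip().lower())

def score_multi_choice(prediction: set[str], reference: set[str]) -> float:
    """Set-intersection-over-union for multi-selection fields (e.g. checkboxes)."""
    if not prediction and not reference:
        return 1.0
    return len(prediction & reference) / len(prediction | reference)
```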
Empirical Insights
Preliminary evaluation of leading self-supervised models, including GPT-4 and its variants, reveals several insights:
- Performance Gap: While models like GPT-4 improve significantly over random baselines, scoring as high as 41.7%, a considerable gap remains relative to an oracle model, leaving substantial room for improvement.
- Modality Sensitivity: Tasks benefit to varying degrees from models that can process multimodal inputs, pointing to the role visual elements play in web-based tasks.
- Prompt Length and Context Sensitivity: The performance difference between models given "relevant" versus "full" HTML content highlights their sensitivity to prompt length and to the granularity of the provided context (a rough sketch of this distinction follows this list).
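To illustrate the "relevant" versus "full" HTML distinction, the sketch below trims a page down to its form controls. Which tags count as "relevant" is an assumption made for illustration, not the benchmark's definition; BeautifulSoup is used simply as a convenient HTML parser.

```python
# Illustrative only: approximate a "relevant" HTML view by keeping form
# controls and labels, and dropping the rest of the page.
from bs4 import BeautifulSoup

def relevant_html(full_html: str) -> str:
    """Return only the form-related elements of a task page."""
    soup = BeautifulSoup(full_html, "html.parser")
    fields = soup.find_all(["input", "textarea", "select", "label"])
    return "\n".join(str(f) for f in fields)
```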
Implications and Future Directions
The TurkingBench benchmark, with its focus on web-grounded tasks, opens new avenues for understanding and enhancing the capabilities of AI models in the domain of human-computer interaction on the web. The notable performance gap exhibited by leading models underscores the complexity of web-based tasks and the need for models that can adeptly navigate and interact with such environments.
Future work could explore the integration of specialized web navigation models, the development of models that can effectively handle longer or denser inputs, and techniques to better incorporate multimodal information. The ultimate aim is to close the gap between current models and the oracle, pushing the boundaries of what AI can achieve in web-based interaction tasks.
TurkingBench not only offers a challenging testbed for current models but also lays the groundwork for progressive improvements in AI's understanding of and interaction with the web, paving the way for more intuitive and efficient human-computer interfaces.