Infogent: An Agent-Based Framework for Web Information Aggregation (2410.19054v1)

Published 24 Oct 2024 in cs.AI and cs.CL

Abstract: Despite seemingly performant web agents on the task-completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-Driven Access relies on a text-only view of the Web, leveraging external tools such as Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.


Summary

  • The paper introduces a modular framework, Infogent, that integrates a Navigator, Extractor, and Aggregator for web information aggregation.
  • The paper reports gains over existing methods on two benchmarks: 7% on FRAMES under Direct API-Driven Access and 4.3% on AssistantBench under Interactive Visual Access.
  • The paper highlights Infogent’s potential to enhance automated information synthesis in complex and visually demanding web environments.

Infogent: An Agent-Based Framework for Web Information Aggregation

The paper "Infogent: An Agent-Based Framework for Web Information Aggregation" introduces Infogent, a modular framework designed to address complex information aggregation tasks on the web. Unlike traditional web navigation tasks that focus on linear sequences leading to predefined goals, Infogent emphasizes exploratory web navigation necessary for comprehensive information gathering. This framework is particularly relevant for tasks that require synthesizing information from multiple sources to answer complex queries.

Framework Overview and Components

Infogent consists of three key components: Navigator, Extractor, and Aggregator, each playing a distinct role in information aggregation. The Navigator is tasked with exploring the web to identify relevant websites. It operates under two distinct information access settings: Direct API-Driven Access and Interactive Visual Access.

  • Direct API-Driven Access employs a tool-based LLM agent that works from a text-only view of the web. In this setting, the Navigator calls a search API to find pages and a scraper to retrieve their contents, and it chooses its next search based on feedback from the Aggregator. (Figure 1)

    Figure 1: Overview of Infogent under the Direct API Access and Interactive Visual Access settings: The Navigator uses a tool-based LLM and a browser-controlling VLM as the web agent respectively, with the Aggregator's textual feedback guiding further navigation.

  • Interactive Visual Access mimics human-like browser interaction, using a multimodal web agent that works from screenshots to navigate visually complex web pages. The Navigator issues browser actions such as clicking, typing, and pressing enter, and it can backtrack to an earlier page when feedback indicates the current path is unproductive. (Figure 2)

    Figure 2: A working example of Infogent. $\mathcal{NG}$ iteratively generates an updated query given feedback from $\mathcal{AG}$.
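
The backtracking behavior in the Interactive Visual Access setting can be illustrated with a toy sketch. This is not the paper's implementation: `ToyBrowser`, `explore`, the link graph, and the `is_relevant` callback (a stand-in for feedback about whether a page is useful) are all hypothetical names introduced here, and a real agent would act on screenshots rather than on a dictionary of links.

```python
from typing import Callable, Dict, List, Optional, Set


class ToyBrowser:
    """Hypothetical stand-in for a browser session over a small link graph."""

    def __init__(self, links: Dict[str, Dict[str, str]], start: str) -> None:
        self.links = links          # page -> {link label: target page}
        self.history = [start]      # stack of visited pages, newest last

    @property
    def current(self) -> str:
        return self.history[-1]

    def click(self, label: str) -> None:
        """Follow a link on the current page and push the target onto history."""
        target = self.links.get(self.current, {}).get(label)
        if target is None:
            raise KeyError(f"no link {label!r} on page {self.current!r}")
        self.history.append(target)

    def go_back(self) -> None:
        """Backtrack one page, never past the start page."""
        if len(self.history) > 1:
            self.history.pop()


def explore(
    browser: ToyBrowser,
    is_relevant: Callable[[str], bool],
    visited: Optional[Set[str]] = None,
) -> List[str]:
    """Depth-first exploration that backtracks after each followed link."""
    visited = set() if visited is None else visited
    visited.add(browser.current)
    found = [browser.current] if is_relevant(browser.current) else []
    for label in sorted(browser.links.get(browser.current, {})):
        if browser.links[browser.current][label] in visited:
            continue
        browser.click(label)
        found += explore(browser, is_relevant, visited)
        browser.go_back()  # return to the page the click started from
    return found


if __name__ == "__main__":
    graph = {"home": {"docs": "docs", "news": "news"}, "docs": {"api": "api"}}
    session = ToyBrowser(graph, "home")
    print(explore(session, lambda page: page in {"api", "news"}))
```

The history stack is the only state needed for backtracking here; the choice of depth-first order is an assumption of the sketch, not something the source specifies.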

The Extractor in both settings is responsible for identifying and extracting relevant content from selected web pages, using LLMs for textual data and multimodal models for screenshot-based data extraction in visual access scenarios.

The Aggregator assesses extracted content, updates the information stack, and provides feedback to the Navigator, enabling adaptive exploration and information synthesis. This feedback-driven interaction ensures dynamic adjustment in aggregation strategies.
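
The interaction between the three components can be sketched as a simple loop. The following is a minimal illustration under stated assumptions, not the authors' code: the in-memory `TOY_WEB` replaces a real search API and scraper, keyword matching replaces the LLM-based Extractor, and the names `navigate`, `extract`, `Aggregator`, and `run` as well as the "missing: ..." feedback format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical toy "web": topic -> page text. Stands in for search plus scraping.
TOY_WEB: Dict[str, str] = {
    "navigator": "The Navigator explores the web. Buy shoes now.",
    "extractor": "The Extractor pulls relevant passages from a page.",
    "aggregator": "Cookie notice. The Aggregator keeps an information stack.",
}


def navigate(query: str) -> List[Tuple[str, str]]:
    """Navigator stand-in: return (url, text) for pages whose topic is in the query."""
    return [
        (f"https://example.org/{topic}", text)
        for topic, text in TOY_WEB.items()
        if topic in query.lower()
    ]


def extract(page_text: str, topic: str) -> List[str]:
    """Extractor stand-in: keep only sentences that mention the topic."""
    return [s.strip() for s in page_text.split(".") if topic in s.lower()]


@dataclass
class Aggregator:
    """Keeps the information stack and tells the Navigator what is still missing."""

    required: List[str]
    stack: Dict[str, str] = field(default_factory=dict)

    def update(self, topic: str, snippets: List[str]) -> Optional[str]:
        if snippets and topic not in self.stack:
            self.stack[topic] = snippets[0]
        missing = [t for t in self.required if t not in self.stack]
        # Textual feedback drives the next query; None means nothing is missing.
        return f"missing: {missing[0]}" if missing else None


def run(required: List[str], max_steps: int = 10) -> Dict[str, str]:
    """Alternate navigation, extraction, and aggregation until done or out of steps."""
    aggregator = Aggregator(required)
    feedback: Optional[str] = f"missing: {required[0]}"
    steps = 0
    while feedback is not None and steps < max_steps:
        topic = feedback[len("missing: "):]
        for _url, text in navigate(f"infogent {topic}"):
            feedback = aggregator.update(topic, extract(text, topic))
        steps += 1
    return aggregator.stack


if __name__ == "__main__":
    print(run(["navigator", "extractor", "aggregator"]))
```

The `max_steps` bound is a design choice of this sketch: it guarantees termination when a query returns no pages and the feedback never changes.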

Experimental Results

Infogent's efficacy was demonstrated on datasets requiring complex reasoning and multi-document aggregation. The framework outperformed existing state-of-the-art methods for both API-driven and interactive visual access tasks.

  • Direct API-Driven Access: On FRAMES, Infogent improved over an existing SOTA multi-agent search framework by 7%, indicating that it aggregates information from multiple sources effectively in the text-only setting.

(Table 1)

Table 1: Results (in %) on the FRAMES dataset for queries with different reasoning types under Direct API-Driven Access setting.

  • Interactive Visual Access: On AssistantBench, where visual interaction with the web is crucial, Infogent improved over an existing information-seeking web agent by 4.3%, supporting its ability to handle complex, information-dense webpages.

(Table 2)

Table 2: Accuracy (in %) on AssistantBench in Interactive Visual Access Setting.

Implications and Future Directions

Infogent represents a significant advance in web navigation technology by addressing the challenges of information aggregation in both text-based and visually complex web environments. It provides a foundation for further research into improving web-based information synthesis, offering potential applications in areas such as automated report generation, comprehensive data collection for analysis, and enhanced search systems.

Future work will explore expanding Infogent's capabilities to handle more diverse and dynamic web environments, improving modular component interoperability, and incorporating real-time learning to adapt to rapidly changing web contexts.

Conclusion

Infogent showcases a novel approach to web information aggregation by using autonomous agents to interact with and extract diverse data from the web. Its modular design offers flexibility and adaptability across different web access settings, suggesting practical applications in varied fields requiring comprehensive data synthesis. Its ability to outperform existing frameworks highlights its potential as a robust tool in complex information aggregation tasks.
