CogAgent: A Visual Language Model for GUI Agents (2312.08914v2)

Published 14 Dec 2023 in cs.CV

Abstract: People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist VLM, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW -- advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM .

Introduction to Visual Language Models

Advances in AI have produced visual language models (VLMs) that can interpret and navigate graphical user interfaces (GUIs), a central part of digital interaction today. These agents offer a new way to assist users in operating computers and smartphones through their screens.

The Rise of CogAgent

CogAgent is an 18-billion-parameter VLM that specializes in understanding and automating tasks within GUI environments. Unlike models constrained by low input resolution or reliant on extracted text such as HTML, CogAgent is engineered to operate on high-resolution screenshots, allowing it to recognize small GUI elements and interpret text embedded in images.

Architectural Advancements

CogAgent builds on an existing VLM foundation (CogVLM) but introduces a novel high-resolution cross-module. This lets the model process higher image resolutions without a prohibitive increase in computational cost: a lightweight high-resolution branch feeds into the language decoder through cross-attention with a reduced hidden size. By combining low-resolution and high-resolution image encoders, CogAgent handles the fine-grained visual features found in GUIs, such as icons and embedded text.
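
As a rough illustration of this design, the sketch below shows how high-resolution image tokens could be injected into a decoder's hidden states through a narrow cross-attention layer. It is a minimal sketch only: the class name, dimensions, and projection layout are assumptions chosen for clarity, not the released CogAgent implementation.

```python
# Minimal sketch of a dual-resolution cross-attention module in the spirit of
# CogAgent's high-resolution cross-module. Hyperparameters and names are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class HighResCrossAttention(nn.Module):
    """Cross-attention from decoder hidden states to high-resolution image tokens.

    A small attention width keeps the extra cost modest even though the
    high-resolution branch produces many image tokens.
    """

    def __init__(self, hidden_dim: int, cross_dim: int, num_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, cross_dim)      # down-project queries
        self.kv_proj = nn.Linear(cross_dim, 2 * cross_dim)  # keys and values
        self.attn = nn.MultiheadAttention(cross_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(cross_dim, hidden_dim)    # back to decoder width

    def forward(self, hidden_states, hires_tokens):
        q = self.q_proj(hidden_states)                      # (B, T, cross_dim)
        k, v = self.kv_proj(hires_tokens).chunk(2, dim=-1)  # (B, N_hi, cross_dim) each
        attn_out, _ = self.attn(q, k, v)
        # Residual connection back into the decoder stream.
        return hidden_states + self.out_proj(attn_out)


if __name__ == "__main__":
    B, T, N_hi = 2, 16, 1600           # batch, text tokens, high-res image tokens
    hidden_dim, cross_dim = 4096, 1024  # decoder width vs. narrow cross-attention width
    layer = HighResCrossAttention(hidden_dim, cross_dim)
    decoder_states = torch.randn(B, T, hidden_dim)
    hires_tokens = torch.randn(B, N_hi, cross_dim)
    print(layer(decoder_states, hires_tokens).shape)  # torch.Size([2, 16, 4096])
```

Because the cross-attention width (`cross_dim`) is much smaller than the decoder's hidden size, attending over a large number of high-resolution tokens adds only a modest overhead per decoder layer.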

Training and Evaluation

To train CogAgent, the researchers constructed large-scale pre-training datasets focused on text recognition across varied fonts and sizes, visual grounding, and GUI-specific imagery such as page elements and layouts. CogAgent was then evaluated on text-rich and general visual question-answering (VQA) benchmarks as well as GUI navigation tasks on PC (Mind2Web) and Android (AITW), where it achieved leading performance using screenshots alone.
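
To make the GUI-navigation setting concrete, the snippet below sketches what a single navigation sample might look like: a screenshot, a task instruction, prior actions, and the next action to predict. The field names and the coordinate-based action format are illustrative assumptions, not the actual schema of Mind2Web, AITW, or CogAgent's training data.

```python
# Purely illustrative sketch of a GUI-navigation training sample; field names
# and the action format are assumptions, not the paper's or datasets' schema.
from dataclasses import dataclass


@dataclass
class GUISample:
    screenshot_path: str   # path to a high-resolution (e.g. 1120x1120) screenshot
    instruction: str       # natural-language task for the episode
    history: list[str]     # previous actions taken in the episode
    target_action: str     # next action the model should emit as text


sample = GUISample(
    screenshot_path="screens/step_03.png",
    instruction="Open the settings app and enable dark mode.",
    history=["tap(0.52, 0.91)", "scroll(down)"],
    target_action="tap(0.18, 0.34)",   # normalized (x, y) on the screenshot
)

# The model is prompted with the instruction, the action history, and the
# screenshot, and is trained to generate `target_action` as plain text.
print(sample.target_action)
```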

The Future of AI Agents and VLMs

CogAgent marks a significant stride for AI agents and VLMs. With its high-resolution input capabilities and efficient architecture, it points toward increasingly automated, AI-assisted interaction with digital devices and opens avenues for further research and applications.

Authors (14)
  1. Wenyi Hong
  2. Weihan Wang
  3. Qingsong Lv
  4. Jiazheng Xu
  5. Wenmeng Yu
  6. Junhui Ji
  7. Yan Wang
  8. Zihan Wang
  9. Yuxiao Dong
  10. Ming Ding
  11. Jie Tang
  12. Yuxuan Zhang
  13. Juanzi Li
  14. Bin Xu