CogAgent: A Visual Language Model for GUI Agents (2312.08914v2)

Published 14 Dec 2023 in cs.CV

Abstract: People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist VLM, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW -- advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM .

Introduction to Visual Language Models

Advances in AI have produced visual language models (VLMs) that can interpret and navigate graphical user interfaces (GUIs), a central part of digital interaction today. These agents offer a new way to assist users in operating computers and smartphones through their screens.

The Rise of CogAgent

CogAgent is an 18-billion-parameter VLM that specializes in understanding and automating tasks within GUI environments. Unlike models constrained by low input resolution or reliant on extracted text such as HTML, CogAgent is engineered to operate on high-resolution screenshots, allowing it to recognize small GUI elements and interpret text embedded in images.

Architectural Advancements

CogAgent builds on an existing VLM foundation (CogVLM) but introduces a novel high-resolution cross-module. This lets the model process higher image resolutions without a prohibitive increase in computational cost: a lightweight high-resolution branch feeds into the language decoder through cross-attention with a reduced hidden size. By combining low-resolution and high-resolution image encoders, CogAgent handles the fine-grained visual features found in GUIs, such as icons and embedded text.
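
As a rough illustration of this design, the sketch below shows how high-resolution image tokens could be injected into a decoder's hidden states through a narrow cross-attention layer. It is a minimal sketch only: the class name, dimensions, and projection layout are assumptions chosen for clarity, not the released CogAgent implementation.

```python
# Minimal sketch of a dual-resolution cross-attention module in the spirit of
# CogAgent's high-resolution cross-module. Hyperparameters and names are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class HighResCrossAttention(nn.Module):
    """Cross-attention from decoder hidden states to high-resolution image tokens.

    A small attention width keeps the extra cost modest even though the
    high-resolution branch produces many image tokens.
    """

    def __init__(self, hidden_dim: int, cross_dim: int, num_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, cross_dim)      # down-project queries
        self.kv_proj = nn.Linear(cross_dim, 2 * cross_dim)  # keys and values
        self.attn = nn.MultiheadAttention(cross_dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(cross_dim, hidden_dim)    # back to decoder width

    def forward(self, hidden_states, hires_tokens):
        q = self.q_proj(hidden_states)                      # (B, T, cross_dim)
        k, v = self.kv_proj(hires_tokens).chunk(2, dim=-1)  # (B, N_hi, cross_dim) each
        attn_out, _ = self.attn(q, k, v)
        # Residual connection back into the decoder stream.
        return hidden_states + self.out_proj(attn_out)


if __name__ == "__main__":
    B, T, N_hi = 2, 16, 1600           # batch, text tokens, high-res image tokens
    hidden_dim, cross_dim = 4096, 1024  # decoder width vs. narrow cross-attention width
    layer = HighResCrossAttention(hidden_dim, cross_dim)
    decoder_states = torch.randn(B, T, hidden_dim)
    hires_tokens = torch.randn(B, N_hi, cross_dim)
    print(layer(decoder_states, hires_tokens).shape)  # torch.Size([2, 16, 4096])
```

Because the cross-attention width (`cross_dim`) is much smaller than the decoder's hidden size, attending over a large number of high-resolution tokens adds only a modest overhead per decoder layer.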

Training and Evaluation

To train CogAgent, the researchers constructed large-scale pre-training datasets focused on text recognition across varied fonts and sizes, visual grounding, and GUI-specific imagery such as page elements and layouts. CogAgent was then evaluated on text-rich and general visual question-answering (VQA) benchmarks as well as GUI navigation tasks on PC (Mind2Web) and Android (AITW), where it achieved leading performance using screenshots alone.
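
To make the GUI-navigation setting concrete, the snippet below sketches what a single navigation sample might look like: a screenshot, a task instruction, prior actions, and the next action to predict. The field names and the coordinate-based action format are illustrative assumptions, not the actual schema of Mind2Web, AITW, or CogAgent's training data.

```python
# Purely illustrative sketch of a GUI-navigation training sample; field names
# and the action format are assumptions, not the paper's or datasets' schema.
from dataclasses import dataclass


@dataclass
class GUISample:
    screenshot_path: str   # path to a high-resolution (e.g. 1120x1120) screenshot
    instruction: str       # natural-language task for the episode
    history: list[str]     # previous actions taken in the episode
    target_action: str     # next action the model should emit as text


sample = GUISample(
    screenshot_path="screens/step_03.png",
    instruction="Open the settings app and enable dark mode.",
    history=["tap(0.52, 0.91)", "scroll(down)"],
    target_action="tap(0.18, 0.34)",   # normalized (x, y) on the screenshot
)

# The model is prompted with the instruction, the action history, and the
# screenshot, and is trained to generate `target_action` as plain text.
print(sample.target_action)
```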

The Future of AI Agents and VLMs

CogAgent marks a significant stride for AI agents and VLMs. With its high-resolution input capabilities and efficient architecture, it points toward increasingly automated, AI-assisted interaction with digital devices and opens avenues for further research and applications.

Authors (14)
  1. Wenyi Hong
  2. Weihan Wang
  3. Qingsong Lv
  4. Jiazheng Xu
  5. Wenmeng Yu
  6. Junhui Ji
  7. Yan Wang
  8. Zihan Wang
  9. Yuxiao Dong
  10. Ming Ding
  11. Jie Tang
  12. Yuxuan Zhang
  13. Juanzi Li
  14. Bin Xu