Visual grounding for desktop graphical user interfaces (2407.01558v2)

Published 5 May 2024 in cs.HC and cs.AI

Abstract: Most instance perception and image understanding solutions focus on natural images. Applications to synthetic images, and more specifically to images of Graphical User Interfaces (GUIs), remain limited. This hinders the development of autonomous computer-vision-powered AI agents. In this work, we present Instruction Visual Grounding (IVG), a multi-modal solution for object identification in a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction would be executed. To this end, we develop two methods. The first is a three-part architecture that relies on a combination of an LLM and an object detection model. The second uses a multi-modal foundation model.
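
To make the task's input and output concrete, here is a minimal Python sketch of the grounding interface the abstract describes: an instruction plus a set of on-screen elements goes in, and a pair of screen coordinates comes out. It is not the paper's method. The `Element` structure, the `ground_instruction` function, and the word-overlap matching are all assumptions made for illustration, standing in for the LLM and object detection components the abstract mentions.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A detected GUI element (hypothetical structure, not from the paper)."""
    label: str   # text or class name attached to the element
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels


def ground_instruction(instruction, elements):
    """Return the centre (x, y) of the element whose label shares the most
    words with the instruction, or None if no label shares any word.

    Word overlap is a toy stand-in for a learned matching model.
    """
    words = set(instruction.lower().split())

    def score(element):
        return len(words & set(element.label.lower().split()))

    best = max(elements, key=score)
    if score(best) == 0:
        return None  # nothing on screen matches the instruction
    x_min, y_min, x_max, y_max = best.box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Hypothetical screen contents, as an object detector might report them.
screen = [
    Element("File menu", (0, 0, 40, 20)),
    Element("Save button", (100, 50, 160, 80)),
    Element("Search box", (200, 0, 400, 20)),
]
print(ground_instruction("click the save button", screen))  # (130.0, 65.0)
```

Returning the centre of the bounding box is one plausible choice of "where the instruction would be executed"; the abstract does not say which point inside the element IVG reports.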
