OmniACT Dataset: Multimodal Automation Benchmark
- The OmniACT dataset is a multimodal benchmark combining screenshots, natural language instructions, and Python scripts to automate diverse desktop and web tasks.
- It leverages detailed action annotations and evaluation metrics, bridging natural language understanding, computer vision, and program synthesis.
- Comprising 9,802 task examples from various OS and web contexts, OmniACT supports robust assessment of autonomous agent performance.
OmniACT is a multimodal dataset and benchmark designed to advance research on generalist autonomous agents capable of automating tasks across both desktop and web environments. Distinguished by its breadth of application domains, extensive multimodality, and precise ground-truth action annotations, OmniACT facilitates the development and evaluation of agents that bridge natural language understanding, computer vision, and program synthesis for real-world computer automation (Kapoor et al., 2024).
1. Dataset Scope and Composition
OmniACT comprises 9,802 task examples distributed into train (6,789), validation (992), and test (2,021) splits. The dataset covers a wide operational spectrum, with a desktop-to-web task ratio of approximately 3.5:1. Desktop contexts account for 7,639 examples, subdivided into macOS (4,258), Windows (2,247), and Linux (1,134), while 2,163 tasks are sourced from diverse web pages. Each example models a task grounded in a specific screen, ranging from single-step commands like "Play the next song" to multi-step workflows such as "Send an email to John Doe mentioning the time and place to meet." Task categories, determined by user intent, include Shopping, Entertainment, Service, Government, Travel, and Health.
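As a quick consistency check, the counts quoted above can be summed and the desktop-to-web ratio recomputed. The snippet only restates numbers from the text; the variable names are illustrative.

```python
# Counts as reported in the text above.
splits = {"train": 6789, "validation": 992, "test": 2021}
desktop = {"macOS": 4258, "Windows": 2247, "Linux": 1134}
web_total = 2163

total = sum(splits.values())
desktop_total = sum(desktop.values())

# Both partitions (by split and by platform) should cover every example.
assert total == 9802
assert desktop_total + web_total == total

desktop_to_web = desktop_total / web_total
print(f"desktop:web = {desktop_to_web:.2f}:1")
for name, count in splits.items():
    print(f"{name}: {count / total:.1%}")
```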
2. Modalities and Annotations
Each sample in OmniACT encodes three critical modalities:
- Screen Images: Full-screen PNG screenshots of native desktop applications or web pages are captured losslessly, with variations dictated by operating system or browser renderings.
- Natural Language Instructions: Tasks are specified via human-authored, unambiguous instructions, with three paraphrased variants per task to promote linguistic diversity and discourage lexical overfitting in learning systems.
- Script Annotations: Ground-truth action sequences are expressed as executable Python code invoking PyAutoGUI primitives (e.g., click, rightClick, doubleClick, moveTo, dragTo, scroll, hscroll, write, press, hotkey). Action sequences include precise numeric coordinates and command parameters.
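To make the script modality concrete, the sketch below parses a hypothetical annotation into an ordered list of action names and arguments using Python's `ast` module, without executing anything. The script text and coordinates are invented for illustration and are not taken from the dataset.

```python
import ast

# Hypothetical ground-truth script; coordinates and text are illustrative only.
EXAMPLE_SCRIPT = """
pyautogui.click(412.5, 88.0)
pyautogui.write("John Doe")
pyautogui.press("enter")
pyautogui.hotkey("ctrl", "s")
"""

def parse_actions(script: str) -> list[tuple[str, list]]:
    """Return (action_name, positional_args) for each pyautogui call, in order."""
    actions = []
    for stmt in ast.parse(script).body:
        # Only consider statements of the form pyautogui.<name>(...).
        if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
            continue
        func = stmt.value.func
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "pyautogui"):
            # literal_eval accepts AST nodes and only evaluates literals.
            args = [ast.literal_eval(arg) for arg in stmt.value.args]
            actions.append((func.attr, args))
    return actions

actions = parse_actions(EXAMPLE_SCRIPT)
action_types = [name for name, _ in actions]
```

The resulting `action_types` list is the kind of sequence that a type-level comparison between predicted and gold scripts would operate on.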
The annotation pipeline has six stages: (1) screen segmentation, using a manual PyQt5 tool for desktop screens and JavaScript DOM traversal for web pages; (2) functional labeling of each element by two annotators on Amazon Mechanical Turk, with concise labels of at most five words; (3) task creation and script authoring by college students proficient in Python; (4) reverse mapping and filtering, which converts functional labels to coordinates; (5) syntactic verification through automated execution; and (6) a final human review for quality assurance.
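The reverse-mapping stage can be pictured as a substitution from functional labels to screen coordinates. The sketch below is an illustration under stated assumptions, not the authors' tool: it assumes scripts are first written with a quoted label as the argument of a pointer action, that boxes are stored as (left, top, right, bottom), and that the box center is used as the target point. All names and data are hypothetical.

```python
import re

# Hypothetical labeled boxes: label -> (left, top, right, bottom) in pixels.
BOXES = {
    "next song": (600, 40, 640, 80),
    "search bar": (100, 10, 500, 50),
}

# Hypothetical label-based script as an annotator might write it.
LABELED_SCRIPT = (
    'pyautogui.click("next song")\n'
    'pyautogui.click("search bar")\n'
    'pyautogui.write("jazz")'
)

POINTER_ACTIONS = ("click", "rightClick", "doubleClick", "moveTo", "dragTo")
PATTERN = re.compile(r'pyautogui\.(%s)\("([^"]+)"\)' % "|".join(POINTER_ACTIONS))

def labels_to_coordinates(script, boxes):
    """Replace each label argument of a pointer action with the box center."""
    def substitute(match):
        action, label = match.group(1), match.group(2)
        if label not in boxes:
            raise KeyError(f"no bounding box for label {label!r}")
        left, top, right, bottom = boxes[label]
        x, y = (left + right) / 2, (top + bottom) / 2
        return f"pyautogui.{action}({x}, {y})"
    return PATTERN.sub(substitute, script)

resolved = labels_to_coordinates(LABELED_SCRIPT, BOXES)
print(resolved)
```

Keyboard actions such as `write` are left untouched because their string arguments are text to type, not element labels.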
3. Formal Task Definition and Scripting Environment
OmniACT’s principal challenge is defined as follows: given a screen image $S \in \{\text{macOS}, \text{Windows}, \text{Linux}, \text{Webpage}\}$ and a corresponding natural language instruction $T$, an agent must generate a PyAutoGUI action sequence $A$ such that $A$, when executed, successfully completes the task implied by $T$ on $S$. The action space includes mouse and keyboard primitives: click, rightClick, doubleClick, moveTo, dragTo, scroll, hscroll, write, press, hotkey. All provided scripts are limited to documented PyAutoGUI API functions (no specialized wrappers are used), enabling research in generalizable automation across arbitrary web and desktop applications.
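One practical consequence of a closed action space is that a generated script can be checked statically before it is executed or scored. The following is a minimal sketch of such a check; the function name and the policy of flagging every non-listed call are assumptions made for illustration.

```python
import ast

# The ten primitives named in the task definition.
ALLOWED_ACTIONS = {
    "click", "rightClick", "doubleClick", "moveTo", "dragTo",
    "scroll", "hscroll", "write", "press", "hotkey",
}

def disallowed_calls(script: str) -> list[str]:
    """List every call in `script` that falls outside the allowed action space."""
    violations = []
    for node in ast.walk(ast.parse(script)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_pyautogui = (isinstance(func, ast.Attribute)
                        and isinstance(func.value, ast.Name)
                        and func.value.id == "pyautogui")
        # Flag calls on other modules as well as unlisted pyautogui functions.
        if not is_pyautogui or func.attr not in ALLOWED_ACTIONS:
            violations.append(ast.unparse(func))
    return violations

ok = disallowed_calls('pyautogui.click(10, 20)\npyautogui.press("enter")')
bad = disallowed_calls('pyautogui.screenshot()\nos.system("ls")')
```

Because the check works on the syntax tree, it never imports PyAutoGUI or touches the screen, so it can run in a headless evaluation environment.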
4. Benchmarking Metrics and Baseline Performance
OmniACT introduces three principal metrics for the benchmarking of autonomous agents:
- Task Success Rate: Fraction of tasks for which the predicted script executes the intended task.
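- Sequence Score: For sample $i$ with gold sequence length $s$, $\text{SeqScore}_i = \beta_1 + \beta_2 (s - 1)$ (with $\beta_1 = 0.1$, $\beta_2 = 1$) if predicted and gold action types match exactly; $0$ otherwise.
- Action Score: Continuous measure integrating coordinate, keypress, and string-match precision, computed as $\text{ActionScore} = \frac{\sum_i \max\left(\text{SeqScore}_i - \sum_j (M_i^j + K_i^j + W_i^j),\, 0\right)}{\sum_i \text{SeqScore}_i}$, where penalties $M_i^j$, $K_i^j$, and $W_i^j$ (for action $j$ of sample $i$) reflect bounding-box miss distance, keypress mismatch, and BLEU-based string comparison, respectively.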
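The two score definitions can be sketched in a few lines. This is an illustration, not the released evaluation script, and it assumes the formulation reported by Kapoor et al. (2024): a per-sample sequence score of β1 + β2·(s − 1) with β1 = 0.1 and β2 = 1 on an exact match of action types, and an action score that subtracts each sample's total penalty from its sequence score, clips at zero, and normalizes by the summed sequence scores. The per-action penalty weights are omitted; `box_miss_distance` only shows the geometric quantity a click penalty would be based on.

```python
import math

BETA1, BETA2 = 0.1, 1.0  # constants assumed from the paper's formulation

def sequence_score(pred_types, gold_types):
    """beta1 + beta2 * (s - 1) if the action types match exactly, else 0."""
    if list(pred_types) != list(gold_types):
        return 0.0
    return BETA1 + BETA2 * (len(gold_types) - 1)

def box_miss_distance(point, box):
    """Smallest Euclidean distance from point to box (left, top, right, bottom).

    Returns 0.0 when the point lies inside the box.
    """
    x, y = point
    left, top, right, bottom = box
    dx = max(left - x, 0.0, x - right)
    dy = max(top - y, 0.0, y - bottom)
    return math.hypot(dx, dy)

def action_score(samples):
    """samples: iterable of (seq_score, total_penalty) pairs, one per task."""
    samples = list(samples)
    denom = sum(seq for seq, _ in samples)
    if denom == 0:
        return 0.0
    return sum(max(seq - pen, 0.0) for seq, pen in samples) / denom

gold = ["click", "write", "press"]
s_match = sequence_score(["click", "write", "press"], gold)
s_miss = sequence_score(["click", "press"], gold)
inside = box_miss_distance((5, 5), (3, 4, 10, 10))
outside = box_miss_distance((0, 0), (3, 4, 10, 10))
overall = action_score([(2.1, 0.0), (1.1, 0.55), (0.0, 0.0)])
```

Under these definitions a script with the wrong action types contributes nothing to either score, while a script with the right types but imprecise coordinates or text loses only part of its credit.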
Baseline evaluations reveal substantial gaps between current LLMs and human-level performance. On the test set, the strongest text-only LLM, GPT-4, achieves a Sequence Score of 32.75 and an Action Score of 11.60; the latter is roughly 15% of the human Action Score (human baseline: Sequence Score 82.23, Action Score 80.14). GPT-4V (vision preview) attains better performance (Sequence Score 39.43, Action Score 20.76 on a 500-sample subset), but still falls notably short of human proficiency. A representative comparison is shown in the table below:
| Model | Sequence Score | Action Score |
|---|---|---|
| LLaMA-7B | 4.12 | 0.48 |
| Vicuna-13B | 5.44 | 1.78 |
| CodeLLaMA-34B | 10.09 | 3.72 |
| GPT-3.5-turbo | 22.85 | 7.89 |
| GPT-4 | 32.75 | 11.60 |
| LLaVA-v1.5-13B | 20.56 | 8.19 |
| GPT-4V (500 sample) | 39.43 | 20.76 |
| Human | 82.23 | 80.14 |
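The human-relative figures can be recomputed directly from the table. The snippet below restates the tabulated scores and expresses each model's scores as a percentage of the human row. Since the GPT-4V row is reported on a 500-sample subset, the cross-row comparison is indicative rather than exact.

```python
# (Sequence Score, Action Score) as listed in the table above.
SCORES = {
    "LLaMA-7B": (4.12, 0.48),
    "Vicuna-13B": (5.44, 1.78),
    "CodeLLaMA-34B": (10.09, 3.72),
    "GPT-3.5-turbo": (22.85, 7.89),
    "GPT-4": (32.75, 11.60),
    "LLaVA-v1.5-13B": (20.56, 8.19),
    "GPT-4V (500 sample)": (39.43, 20.76),
    "Human": (82.23, 80.14),
}

human_seq, human_act = SCORES["Human"]

# Percentage of the human score, per metric, for every non-human row.
relative = {
    model: (100 * seq / human_seq, 100 * act / human_act)
    for model, (seq, act) in SCORES.items()
    if model != "Human"
}

for model, (seq_pct, act_pct) in relative.items():
    print(f"{model:<22} sequence {seq_pct:5.1f}%   action {act_pct:5.1f}%")
```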
5. Prompting Strategies and Baseline Architectures
Baseline agent experiments span both prompt-only and fine-tuned LLMs, as well as vision-language architectures. Prompt templates include the full PyAutoGUI API, five in-context examples retrieved by MiniLM embedding similarity, and a DetACT-filtered list of UI elements (from OCR, icon, and color pipelines). Fine-tuning experiments utilize QLoRA on LLaMA-13B and Vicuna-13B at rank 64 and α=16 for 300 steps. Multimodal models—LLaVA-v1.5 (7B/13B) and GPT-4-vision-preview—are evaluated on a 500-sample multimodal subset. All baseline scripts are designed to maximize generality and transparency; no privileged API access or application-specific wrappers are employed.
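The retrieval-augmented prompt described above can be sketched as follows. This is an illustration under stated assumptions: toy two-dimensional vectors stand in for MiniLM embeddings, the prompt layout is hypothetical, and `top_k_examples` and `build_prompt` are names introduced here rather than functions from the release.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_examples(query_vec, pool, k=5):
    """pool: list of (instruction, script, embedding); returns the k most similar."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:k]

def build_prompt(api_doc, examples, ui_elements, instruction):
    """Assemble the prompt parts named in the text; the layout is hypothetical."""
    shots = "\n\n".join(f"Task: {ins}\nScript:\n{scr}" for ins, scr, _ in examples)
    elements = "\n".join(f"- {element}" for element in ui_elements)
    return (f"{api_doc}\n\nExamples:\n{shots}\n\n"
            f"UI elements:\n{elements}\n\nTask: {instruction}\nScript:\n")

# Toy 2-D embeddings stand in for real sentence embeddings.
pool = [
    ("Play the next song", "pyautogui.click(620, 60)", [1.0, 0.0]),
    ("Pause the music", 'pyautogui.press("space")', [0.9, 0.1]),
    ("Open settings", 'pyautogui.hotkey("ctrl", ",")', [0.0, 1.0]),
]
nearest = top_k_examples([1.0, 0.0], pool, k=2)
prompt = build_prompt("<PyAutoGUI API reference>", nearest,
                      ["next song", "pause"], "Skip this track")
```

In the setup described in the text, `k` would be 5 and the UI element list would come from the DetACT filtering step.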
6. Technical Challenges and Research Directions
The primary obstacles identified are: visual grounding of UI elements in screenshots that lack DOM metadata; long-horizon plan generation (reliably sequencing 5 to 10 or more actions); fine-grained coordinate selection, where a near miss must be distinguished from a precise hit; and script debugging and recovery when the interface changes unpredictably.
Several research avenues are proposed: fine-grained UI element detectors that fuse object detection, OCR, and template matching; reinforcement learning or self-play regimes for robust script induction and error correction; end-to-end multimodal pretraining that tightly couples language, vision, and interaction; and curriculum learning that progresses from simple to complex tasks, potentially via UI semantics graph representations.
7. Availability, Tools, and Usage
OmniACT is publicly released on GitHub under a permissive research license. Each dataset instance consists of image.png (full-screen screenshot), task.txt (instruction and ground-truth script), and box.json (bounding-box metadata for evaluation). The release includes open-source tools:
- DetACT module: OCR, SAM segmentation, icon-template matching, color extraction, and LLM-based element filtering pipelines.
- Evaluation scripts: Sequence Score and Action Score computation.
- Baseline scripts: Example PyAutoGUI code generation and inference.
- Demonstration environment: Script playback on mock screen contexts, requiring no privileged application APIs.
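The per-instance layout described above (image.png, task.txt, box.json) suggests a simple loader. The sketch below is hypothetical: the internal format of task.txt (a "Task:" line followed by an "Output Script:" section) and the shape of box.json (label mapped to a coordinate list) are assumptions made for illustration and should be checked against the actual release.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Instance:
    image_path: Path
    instruction: str
    script: str
    boxes: dict

def load_instance(folder):
    """Load one instance, ASSUMING the hypothetical file formats noted above."""
    folder = Path(folder)
    task_text = (folder / "task.txt").read_text(encoding="utf-8")
    # Assumed layout: "Task: <instruction>", then "Output Script:", then the script.
    header, _, script = task_text.partition("Output Script:")
    instruction = header.replace("Task:", "", 1).strip()
    boxes = json.loads((folder / "box.json").read_text(encoding="utf-8"))
    return Instance(folder / "image.png", instruction, script.strip(), boxes)

# Demo on a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "task.txt").write_text(
        "Task: Play the next song\nOutput Script:\npyautogui.click(620, 60)\n",
        encoding="utf-8",
    )
    (root / "box.json").write_text(
        json.dumps({"next song": [600, 40, 640, 80]}), encoding="utf-8"
    )
    instance = load_instance(root)
```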
This integrated release is structured to enable reproducibility, swift benchmarking, and cumulative progress within the research community focused on multimodal agent automation and program synthesis (Kapoor et al., 2024).