
Monkey Vision-Language Model (VLM)

Updated 17 December 2025
  • Monkey Vision-Language Model (VLM) is a visual language model that processes instruction commands and UI screenshot trajectories to control mobile devices.
  • It uses a decoder-only transformer architecture, combining natural language, visual embeddings, and action tokens, exemplified by the Llama+ViT and UI-VLM (Qwen-VL) variants.
  • Empirical evaluations on the AitW dataset demonstrate state-of-the-art performance, with UI-VLM (Qwen-VL) achieving a 78.9% partial-action match.

A Monkey Vision-Language Model (VLM) is a vision-language model specifically designed to execute instruction-based control of mobile devices by interacting solely with their user interfaces (UIs). The agent leverages recent advances in large language models and transformer-based multimodal learning to translate natural-language instructions and visual observations of device screens into sequences of human-like actions, including gestures such as tapping and swiping. Unlike prior agents limited to single-frame or explicitly structured UI data, the Monkey-VLM operates on vision–language "sentences" encoding trajectories of screenshots and executed actions. This architecture enables device-agnostic, general-purpose digital assistance across arbitrary applications by continuously observing and interacting with visual UI elements (Dorka et al., 12 Apr 2024).

1. Architectural Design

The Monkey-VLM is implemented as a decoder-only transformer that processes three input modalities: a natural-language instruction, a time series of past UI screenshots, and a corresponding history of past actions (converted to textual form). Both visual (image) and textual (instruction, actions) modalities are embedded into a common $d$-dimensional space and concatenated to form a unified token sequence, which the transformer attends to with standard autoregressive (causal) masking.

Two variants are detailed:

  • Llama+ViT: Utilizes a pretrained ViT-Base vision encoder (320M parameters, $384 \times 384$ input), projecting global [CLS] embeddings into the Llama-2-7B language backbone ($d \approx 4096$).
  • UI-VLM (Qwen-VL): Uses an OpenCLIP ViT-bigG encoder (input $448 \times 448$, patch stride 14); outputs are condensed via a single-layer cross-attention module into 256 tokens of size $d$, then passed to the Qwen-7B decoder-only backbone ($d \approx 4096$). A minimal sketch of such a resampler follows this list.
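The single-layer cross-attention compression can be pictured as a learned-query resampler: a fixed set of query vectors attends over the ViT patch tokens and emits a fixed-length sequence. The sketch below is only an illustrative assumption (dimensions, head count, and module names are not taken from the paper or from Qwen-VL's released code):

```python
import torch
import torch.nn as nn

class CrossAttentionResampler(nn.Module):
    """Condenses a variable number of ViT patch tokens into a fixed set of
    256 query tokens via a single cross-attention layer (illustrative sketch)."""

    def __init__(self, vit_dim: int = 1664, d_model: int = 4096,
                 n_queries: int = 256, n_heads: int = 16):
        super().__init__()
        # 256 learned query vectors in the language-model dimension d_model
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.kv_proj = nn.Linear(vit_dim, d_model)  # project ViT features to d_model
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, n_patches, vit_dim); a 448x448 input with stride-14
        # patches yields roughly 10^3 patch tokens
        kv = self.kv_proj(patch_tokens)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # (B, 256, d_model) fixed-length output
        return out
```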

The multimodal fusion strategy closely follows approaches such as BLIP-2 and PaLM-E, inserting visual tokens between and among text tokens and maintaining a consistent interface for sequence modeling regardless of input source. The input context at generation step $t$ comprises the encoded instruction, the ordered screenshot embeddings, and the action tokens:

$$Z = \bigl[z_{\ell},\, z_{v,1},\, a_1,\, z_{v,2},\, a_2,\, \dots,\, z_{v,T},\, a_T\bigr] \in \mathbb{R}^{L \times d}$$

At inference, the next action text $a_{T+1}$ is produced autoregressively.
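As a rough illustration of the sequence construction, the sketch below embeds the instruction, per-step screenshot features, and per-step action texts, and concatenates them into one sequence $Z$. The helper names, shapes, and the 768-dimensional [CLS] feature (ViT-Base style) are assumptions for exposition, not the paper's code:

```python
import torch
import torch.nn as nn

class UnifiedSequenceBuilder(nn.Module):
    """Builds the token sequence Z = [z_ell, z_v1, a_1, ..., z_vT, a_T] (sketch)."""

    def __init__(self, vocab_size: int, vit_dim: int = 768, d_model: int = 4096):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)  # instruction and action tokens
        self.vision_proj = nn.Linear(vit_dim, d_model)        # screenshot features -> d_model

    def forward(self, instr_ids, screenshot_feats, action_ids_per_step):
        # instr_ids: (L_instr,) token ids of the instruction
        # screenshot_feats: list of T tensors, each (n_img_tokens, vit_dim)
        # action_ids_per_step: list of T tensors of past-action token ids
        parts = [self.text_embed(instr_ids)]                  # z_ell
        for img, act in zip(screenshot_feats, action_ids_per_step):
            parts.append(self.vision_proj(img))               # z_{v,t}
            parts.append(self.text_embed(act))                # a_t
        return torch.cat(parts, dim=0)                        # (L, d_model)

# The decoder then applies standard causal masking over this sequence and
# generates the next action text a_{T+1} token by token.
```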

2. Input Modalities and Action Encoding

Model inputs are sequences that represent the agent’s visual and behavioral history:

  • Screenshot Embedding:
    • Llama+ViT employs a single global [CLS] vector per screenshot, projected to $\mathbb{R}^d$,
    • UI-VLM compresses $\approx 10^3$ ViT patch tokens to a fixed-length (256) sequence via cross-attention.
  • Vision–Language Trajectories: Training and inference operate on entire instruction–trajectory–action chains, as in paragraph-level language modeling.
  • Action Linearization: The action space is designed around six types—dual-point gestures (tap/swipe), text entry, and special keys (back, home, enter), plus task_complete/impossible status. These are linearized as short English descriptions encoding discrete (0–99) coordinates. Example commands include “tap at 7 90”, “swipe from 3 44 to 40 48”, “Input text "hello"”, and “press home”.

This representation facilitates generalization across the full range of Android UI schemas and app logic, as actions are expressed in a textual, device-independent form.
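As a concrete illustration of this linearization, the sketch below encodes gestures into the discretized command strings quoted above and parses generated commands back into structured actions. The wording of the strings mirrors the examples; the coordinate ordering and function names are assumptions, not the paper's implementation:

```python
def discretize(x: float, y: float) -> tuple[int, int]:
    """Map normalized screen coordinates in [0, 1] to the 0-99 integer grid."""
    return min(int(x * 100), 99), min(int(y * 100), 99)

def encode_tap(x: float, y: float) -> str:
    a, b = discretize(x, y)
    return f"tap at {a} {b}"                        # e.g. "tap at 7 90"

def encode_swipe(x0: float, y0: float, x1: float, y1: float) -> str:
    a0, b0 = discretize(x0, y0)
    a1, b1 = discretize(x1, y1)
    return f"swipe from {a0} {b0} to {a1} {b1}"     # e.g. "swipe from 3 44 to 40 48"

def decode_action(text: str) -> dict:
    """Parse a generated command string back into a structured action."""
    tokens = text.strip().split()
    if tokens[:2] == ["tap", "at"]:
        return {"type": "tap", "point": (int(tokens[2]), int(tokens[3]))}
    if tokens[:2] == ["swipe", "from"]:
        return {"type": "swipe",
                "start": (int(tokens[2]), int(tokens[3])),
                "end": (int(tokens[5]), int(tokens[6]))}
    if tokens[0] == "press":                         # back / home / enter
        return {"type": "key", "key": tokens[1]}
    if tokens[0].lower() == "input":                 # Input text "hello"
        return {"type": "text", "value": text.split('"')[1]}
    return {"type": "status", "value": text}         # task_complete / impossible
```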

3. Training Regimen and Optimization Scheme

Training proceeds via next-token cross-entropy over the multimodal concatenated trajectory. For action tokens $a_{t,1},\ldots,a_{t,K_t}$ at time step $t$, the objective is

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{K_t} \log p\bigl(a_{t,k} \mid I_{1:t},\, a_{1:t-1},\, a_{t,1:k-1}\bigr)$$

Tokens from the instruction and projected images are omitted from the loss calculation.
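A common way to realize this masking is to assign the ignore label -100 to every instruction and image position so that cross-entropy is computed only over action tokens. The sketch below assumes that convention (standard in Hugging Face-style training loops) and is not the authors' code:

```python
import torch
import torch.nn.functional as F

def action_only_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the unified trajectory sequence.

    logits: (B, L, V) decoder outputs over the concatenated sequence
    labels: (B, L)    action-token ids; -100 at instruction/image positions
    """
    # Causal shift: position i predicts the token at position i + 1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # instruction and projected-image tokens are ignored
    )
```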

Fine-tuning is performed with:

  • AdamW optimizer (learning rate $3 \times 10^{-4}$),
  • Batch size 128, up to 5 epochs,
  • LoRA adapters (rank 32, $\alpha = 64$, dropout 0.05) on all attention weights in the LLM (see the sketch after this list),
  • Frozen vision encoder and token embeddings; only the projection layer and LoRA adapters are updated.
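One way this adapter setup could be expressed is with the Hugging Face peft library, as sketched below. The target module names assume a Llama-style attention block and the checkpoint name is only a placeholder; neither is taken from the paper:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint for the language backbone (Llama-2-7B scale).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=32,                  # LoRA rank 32
    lora_alpha=64,         # alpha = 64
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention weights
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# get_peft_model freezes the base weights; in the setup described above, the
# vision encoder and token embeddings stay frozen and only the LoRA adapters
# plus the vision-to-text projection (not shown here) remain trainable.
model.print_trainable_parameters()
```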

Training utilizes the Android in the Wild (AitW) dataset, comprising 715,000 episodes over 30,000 unique instructions across five subsets (GoogleApps, Install, WebShopping, General, Single), each divided 80/10/10 into train/validation/test splits, with GoogleApps downsampled to 10% to mitigate dataset imbalance. No external data augmentation is employed.

4. Empirical Results and Comparative Analysis

The primary evaluation metric is the partial-action match: the fraction of actions per episode correctly predicted, a measure correlated with end-to-end task completion as determined by human raters.
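A simplified per-episode computation is sketched below. Exact string comparison per step is an assumption for illustration; the benchmark's actual matcher is more permissive (for example, taps near the reference location can count as correct):

```python
def partial_action_match(pred_actions: list[str], gold_actions: list[str]) -> float:
    """Fraction of steps in one episode whose predicted action matches the reference
    (simplified sketch using exact string matching)."""
    if not gold_actions:
        return 0.0
    correct = sum(p == g for p, g in zip(pred_actions, gold_actions))
    return correct / len(gold_actions)

# Episode-level scores are averaged over the test split to obtain the partial
# match percentages reported in the table below.
```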

Test set results (partial match %) on AitW:

Model               Overall   GoogleApps   Install   WebShopping   General   Single
GPT4-V              53.0      49.2         46.1      48.2          43.0      78.3
SeeClick            59.8      57.7         64.5      57.3          56.0      63.6
MobileAgent         66.9      64.0         75.0      63.6          55.8      76.3
Auto-UI*            74.3      71.4         76.9      70.3          68.2      84.6
CogAgent            76.9      74.9         78.9      71.7          65.3      93.5
Llama+ViT           73.7      70.8         77.1      68.2          69.3      83.1
UI-VLM (Qwen-VL)    78.9      72.4         83.3      74.2          74.7      89.9

*Auto-UI limits swipe gestures to four cardinal directions; these results are not directly comparable.

The UI-VLM (Qwen-VL) achieves a new state-of-the-art overall partial match of 78.9%. Both architectural variants benefit from temporal screenshot histories and discretized, textual action encoding. Pretraining on OCR and vision–language grounding tasks further boosts UI-VLM over the Llama+ViT baseline (Dorka et al., 12 Apr 2024).

5. Generalization, Limitations, and Prospects

The UI-only, device-agnostic formulation permits the Monkey-VLM to function across all apps and web interfaces without modification. By processing a sequence of screen observations, the model partially addresses limitations imposed by single-frame-based policies in scenarios with partial observability or temporal dependencies.

However, several limitations and failure modes are noted:

  • UI elements absent from AitW (e.g., custom gestures) can lead to model errors.
  • Failures in OCR or visual encoding, particularly under challenging conditions (low contrast, small fonts), can result in incorrect button or element selection.
  • The action representation does not directly exploit explicit GUI element hierarchies, potentially ceding accuracy to agents leveraging explicit object or DOM structure (such as SeeClick or MobileAgent).

Future research avenues proposed include scaling to even larger VLM backbones (e.g., Gemini, GPT-4V), incorporating explicit GUI-element grounding, and leveraging online (possibly human-in-the-loop) feedback for continual adaptation to novel interface layouts. Incorporation of DOM or web hierarchy information is suggested as a path to alleviate certain classes of errors (Dorka et al., 12 Apr 2024).

6. Significance in the Context of Vision-Language Embodied Agents

The Monkey-VLM formulates UI control as an autoregressive, multimodal sequence generation task, unifying visual observation and action selection via language modeling techniques. This general-purpose framework, capable of learning from and generalizing across vast, weakly structured UI environments, establishes a new experimental and practical benchmark for vision-language models in device control. The approach substantiates the capacity of modern LLMs, when augmented with vision and trajectory history, to achieve state-of-the-art adaptation to the complexity of mobile ecosystems without reliance on app-specific instrumentation (Dorka et al., 12 Apr 2024).

References (1)

  • Dorka et al., 12 Apr 2024.
