SeeClick: Visual Grounding for GUI Automation
- SeeClick is a visual grounding approach that maps instructions to UI element locations directly on screenshots, enabling precise click actions.
- It replaces traditional eye-tracking with click-based attention mapping, achieving high correlation with human fixation patterns.
- Neural models implementing SeeClick significantly improve GUI task success rates across mobile, desktop, and web environments.
SeeClick refers to a class of visual grounding techniques and systems designed for locating and interacting with UI elements on graphical user interfaces (GUIs) using screenshots as the only input modality. It encompasses both experimental paradigms for measuring human attention via mouse clicks (as in crowdsourcing or behavioral studies) and, more recently, large-scale neural models for end-to-end GUI grounding in autonomous agents. The SeeClick paradigm is now most prominently associated with a family of neural models introduced in the context of GUI automation and embodied visual agents, where it enables precise action localization without reliance on structured environment metadata such as XML or HTML (Cheng et al., 2024). This entry covers both roots of the SeeClick approach—from its behavioral/crowdsourcing origins (e.g., BubbleView (Kim et al., 2017)) to its instantiation as a modern neural architecture that advances the state-of-the-art in multimodal GUI agents (Cheng et al., 2024).
1. Foundations: See-Click as a Behavioral Paradigm
The original “See-Click” interface paradigm was introduced to approximate human eye fixations using discrete mouse clicks within a blurred “moving window” interface, as operationalized by BubbleView (Kim et al., 2017). In this protocol, participants view images (visualizations, scenes, web pages) blurred by a Gaussian kernel (σ_{\text{blur}}), and can click to reveal local unblurred “bubbles” (radius r_{\text{bubble}}), mimicking the effect of foveal vision. The system logs each click, generating a spatial “importance map” that quantifies which regions users voluntarily inspected.
Key parameters include:
- σ_{\text{blur}} (30–50 px for 500–1000 px images), ensuring that text and details outside the bubble cannot be read.
- r_{\text{bubble}} (24–40 px), balancing sampling precision and speed.
- Click logging, with each (x_i, y_i, t_i) recorded for later analysis.
Data from See-Click experiments can be convolved to yield smooth maps comparable to eye-tracking fixation maps. Quantitative metrics such as Pearson’s correlation coefficient and normalized scanpath saliency (NSS) measure the degree to which click patterns approximate true visual attention. On information-rich tasks (info visualizations, web pages), BubbleView/See-Click maps explain 80–90% of fixation variance with ≈12–15 participants per image; element-wise importance correlations range from r ≈ 0.96 (charts) to r ≈ 0.66 (graphic design) (Kim et al., 2017).
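The click-to-map pipeline above can be sketched in a few lines of NumPy. The separable Gaussian kernel and max-normalization here are illustrative choices, not BubbleView's exact processing:

```python
import numpy as np

def _gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def importance_map(clicks, height, width, sigma=30.0):
    """Accumulate (x, y) clicks into a 2-D histogram, then smooth it with a
    separable Gaussian to obtain a map comparable to a fixation map."""
    m = np.zeros((height, width), dtype=float)
    for x, y in clicks:
        m[int(y), int(x)] += 1.0
    k = _gaussian_kernel(sigma)
    # Separable convolution: smooth rows, then columns.
    m = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, m)
    m = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, m)
    return m / m.max() if m.max() > 0 else m

def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened maps."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
```

Comparing a click-derived map against an eye-tracking fixation map with `pearson_cc` mirrors the CC evaluation described above.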
2. Neural SeeClick: Model Architecture and Inference
SeeClick, as formalized in recent neural GUI agent literature (Cheng et al., 2024), is a vision-language model designed for GUI grounding—the task of mapping from (screenshot, natural language instruction) pairs to actionable locations (points or bounding-boxes) within the screen. The canonical SeeClick implementation is built atop the Qwen-VL vision-language foundation (ViT visual encoder, transformer LLM, ≈9.6B parameters), with cross-attention connectors and LoRA fine-tuning.
The model receives:
- s: a screenshot (e.g., 448×448 RGB)
- x: a natural-language instruction (e.g., "Click the Gmail icon")
- (optionally) a history of the last k=4 actions
It autoregressively generates the next action, either click(x, y) (normalized to [0,1]²) or UI action tokens (type, select, swipe, etc.) (Cheng et al., 2024). Unlike prompt-based MLLMs (e.g., GPT-4V), SeeClick is explicitly trained for spatial grounding with dense supervision.
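Because the action is emitted as text, the agent must parse it back into pixel coordinates before dispatching a click. A minimal sketch, assuming a hypothetical `click(x, y)` output string with coordinates normalized to [0,1]² (the exact serialization is not specified here):

```python
import re

# Hypothetical serialization: the model emits e.g. "click(0.49, 0.40)".
ACTION_RE = re.compile(r"click\(\s*([01]?\.\d+|[01])\s*,\s*([01]?\.\d+|[01])\s*\)")

def parse_click(action_text, screen_w, screen_h):
    """Parse a normalized click action and map it to pixel coordinates.
    Returns None for non-click actions (type, select, swipe, ...)."""
    m = ACTION_RE.search(action_text)
    if m is None:
        return None
    x, y = float(m.group(1)), float(m.group(2))
    return round(x * screen_w), round(y * screen_h)
```

For a 448×448 input, `parse_click("click(0.49, 0.40)", 448, 448)` yields the pixel location to tap.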
A plausible alternative architecture, as inferred from related benchmarks and ClickAgent’s integration, involves a two-stage object detector: a CNN or transformer backbone producing feature maps, followed by region proposal and prediction heads for bounding-box and class logits. While specific layers are not published for SeeClick-9.6B, similar models employ a deep CNN backbone (e.g., ResNet-50/FPN) with multi-head classification and box regression.
3. GUI Grounding: Definition, Data, and Training
GUI grounding in SeeClick is defined as learning the conditional p(y|s,x), where y is a point (x, y) or bounding-box (l, t, r, b) in normalized coordinates—the spatial answer to an instruction. Training uses next-token cross-entropy over the sequence encoding the action, with floats rendered as plain text tokens.
The core pre-training set (≈1M samples) covers:
- Web grounding: text2point, text2bbox, point2text, bbox2text pairs from HTML-rendered screenshots, using bounding-boxes and element text from the DOM (≈433K samples).
- Mobile UI: widget captioning and auto-extracted UI element boxes from datasets such as RICO, plus screen summarization (≈420K samples).
- General VQA: LLaVA-derived instruction-data (≈145K samples). Annotation is fully automated by scripting HTML screenshot/DOM pairs and RICO supervision.
Training settings: AdamW optimizer, initial learning rate 3e−5, batch size 64, cosine annealing, ≈10K steps (24 hours on 8×A100 GPUs). LoRA is applied to both visual and language layers (rank 8, α=16).
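As a rough illustration of how a text2point sample might be serialized for next-token training (the prompt template and two-decimal rounding are assumptions for illustration, not the paper's exact format):

```python
def grounding_sample(instruction, x, y):
    """Render a text2point training pair. The target encodes normalized
    coordinates as plain decimal text, so the standard next-token
    cross-entropy loss supervises localization with no special
    coordinate vocabulary."""
    prompt = f"In this UI screenshot, where should I click to: {instruction}?"
    target = f"click({x:.2f}, {y:.2f})"
    return prompt, target
```

The model is then trained to continue `prompt` (paired with the screenshot) with exactly the tokens of `target`, which is how floats-as-text supervision described above works in practice.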
4. Empirical Performance and Benchmarking
SeeClick establishes state-of-the-art GUI grounding accuracy on the ScreenSpot benchmark (mobile, web, desktop environments), as evaluated via ClickAcc—the fraction of examples where the predicted point falls within the ground-truth box. The table reports text-element accuracy per platform; the average column also reflects the benchmark's icon/widget splits, which are not shown:
| Model | Mobile Text | Desktop Text | Web Text | Avg. ClickAcc |
|---|---|---|---|---|
| MiniGPT-v2 (7B) | 8.4% | 5.7% | 6.5% | 5.7% |
| Qwen-VL (9.6B) | 9.5% | 5.7% | 3.5% | 5.2% |
| GPT-4V | 22.6% | 20.2% | 9.2% | 16.2% |
| CogAgent (18B) | 67.0% | 74.2% | 70.4% | 47.4% |
| SeeClick (9.6B) | 78.0% | 72.2% | 55.7% | 53.4% |
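The ClickAcc criterion is simple to compute once predictions and annotations share a coordinate frame; this sketch assumes point predictions (x, y) and boxes (l, t, r, b), both normalized:

```python
def click_acc(preds, boxes):
    """ClickAcc: fraction of predicted points (x, y) that fall inside
    their paired ground-truth boxes (l, t, r, b)."""
    hits = sum(
        1
        for (x, y), (l, t, r, b) in zip(preds, boxes)
        if l <= x <= r and t <= y <= b
    )
    return hits / len(preds)
```

A prediction anywhere inside the box counts as correct, so ClickAcc rewards usable click targets rather than exact box regression.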
On downstream agent tasks, SeeClick achieves:
- MiniWoB (vision-only, 35 tasks): 67.0% success (vs 64.6% Pix2Act, 48.4% Qwen-VL)
- AITW (Android): ClickAcc 66.4%, overall task score 59.3% (+9pt over Qwen-VL)
- Mind2Web (real website navigation, no HTML): Step SR 25.5%, +12.2pt over Qwen-VL (Cheng et al., 2024)
Advances in grounding accuracy consistently yield proportional improvements on downstream automation metrics (Cheng et al., 2024).
5. Role in Autonomous Agents and System Integration
Within agent frameworks such as ClickAgent (Hoscilowicz et al., 2024), SeeClick serves as the UI element localizer within a modular architecture: the decision/planning module (e.g., InternVL2.0) generates natural-language prompts describing target elements, and SeeClick processes the current screenshot and prompt to return bounding-boxes (or click coordinates) with confidence scores. ClickAgent accepts a prediction only when its score satisfies s ≥ 0.5; when no confident box is detected, it re-prompts the planner for clarification. This design enables robust automation in environments—especially Android and web apps—where structured metadata (HTML/XML) or OCR is unreliable or unavailable.
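The planner/localizer loop can be sketched as follows; `planner` and `locator` are hypothetical callables standing in for the planning MLLM and the grounding model, not a published API:

```python
CONF_THRESHOLD = 0.5  # accept a localization only above this score

def locate_and_act(planner, locator, screenshot, goal, max_retries=2):
    """One modular agent step: the planner describes the target element in
    natural language; the grounding model (e.g. SeeClick) returns a
    (box, score) pair. Low-confidence detections trigger a clarifying
    re-prompt; a confident box is converted to a center-point click."""
    prompt = planner(goal, screenshot)
    for _ in range(max_retries + 1):
        box, score = locator(screenshot, prompt)
        if score >= CONF_THRESHOLD:
            l, t, r, b = box
            return ((l + r) / 2, (t + b) / 2)  # click the box center
        prompt = planner(goal, screenshot, clarify=True)
    return None  # no confident localization after retries
```

Separating "what to click" (planner) from "where it is" (localizer) is what lets the framework swap in stronger grounding models without retraining the planner.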
This specialized, purely visual grounding approach handles arbitrary interface styles and visual variability, outperforming both prompt-based MLLMs and previous end-to-end models by a substantial margin (roughly a 25-percentage-point improvement in AITW task success over end-to-end MLLMs in ClickAgent's evaluation (Hoscilowicz et al., 2024)).
6. Significance, Limitations, and Extensions
SeeClick’s empirical findings demonstrate that enhanced GUI grounding directly translates to superior downstream agent performance, with monotonic improvement observed as grounding accuracy increases across checkpoints and benchmarks (Cheng et al., 2024). This supports the central thesis that robust visual grounding—rather than complex reasoning chains or explicit DOM parsing—is the primary bottleneck in visual agent control of GUIs.
Limitations remain:
- Coordinates are emitted as plain-text numerals rather than a dedicated coordinate vocabulary, which can limit localization granularity.
- Slow exploration or under-sampling of interactive elements may occur in human “see-click” data collection protocols (e.g., BubbleView), limiting data for rarer or less salient elements (Kim et al., 2017).
- Current architectures may not fully capture unconscious/pre-attentive salience as measured by high-fidelity eye tracking.
Future directions suggested in the literature include adaptive moving-window interfaces, multimodal crowd attention fusion (clicks+gaze), gamified annotation, and real-time agent guidance informed by click/attention predictions.
7. Relationship to Crowdsourcing and Attention Mapping
SeeClick’s foundational methodology is directly connected to crowdsourced visual attention estimation, with BubbleView representing a prototypical “see-click” interface (Kim et al., 2017). These methods replace continuous eye-tracking with cost-effective, scalable mouse-click mapping, yielding reliable importance maps for diverse content: info-visualizations, natural scenes, web pages, graphic design. Best practice for behavioral See-Click data collection includes 10–15 click sessions per image, r_{\text{bubble}} ≈ 1–2° of visual angle, and σ_{\text{blur}} chosen so that content outside the bubble is illegible. Such datasets, convolved into saliency/importance maps, provide both cognitive insights and machine-learning supervision for modern GUI-grounding agents.
The convergence between crowdsourced click-derived importance data and embedded agent grounding architectures illustrates the scope of SeeClick—from psychological attention mapping to its pivotal role in enabling robust, screen-centric automation across mobile, desktop, and web interfaces.
References:
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Cheng et al., 2024)
- ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents (Hoscilowicz et al., 2024)
- BubbleView: an interface for crowdsourcing image importance maps and tracking visual attention (Kim et al., 2017)