Papers
Topics
Authors
Recent
Search
2000 character limit reached

Surfer-H & Holo1: Vision-Language Web Automation

Updated 10 April 2026
  • Surfer-H and Holo1 are a GUI-centric web agent and vision-language model family that enable automation through sequential reasoning based solely on browser screenshots.
  • The architecture integrates three modules—Policy, Localizer, and Validator—each dedicated to action generation, UI element localization, and output validation.
  • The system achieves state-of-the-art accuracy with cost-effective performance, as demonstrated by high benchmark scores and Pareto-optimal cost-accuracy trade-offs.

Surfer-H is a graphical user interface (GUI)-centric web agent designed to perform user-defined tasks through sequential vision-language reasoning and interaction mechanisms. It operates without explicit access to the Document Object Model (DOM) or accessibility tree, relying exclusively on raw browser screenshots and text instructions. Core to its operation is integration with the Holo1 family of open-weight vision-LLMs (VLMs), which enable robust web navigation, complex multimodal reasoning, and efficient information extraction. Surfer-H, when powered by Holo1, achieves state-of-the-art performance across generalist UI benchmarks and the newly introduced WebClick localization benchmark, while demonstrating Pareto-optimal trade-offs between accuracy and computational cost, and supporting open-source research in the field (Andreux et al., 3 Jun 2025).

1. System Architecture and Modular Design

1.1 Surfer-H Agent Workflow

Surfer-H’s agentic process is structured as a sequential pipeline of three trainable VLM modules: Policy, Localizer, and Validator. Tasks are processed as follows:

  1. Policy (π): At each timestep tt, the policy module generates the next “thought” (chain-of-thought entry), “notes” (textual records), and “action” based on the current memory MtM_t, which consists of the task description, the last KK screenshots, prior thoughts/notes/actions, and the action sequence a1ata_1…a_t.

    • Formal mapping (Eq. 1):

    (thoughtt+1,notest+1,actiont+1)π(task,{thoughtk,notesk,actionk,screenshott3<kt  kt})(\text{thought}_{t+1}, \text{notes}_{t+1}, \text{action}_{t+1}) \sim π(\text{task}, \{\text{thought}_k, \text{notes}_k, \text{action}_k, \text{screenshot}_{t-3<k \leq t}\ |\ k \leq t\})

  2. Localizer: For actions requiring spatial localization (e.g., click “Submit”), the module predicts the precise (x,y)(x, y) coordinates of the relevant UI element over the screenshot.
  3. Execution: The selected action is executed on a headless browser (operations include click, type, scroll, navigate, answer).
  4. Validator (VV): When an “answer” action is produced, the validator module inspects the answer with reference to the task and recent screenshots, approving or rejecting it and providing rationales. Feedback is appended to MtM_t if the answer is rejected.

    • Validator mapping (Eq. 2):

    (success,explanation)V(task,answer,{screenshott3<kt})(\text{success}, \text{explanation}) \sim V(\text{task}, \text{answer}, \{\text{screenshot}_{t-3<k \leq t}\})

1.2 Holo1 Model Internals

Holo1 is a multitask VLM family, each variant rooted in Qwen 2.5-VL-Instruct checkpoints (3B or 7B parameters). Key characteristics:

  • Tokenization: Utilizes the Qwen multimodal tokenizer; a 1200×12001200 \times 1200 px screenshot yields approximately 1,280 image tokens (plus text).
  • Unified Multitask Head: A single model is multi-purposed to generate policy (thought/action), localize (predict coordinates or bounding boxes), and validate outputs (boolean plus rationale).
  • Input Format: Chat-style exchange, where system/user prompts and screenshots are passed alongside instructions, and assistant (model) responses provide multimodal outputs.

Inference flow:

  • Screenshot and agent memory are fed to Holo1 policy for thought and action generation.
  • If the action is of type CLICK(element_description), the description and screenshot are provided to the localizer role for MtM_t0 prediction.
  • On an ANSWER action, task context, answer, and screenshots are sent to the validator role for approval or further feedback.

2. Training Data and Methodology

2.1 Data Mixture and Breakdown

The consolidated Holo1 pretraining dataset totals approximately 31.5 billion tokens, structured as follows:

Dataset Group Mixture Subset Tokens (B) % of Total
GUI Grounding WebCrawl (4M pages, 89M clicks) 12.19 38.76
Screenspot-V2 (open-source UI) 3.42 10.87
WebSynthetic (calendars, tables, icons) 0.37 1.17
Subtotal 15.98 50.79
Complex Vis. Underst. Coordinate Validation (5M triplets) 2.70 8.59
UI Extraction (7M pages) 5.93 18.86
VQA (charts, docs, dashboards) 1.52 4.84
Subtotal 10.16 32.28
Behavior Learning Policy traces (WebVoyager/Extended) 4.87 15.48
Validator pairs 0.46 1.45
Subtotal 5.32 16.93
Grand Total 31.46 100.0

2.2 Dataset Composition Details

  • WebCrawl: Encompasses 4M real-world webpages, with parsed HTML. Interactive elements are paired with synthetically generated “intents” (from a frontier model) and labeled with ground-truth click coordinates.
  • WebSynthetic: Adversarial proprietary UI tasks focusing on widgets, tables, and ambiguous icons.
  • Coordinate Validation: Contains 5M triplets judged correct/incorrect via Set-of-Marks prompting.
  • UI Extraction: Screenshots mapped to full element bounding boxes and semantic labels.
  • VQA: 0.6M public chart/document images plus 300k internal dashboards/tables, yielding 150M text tokens.
  • Behavior Learning: Policy traces derived via behavioral cloning on successful agent runs (WebVoyager: 643 tasks, 15 sites; WebVoyagerExtended: 15,000 tasks, 330 sites), supporting learning of the policy MtM_t1. Traces comprise up to 30 steps with (thought, notes, action) triplets.
  • Validation: 1M policy outputs paired with frontier VLM-judged success booleans and rationales.

2.3 Preprocessing and Optimization

All data is unified into a multimodal chat-style format, supporting multitask objectives (policy, localizer, validator) within a single model. Fine-tuning is performed on a proprietary codebase, initialized from Qwen 2.5-VL-Instruct, optimizing mixed text/image generation and tool-call tasks. Toxicity screening using ToxiGen yields 2.1% flagged for Holo1-3B and 1.5% for Holo1-7B, comparable or better than the base models.

The absence of explicit training hyperparameters (learning rate, batch size, epochs, hardware) suggests adherence to established multimodal fine-tuning parameters (e.g., learning rate MtM_t2, batch size MtM_t3–MtM_t4, MtM_t5–MtM_t6 epochs).

3. Empirical Benchmarking and Results

3.1 UI Localization

Performance on public and custom benchmarks is summarized below:

Model Screenspot v1 v2 Pro GroundUI WebClick(agent) WebClick(calendar) WebClick(human) Avg
Qwen2.5-VL-3B-Instruct 82.78 84.34 7.91 70.50 76.26 51.70 85.07 65.51
UGround-V1-2B 77.12 79.31 21.32 78.60 84.41 50.76 78.50 67.15
UI-TARS-2B 66.82 69.39 16.38 80.75 78.68 42.05 70.33 60.63
Holo1-3B 85.93 88.91 23.66 74.75 83.02 65.91 88.80 73.55
Qwen2.5-VL-7B-Instruct 85.53 88.04 10.12 78.75 78.47 59.09 85.22 69.32
UGround-V1-7B 85.69 84.26 30.93 82.70 92.37 68.75 84.84 75.65
UI-TARS-7B 84.20 86.70 23.53 81.00 90.47 63.45 87.03 73.77
Holo1-7B 87.42 89.85 26.06 78.50 89.77 72.92 88.80 76.19

Holo1-3B and Holo1-7B exhibit the highest average scores for their scale. Notably, on calendar tasks—a known difficult category—Holo1-7B reaches 72.9% versus 50–63% for baselines.

3.2 Task Automation Performance

On WebVoyager (643 tasks, 10 sites):

  • Surfer-H is evaluated with up to 30 steps and 10 answer attempts per task.
  • Majority-vote by three GPT-4o runs is used to assess task success.
  • Baselines: OpenAI Operator (87%), Project Mariner (83.5%), BrowserUse (89.1%).
Attempts Accuracy (%) Cost (/task)</th></tr></thead><tbody><tr><td>1</td><td>69.6</td><td>0.05</td></tr><tr><td>2</td><td>80.8</td><td>0.07</td></tr><tr><td>5</td><td>88.2</td><td>0.10</td></tr><tr><td>10</td><td>92.2</td><td>0.13</td></tr></tbody></table></div><p>At10attempts,SurferH+attains92.2/task)</th> </tr> </thead><tbody><tr> <td>1</td> <td>69.6</td> <td>0.05</td> </tr> <tr> <td>2</td> <td>80.8</td> <td>0.07</td> </tr> <tr> <td>5</td> <td>88.2</td> <td>0.10</td> </tr> <tr> <td>10</td> <td>92.2</td> <td>0.13</td> </tr> </tbody></table></div> <p>At 10 attempts, Surfer-H+ attains 92.2% accuracy at M_t$7 per task, setting a new state of the art at dramatically lower cost.

3.3 Pareto-Optimality Analysis

Cost $M_t$8 per task is defined as:

$M_t$9

where $K$0, $K$1 are the input/output token counts for module $K$2, $K$3, $K$4 are the per-million token prices, and $K$5 is the number of screenshots.

A model is Pareto-optimal if it is not dominated in the space of accuracy $K$6 and cost $K$7:

$K$8

Empirical results place Holo1-based Surfer-H systems on the Pareto front for both 3B and 7B models over all evaluated attempt budgets.

4. Cost-Efficiency Framework

4.1 Inference Cost Model

Per the cost table:

Model $K$9 input $a_1…a_t$0 output Image-tokens per screenshot
GPT-4o 2.5 10.0 772
GPT-4o-mini 0.15 0.60 25,508
GPT-4.1 2.0 8.0 772
GPT-4.1-mini 0.40 1.6 2,348
Gemini-2.0-Flash 0.10 0.4 1,290
Qwen2.5-VL-7B-Instruct 0.15 0.6 1,280
Qwen2.5-VL-32B-Instruct 0.50 2.0 1,280

Total task inference cost is computed as: $a_1…a_t$1 where $a_1…a_t$2, $a_1…a_t$3 are input/output tokens per call.

4.2 Pareto Trade-Off

Pareto-optimality, as operationalized in Surfer-H evaluation, formalizes the trade-off boundary (accuracy vs. cost) for web agent systems, with Holo1-enabled Surfer-H variants tracing the lower-right frontier (high accuracy, low cost) across all measured operational points.

5. Open-Source Contributions and Accessibility

5.1 WebClick Dataset

WebClick is a publicly available benchmark, hosted at https://huggingface.co/datasets/Hcompany/WebClick under Apache-2.0. It includes 1,639 annotated web screenshots (100+ sites), each comprising an instruction, bounding box ground-truths, and subsets sampled from agent traces, crowd-sourced human activity, and adversarial calendar tasks.

5.2 Model Release

Holo1 model variants (3B and 7B) are released at https://huggingface.co/collections/Hcompany/holo1-683dd1eece7eb077b96d0cbd under Apache-2.0, with full multimodal weights and tokenization configuration.

5.3 Quick-Start Integration

Holo1 models can be instantiated via PyTorch/Transformers as follows:

$a_1…a_t$4


Surfer-H powered by Holo1 models demonstrates leading performance on vision-driven web automation benchmarks, attaining 92.2% accuracy on WebVoyager at $0.13 per task and Pareto-optimal cost-accuracy profiles. Open-sourcing of both code and key resources is positioned to support ongoing research on scalable, multimodal web agents (Andreux et al., 3 Jun 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Surfer-H and Holo1.