Surfer-H & Holo1: Vision-Language Web Automation

Updated 10 April 2026

Surfer-H and Holo1 are a GUI-centric web agent and vision-language model family that enable automation through sequential reasoning based solely on browser screenshots.
The architecture integrates three modules—Policy, Localizer, and Validator—each dedicated to action generation, UI element localization, and output validation.
The system achieves state-of-the-art accuracy with cost-effective performance, as demonstrated by high benchmark scores and Pareto-optimal cost-accuracy trade-offs.

Surfer-H is a graphical user interface (GUI)-centric web agent designed to perform user-defined tasks through sequential vision-language reasoning and interaction mechanisms. It operates without explicit access to the Document Object Model (DOM) or accessibility tree, relying exclusively on raw browser screenshots and text instructions. Core to its operation is integration with the Holo1 family of open-weight vision-LLMs (VLMs), which enable robust web navigation, complex multimodal reasoning, and efficient information extraction. Surfer-H, when powered by Holo1, achieves state-of-the-art performance across generalist UI benchmarks and the newly introduced WebClick localization benchmark, while demonstrating Pareto-optimal trade-offs between accuracy and computational cost, and supporting open-source research in the field (Andreux et al., 3 Jun 2025).

1. System Architecture and Modular Design

1.1 Surfer-H Agent Workflow

Surfer-H’s agentic process is structured as a sequential pipeline of three trainable VLM modules: Policy, Localizer, and Validator. Tasks are processed as follows:

Policy (π): At each timestep $t$ $t$ , the policy module generates the next “thought” (chain-of-thought entry), “notes” (textual records), and “action” based on the current memory $M_t$ $M_{t}$ , which consists of the task description, the last $K$ $K$ screenshots, prior thoughts/notes/actions, and the action sequence $a_1…a_t$ $a_{1} \dots a_{t}$ .
- Formal mapping (Eq. 1):
$(\text{thought}_{t+1}, \text{notes}_{t+1}, \text{action}_{t+1}) \sim π(\text{task}, \{\text{thought}_k, \text{notes}_k, \text{action}_k, \text{screenshot}_{t-3<k \leq t}\ |\ k \leq t\})$
Localizer: For actions requiring spatial localization (e.g., click “Submit”), the module predicts the precise $(x, y)$ coordinates of the relevant UI element over the screenshot.
Execution: The selected action is executed on a headless browser (operations include click, type, scroll, navigate, answer).
Validator ( $V$ ): When an “answer” action is produced, the validator module inspects the answer with reference to the task and recent screenshots, approving or rejecting it and providing rationales. Feedback is appended to $M_t$ $M_{t}$ if the answer is rejected.
- Validator mapping (Eq. 2):
$(\text{success}, \text{explanation}) \sim V(\text{task}, \text{answer}, \{\text{screenshot}_{t-3<k \leq t}\})$

1.2 Holo1 Model Internals

Holo1 is a multitask VLM family, each variant rooted in Qwen 2.5-VL-Instruct checkpoints (3B or 7B parameters). Key characteristics:

Tokenization: Utilizes the Qwen multimodal tokenizer; a $1200 \times 1200$ px screenshot yields approximately 1,280 image tokens (plus text).
Unified Multitask Head: A single model is multi-purposed to generate policy (thought/action), localize (predict coordinates or bounding boxes), and validate outputs (boolean plus rationale).
Input Format: Chat-style exchange, where system/user prompts and screenshots are passed alongside instructions, and assistant (model) responses provide multimodal outputs.

Inference flow:

Screenshot and agent memory are fed to Holo1 policy for thought and action generation.
If the action is of type CLICK(element_description), the description and screenshot are provided to the localizer role for $M_t$ 0 prediction.
On an ANSWER action, task context, answer, and screenshots are sent to the validator role for approval or further feedback.

2. Training Data and Methodology

2.1 Data Mixture and Breakdown

The consolidated Holo1 pretraining dataset totals approximately 31.5 billion tokens, structured as follows:

Dataset Group	Mixture Subset	Tokens (B)	% of Total
GUI Grounding	WebCrawl (4M pages, 89M clicks)	12.19	38.76
	Screenspot-V2 (open-source UI)	3.42	10.87
	WebSynthetic (calendars, tables, icons)	0.37	1.17
Subtotal		15.98	50.79
Complex Vis. Underst.	Coordinate Validation (5M triplets)	2.70	8.59
	UI Extraction (7M pages)	5.93	18.86
	VQA (charts, docs, dashboards)	1.52	4.84
Subtotal		10.16	32.28
Behavior Learning	Policy traces (WebVoyager/Extended)	4.87	15.48
	Validator pairs	0.46	1.45
Subtotal		5.32	16.93
Grand Total		31.46	100.0

2.2 Dataset Composition Details

WebCrawl: Encompasses 4M real-world webpages, with parsed HTML. Interactive elements are paired with synthetically generated “intents” (from a frontier model) and labeled with ground-truth click coordinates.
WebSynthetic: Adversarial proprietary UI tasks focusing on widgets, tables, and ambiguous icons.
Coordinate Validation: Contains 5M triplets judged correct/incorrect via Set-of-Marks prompting.
UI Extraction: Screenshots mapped to full element bounding boxes and semantic labels.
VQA: 0.6M public chart/document images plus 300k internal dashboards/tables, yielding 150M text tokens.
Behavior Learning: Policy traces derived via behavioral cloning on successful agent runs (WebVoyager: 643 tasks, 15 sites; WebVoyagerExtended: 15,000 tasks, 330 sites), supporting learning of the policy $M_t$ 1. Traces comprise up to 30 steps with (thought, notes, action) triplets.
Validation: 1M policy outputs paired with frontier VLM-judged success booleans and rationales.

2.3 Preprocessing and Optimization

All data is unified into a multimodal chat-style format, supporting multitask objectives (policy, localizer, validator) within a single model. Fine-tuning is performed on a proprietary codebase, initialized from Qwen 2.5-VL-Instruct, optimizing mixed text/image generation and tool-call tasks. Toxicity screening using ToxiGen yields 2.1% flagged for Holo1-3B and 1.5% for Holo1-7B, comparable or better than the base models.

The absence of explicit training hyperparameters (learning rate, batch size, epochs, hardware) suggests adherence to established multimodal fine-tuning parameters (e.g., learning rate $M_t$ 2, batch size $M_t$ 3– $M_t$ 4, $M_t$ 5– $M_t$ 6 epochs).

3. Empirical Benchmarking and Results

3.1 UI Localization

Performance on public and custom benchmarks is summarized below:

Model	Screenspot v1	v2	Pro	GroundUI	WebClick(agent)	WebClick(calendar)	WebClick(human)	Avg
Qwen2.5-VL-3B-Instruct	82.78	84.34	7.91	70.50	76.26	51.70	85.07	65.51
UGround-V1-2B	77.12	79.31	21.32	78.60	84.41	50.76	78.50	67.15
UI-TARS-2B	66.82	69.39	16.38	80.75	78.68	42.05	70.33	60.63
Holo1-3B	85.93	88.91	23.66	74.75	83.02	65.91	88.80	73.55
Qwen2.5-VL-7B-Instruct	85.53	88.04	10.12	78.75	78.47	59.09	85.22	69.32
UGround-V1-7B	85.69	84.26	30.93	82.70	92.37	68.75	84.84	75.65
UI-TARS-7B	84.20	86.70	23.53	81.00	90.47	63.45	87.03	73.77
Holo1-7B	87.42	89.85	26.06	78.50	89.77	72.92	88.80	76.19

Holo1-3B and Holo1-7B exhibit the highest average scores for their scale. Notably, on calendar tasks—a known difficult category—Holo1-7B reaches 72.9% versus 50–63% for baselines.

3.2 Task Automation Performance

On WebVoyager (643 tasks, 10 sites):

Surfer-H is evaluated with up to 30 steps and 10 answer attempts per task.
Majority-vote by three GPT-4o runs is used to assess task success.
Baselines: OpenAI Operator (87%), Project Mariner (83.5%), BrowserUse (89.1%).

Attempts

Accuracy (%)

Cost (

/task)</th> </tr> </thead><tbody><tr> <td>1</td> <td>69.6</td> <td>0.05</td> </tr> <tr> <td>2</td> <td>80.8</td> <td>0.07</td> </tr> <tr> <td>5</td> <td>88.2</td> <td>0.10</td> </tr> <tr> <td>10</td> <td>92.2</td> <td>0.13</td> </tr> </tbody></table></div> <p>At 10 attempts, Surfer-H+ attains 92.2% accuracy at

M_t$7 per task, setting a new state of the art at dramatically lower cost.

3.3 Pareto-Optimality Analysis

Cost $M_t$8 per task is defined as:

$M_t$9

where $K$0, $K$1 are the input/output token counts for module $K$2, $K$3, $K$4 are the per-million token prices, and $K$5 is the number of screenshots.

A model is Pareto-optimal if it is not dominated in the space of accuracy $K$6 and cost $K$7:

$K$8

Empirical results place Holo1-based Surfer-H systems on the Pareto front for both 3B and 7B models over all evaluated attempt budgets.

4. Cost-Efficiency Framework

4.1 Inference Cost Model

Per the cost table:

Model	$K$9 input	$a_1…a_t$0 output	Image-tokens per screenshot
GPT-4o	2.5	10.0	772
GPT-4o-mini	0.15	0.60	25,508
GPT-4.1	2.0	8.0	772
GPT-4.1-mini	0.40	1.6	2,348
Gemini-2.0-Flash	0.10	0.4	1,290
Qwen2.5-VL-7B-Instruct	0.15	0.6	1,280
Qwen2.5-VL-32B-Instruct	0.50	2.0	1,280

Total task inference cost is computed as: $a_1…a_t$1 where $a_1…a_t$2, $a_1…a_t$3 are input/output tokens per call.

4.2 Pareto Trade-Off

Pareto-optimality, as operationalized in Surfer-H evaluation, formalizes the trade-off boundary (accuracy vs. cost) for web agent systems, with Holo1-enabled Surfer-H variants tracing the lower-right frontier (high accuracy, low cost) across all measured operational points.

5. Open-Source Contributions and Accessibility

5.1 WebClick Dataset

WebClick is a publicly available benchmark, hosted at https://huggingface.co/datasets/Hcompany/WebClick under Apache-2.0. It includes 1,639 annotated web screenshots (100+ sites), each comprising an instruction, bounding box ground-truths, and subsets sampled from agent traces, crowd-sourced human activity, and adversarial calendar tasks.

5.2 Model Release

Holo1 model variants (3B and 7B) are released at https://huggingface.co/collections/Hcompany/holo1-683dd1eece7eb077b96d0cbd under Apache-2.0, with full multimodal weights and tokenization configuration.

5.3 Quick-Start Integration

Holo1 models can be instantiated via PyTorch/Transformers as follows:

$a_1…a_t$4

Surfer-H powered by Holo1 models demonstrates leading performance on vision-driven web automation benchmarks, attaining 92.2% accuracy on WebVoyager at $0.13 per task and Pareto-optimal cost-accuracy profiles. Open-sourcing of both code and key resources is positioned to support ongoing research on scalable, multimodal web agents (Andreux et al., 3 Jun 2025).

Markdown Report Issue Upgrade to Chat

References (1)

Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Surfer-H and Holo1.