Surfer-H & Holo1: Vision-Language Web Automation
- Surfer-H and Holo1 are a GUI-centric web agent and vision-language model family that enable automation through sequential reasoning based solely on browser screenshots.
- The architecture integrates three modules—Policy, Localizer, and Validator—each dedicated to action generation, UI element localization, and output validation.
- The system achieves state-of-the-art accuracy with cost-effective performance, as demonstrated by high benchmark scores and Pareto-optimal cost-accuracy trade-offs.
Surfer-H is a graphical user interface (GUI)-centric web agent designed to perform user-defined tasks through sequential vision-language reasoning and interaction mechanisms. It operates without explicit access to the Document Object Model (DOM) or accessibility tree, relying exclusively on raw browser screenshots and text instructions. Core to its operation is integration with the Holo1 family of open-weight vision-LLMs (VLMs), which enable robust web navigation, complex multimodal reasoning, and efficient information extraction. Surfer-H, when powered by Holo1, achieves state-of-the-art performance across generalist UI benchmarks and the newly introduced WebClick localization benchmark, while demonstrating Pareto-optimal trade-offs between accuracy and computational cost, and supporting open-source research in the field (Andreux et al., 3 Jun 2025).
1. System Architecture and Modular Design
1.1 Surfer-H Agent Workflow
Surfer-H’s agentic process is structured as a sequential pipeline of three trainable VLM modules: Policy, Localizer, and Validator. Tasks are processed as follows:
- Policy (π): At each timestep , the policy module generates the next “thought” (chain-of-thought entry), “notes” (textual records), and “action” based on the current memory , which consists of the task description, the last screenshots, prior thoughts/notes/actions, and the action sequence .
- Formal mapping (Eq. 1):
- Localizer: For actions requiring spatial localization (e.g., click “Submit”), the module predicts the precise coordinates of the relevant UI element over the screenshot.
- Execution: The selected action is executed on a headless browser (operations include click, type, scroll, navigate, answer).
- Validator (): When an “answer” action is produced, the validator module inspects the answer with reference to the task and recent screenshots, approving or rejecting it and providing rationales. Feedback is appended to if the answer is rejected.
- Validator mapping (Eq. 2):
1.2 Holo1 Model Internals
Holo1 is a multitask VLM family, each variant rooted in Qwen 2.5-VL-Instruct checkpoints (3B or 7B parameters). Key characteristics:
- Tokenization: Utilizes the Qwen multimodal tokenizer; a px screenshot yields approximately 1,280 image tokens (plus text).
- Unified Multitask Head: A single model is multi-purposed to generate policy (thought/action), localize (predict coordinates or bounding boxes), and validate outputs (boolean plus rationale).
- Input Format: Chat-style exchange, where system/user prompts and screenshots are passed alongside instructions, and assistant (model) responses provide multimodal outputs.
Inference flow:
- Screenshot and agent memory are fed to Holo1 policy for thought and action generation.
- If the action is of type CLICK(element_description), the description and screenshot are provided to the localizer role for 0 prediction.
- On an ANSWER action, task context, answer, and screenshots are sent to the validator role for approval or further feedback.
2. Training Data and Methodology
2.1 Data Mixture and Breakdown
The consolidated Holo1 pretraining dataset totals approximately 31.5 billion tokens, structured as follows:
| Dataset Group | Mixture Subset | Tokens (B) | % of Total |
|---|---|---|---|
| GUI Grounding | WebCrawl (4M pages, 89M clicks) | 12.19 | 38.76 |
| Screenspot-V2 (open-source UI) | 3.42 | 10.87 | |
| WebSynthetic (calendars, tables, icons) | 0.37 | 1.17 | |
| Subtotal | 15.98 | 50.79 | |
| Complex Vis. Underst. | Coordinate Validation (5M triplets) | 2.70 | 8.59 |
| UI Extraction (7M pages) | 5.93 | 18.86 | |
| VQA (charts, docs, dashboards) | 1.52 | 4.84 | |
| Subtotal | 10.16 | 32.28 | |
| Behavior Learning | Policy traces (WebVoyager/Extended) | 4.87 | 15.48 |
| Validator pairs | 0.46 | 1.45 | |
| Subtotal | 5.32 | 16.93 | |
| Grand Total | 31.46 | 100.0 |
2.2 Dataset Composition Details
- WebCrawl: Encompasses 4M real-world webpages, with parsed HTML. Interactive elements are paired with synthetically generated “intents” (from a frontier model) and labeled with ground-truth click coordinates.
- WebSynthetic: Adversarial proprietary UI tasks focusing on widgets, tables, and ambiguous icons.
- Coordinate Validation: Contains 5M triplets judged correct/incorrect via Set-of-Marks prompting.
- UI Extraction: Screenshots mapped to full element bounding boxes and semantic labels.
- VQA: 0.6M public chart/document images plus 300k internal dashboards/tables, yielding 150M text tokens.
- Behavior Learning: Policy traces derived via behavioral cloning on successful agent runs (WebVoyager: 643 tasks, 15 sites; WebVoyagerExtended: 15,000 tasks, 330 sites), supporting learning of the policy 1. Traces comprise up to 30 steps with (thought, notes, action) triplets.
- Validation: 1M policy outputs paired with frontier VLM-judged success booleans and rationales.
2.3 Preprocessing and Optimization
All data is unified into a multimodal chat-style format, supporting multitask objectives (policy, localizer, validator) within a single model. Fine-tuning is performed on a proprietary codebase, initialized from Qwen 2.5-VL-Instruct, optimizing mixed text/image generation and tool-call tasks. Toxicity screening using ToxiGen yields 2.1% flagged for Holo1-3B and 1.5% for Holo1-7B, comparable or better than the base models.
The absence of explicit training hyperparameters (learning rate, batch size, epochs, hardware) suggests adherence to established multimodal fine-tuning parameters (e.g., learning rate 2, batch size 3–4, 5–6 epochs).
3. Empirical Benchmarking and Results
3.1 UI Localization
Performance on public and custom benchmarks is summarized below:
| Model | Screenspot v1 | v2 | Pro | GroundUI | WebClick(agent) | WebClick(calendar) | WebClick(human) | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B-Instruct | 82.78 | 84.34 | 7.91 | 70.50 | 76.26 | 51.70 | 85.07 | 65.51 |
| UGround-V1-2B | 77.12 | 79.31 | 21.32 | 78.60 | 84.41 | 50.76 | 78.50 | 67.15 |
| UI-TARS-2B | 66.82 | 69.39 | 16.38 | 80.75 | 78.68 | 42.05 | 70.33 | 60.63 |
| Holo1-3B | 85.93 | 88.91 | 23.66 | 74.75 | 83.02 | 65.91 | 88.80 | 73.55 |
| Qwen2.5-VL-7B-Instruct | 85.53 | 88.04 | 10.12 | 78.75 | 78.47 | 59.09 | 85.22 | 69.32 |
| UGround-V1-7B | 85.69 | 84.26 | 30.93 | 82.70 | 92.37 | 68.75 | 84.84 | 75.65 |
| UI-TARS-7B | 84.20 | 86.70 | 23.53 | 81.00 | 90.47 | 63.45 | 87.03 | 73.77 |
| Holo1-7B | 87.42 | 89.85 | 26.06 | 78.50 | 89.77 | 72.92 | 88.80 | 76.19 |
Holo1-3B and Holo1-7B exhibit the highest average scores for their scale. Notably, on calendar tasks—a known difficult category—Holo1-7B reaches 72.9% versus 50–63% for baselines.
3.2 Task Automation Performance
On WebVoyager (643 tasks, 10 sites):
- Surfer-H is evaluated with up to 30 steps and 10 answer attempts per task.
- Majority-vote by three GPT-4o runs is used to assess task success.
- Baselines: OpenAI Operator (87%), Project Mariner (83.5%), BrowserUse (89.1%).
| Attempts | Accuracy (%) | Cost (M_t$7 per task, setting a new state of the art at dramatically lower cost.
3.3 Pareto-Optimality AnalysisCost $M_t$8 per task is defined as: $M_t$9 where $K$0, $K$1 are the input/output token counts for module $K$2, $K$3, $K$4 are the per-million token prices, and $K$5 is the number of screenshots. A model is Pareto-optimal if it is not dominated in the space of accuracy $K$6 and cost $K$7: $K$8 Empirical results place Holo1-based Surfer-H systems on the Pareto front for both 3B and 7B models over all evaluated attempt budgets. 4. Cost-Efficiency Framework4.1 Inference Cost ModelPer the cost table:
Total task inference cost is computed as: $a_1…a_t$1 where $a_1…a_t$2, $a_1…a_t$3 are input/output tokens per call. 4.2 Pareto Trade-OffPareto-optimality, as operationalized in Surfer-H evaluation, formalizes the trade-off boundary (accuracy vs. cost) for web agent systems, with Holo1-enabled Surfer-H variants tracing the lower-right frontier (high accuracy, low cost) across all measured operational points. 5. Open-Source Contributions and Accessibility5.1 WebClick DatasetWebClick is a publicly available benchmark, hosted at https://huggingface.co/datasets/Hcompany/WebClick under Apache-2.0. It includes 1,639 annotated web screenshots (100+ sites), each comprising an instruction, bounding box ground-truths, and subsets sampled from agent traces, crowd-sourced human activity, and adversarial calendar tasks. 5.2 Model ReleaseHolo1 model variants (3B and 7B) are released at https://huggingface.co/collections/Hcompany/holo1-683dd1eece7eb077b96d0cbd under Apache-2.0, with full multimodal weights and tokenization configuration. 5.3 Quick-Start IntegrationHolo1 models can be instantiated via PyTorch/Transformers as follows: $a_1…a_t$4 Surfer-H powered by Holo1 models demonstrates leading performance on vision-driven web automation benchmarks, attaining 92.2% accuracy on WebVoyager at $0.13 per task and Pareto-optimal cost-accuracy profiles. Open-sourcing of both code and key resources is positioned to support ongoing research on scalable, multimodal web agents (Andreux et al., 3 Jun 2025). Sign up for free to explore the frontiers of research
Discover trending papers, chat with arXiv, and track the latest research shaping the future of science and technology.
Discover trending papers, chat with arXiv, and more.
|
|---|