SeeAct Web Agent Overview
- SeeAct Web Agent is an autonomous framework that integrates visual grounding, language-based planning, and browser automation to execute complex natural language tasks.
- It employs multiple grounding strategies—attribute-based, textual-choice, and image annotation—to resolve and select DOM elements with high reliability.
- Robust evaluation protocols using datasets like Mind2Web reveal performance gaps and highlight the need for modular improvements and future research directions.
SeeAct Web Agent refers to a family of autonomous agents designed for web interaction, combining visual grounding, language-model-based planning, and browser-based actuation to execute complex, natural language-driven tasks on arbitrary websites. Developed principally in the context of Large Multimodal Models (LMMs) such as GPT-4V(ision), SeeAct integrates step-wise visual understanding, DOM manipulation, and iterative action planning, establishing a generalist framework for web automation and agentic interaction. Its architecture is notable for modular action generation, cross-modal grounding strategies, and robust evaluation protocols spanning both offline and live web task scenarios (Zheng et al., 2024).
1. System Architecture and Action Loop
SeeAct is conceptualized as a looped, step-wise agent operating over three primary modalities: (1) high-level task description in natural language, (2) browser state as HTML/DOM trees, and (3) rendered page screenshots for visual context. At each time step $t$, the agent consumes state $s_t$ (DOM and image) and produces an atomic action $a_t$, formally an (element, operation, value) triplet:
- $a_t = (e_t, o_t, v_t)$, where $e_t$ is a DOM element, $o_t$ is the operation applied to it, and $v_t$ is an optional input (Zheng et al., 2024).
Action generation is separated from element grounding:
- The LMM first outputs a textual action plan ($\tilde{a}_t$).
- An element-grounding module resolves $\tilde{a}_t$ to a concrete DOM element using one or more strategies (attribute matching, choice ranking, and visual annotation).
For execution, SeeAct pairs with tools such as Playwright to perform browser automation using the selected action. The environment updates, and the agent receives new observations for subsequent planning.
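The loop described in this section can be sketched as follows. This is a minimal illustration, not SeeAct's actual code: the names `Observation`, `Action`, and `run_episode`, and the callables `observe`, `plan`, `ground`, and `execute`, are hypothetical stand-ins for the screenshot/DOM capture, the LMM planner, the grounding module, and the browser driver.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """Browser state at one step: serialized DOM plus a screenshot."""
    dom: str
    screenshot: bytes


@dataclass
class Action:
    """The (element, operation, value) triplet."""
    element: str                 # hypothetical: a selector for the DOM element
    operation: str               # illustrative operation name, e.g. "CLICK"
    value: Optional[str] = None  # optional input, e.g. text to type


def run_episode(
    task: str,
    observe: Callable[[], Observation],
    plan: Callable[[str, Observation, List[Action]], Optional[str]],
    ground: Callable[[str, Observation], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Run the plan -> ground -> execute loop until the planner stops."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = observe()                         # new DOM and screenshot
        description = plan(task, obs, history)  # textual action plan
        if description is None:                 # planner signals completion
            break
        action = ground(description, obs)       # resolve to a concrete element
        execute(action)                         # e.g. through a browser driver
        history.append(action)
    return history
```

Keeping `plan` and `ground` as separate callables mirrors the separation of action generation from element grounding, so either side can be swapped out or evaluated in isolation.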
2. Grounding Strategies and Multi-modal Integration
Mapping high-level action descriptions to executable browser actions is a central challenge addressed via multiple grounding strategies in SeeAct:
- Attribute-based: The LMM is prompted to specify both the element type and its visible text. DOM elements matching both are selected heuristically, and further disambiguation is needed when several candidates match.
- Textual Choices: Candidate elements are ranked (e.g., via a CrossEncoder model), chunked, and presented to the LMM as a multi-choice list with HTML snippets. The LMM picks the target (“A”, “B”, …). This strategy is markedly more reliable than the attribute-based and image-annotation strategies.
- Image Annotation: Candidates receive overlaid bounding boxes labeled numerically on the screenshot. The LMM identifies the match by index. This strategy is prone to hallucination and misalignment.
- Oracle Grounding: Human annotators simulate perfect grounding for upper-bound performance evaluation.
Textual-choice grounding is the most accurate of the automated methods, but a notable gap of 20–30% remains relative to oracle performance, largely attributed to grounding ambiguities and multi-element similarity in complex pages (Zheng et al., 2024).
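As a rough illustration of textual-choice grounding, the sketch below chunks an already-ranked candidate list, labels each batch with letters, and parses the model's reply. It makes several assumptions: the candidates are ranked beforehand, `ask_model` is a hypothetical callable wrapping the LMM, and the prompt wording and the `ANSWER: <letter>` reply format are invented for this example rather than taken from SeeAct.

```python
import re
from string import ascii_uppercase
from typing import Callable, List, Optional, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split the ranked candidates into consecutive batches."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def build_choice_prompt(action_description: str, candidates: Sequence[str]) -> str:
    """Format one batch of HTML snippets as a lettered multi-choice question."""
    if len(candidates) >= len(ascii_uppercase):
        raise ValueError("batch too large to label with single letters")
    lines = [f"Target action: {action_description}",
             "Reply with 'ANSWER: <letter>'."]
    for label, html in zip(ascii_uppercase, candidates):
        lines.append(f"{label}. {html}")
    # The letter after the last candidate is reserved for "no match".
    lines.append(f"{ascii_uppercase[len(candidates)]}. None of the above")
    return "\n".join(lines)


def parse_choice(reply: str, num_candidates: int) -> Optional[int]:
    """Return the chosen candidate index, or None if no candidate was picked."""
    match = re.search(r"ANSWER:\s*([A-Z])\b", reply)
    if match is None:
        return None
    index = ascii_uppercase.index(match.group(1))
    return index if index < num_candidates else None


def ground_by_textual_choice(
    action_description: str,
    ranked_candidates: Sequence[str],
    ask_model: Callable[[str], str],  # hypothetical LMM wrapper
    batch_size: int = 5,
) -> Optional[str]:
    """Query batch by batch and return the first candidate the model selects."""
    for batch in chunk(ranked_candidates, batch_size):
        reply = ask_model(build_choice_prompt(action_description, batch))
        index = parse_choice(reply, len(batch))
        if index is not None:
            return batch[index]
    return None
```

Reserving an explicit "None of the above" option lets the model reject a batch instead of being forced to pick a wrong element, at which point the next batch is tried.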
3. Evaluation Protocols, Benchmarks, and Metrics
SeeAct agents are evaluated both offline (on cached web states) and online (live interaction) using datasets like Mind2Web (Zheng et al., 2024) and controlled test suites (Chevrot et al., 2 Apr 2025). Metrics include:
- Element Accuracy: Correctness of selected DOM elements.
- Step and Task Success Rates: Proportion of individual steps, or of whole tasks, that are completed correctly.
- Binary Classification (Testing Contexts): Accuracy, sensitivity (recall for failing tests), specificity (true-negative rate), especially in autonomous testing settings (Chevrot et al., 2 Apr 2025).
Example quantitative highlights:
- On Mind2Web, GPT-4V(ision) with Textual Choice grounding achieves a 40.2% step success rate (Cross-Task), 37.8% full-task online completion, and 51.1% under oracle grounding (Zheng et al., 2024).
- For autonomous test agent variants (SeeAct-ATA), average pass/fail accuracy is 0.55, improving to 0.71 for modular orchestrator-based PinATA (Chevrot et al., 2 Apr 2025).
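The metrics listed in this section follow standard definitions, which can be written down as a short sketch. The function names and the boolean input format are assumptions made for this example; following the text, a "positive" is a test the agent reports as failing, so sensitivity is recall on failing tests and specificity is the true-negative rate.

```python
from typing import Dict, Sequence


def binary_test_metrics(
    expected_fail: Sequence[bool], predicted_fail: Sequence[bool]
) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity for pass/fail verdicts."""
    pairs = list(zip(expected_fail, predicted_fail))
    tp = sum(1 for e, p in pairs if e and p)          # failing test, reported fail
    fn = sum(1 for e, p in pairs if e and not p)      # failing test, reported pass
    tn = sum(1 for e, p in pairs if not e and not p)  # passing test, reported pass
    fp = sum(1 for e, p in pairs if not e and p)      # passing test, reported fail
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def step_success_rate(step_ok: Sequence[bool]) -> float:
    """Fraction of individual steps judged correct."""
    return sum(step_ok) / len(step_ok) if step_ok else 0.0


def task_success_rate(tasks: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks in which every step was judged correct."""
    if not tasks:
        return 0.0
    return sum(1 for steps in tasks if all(steps)) / len(tasks)
```

Because a task counts as successful only when all of its steps succeed, the task success rate is typically well below the step success rate on long tasks.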
4. Specializations: Autonomous Test Agents (SeeAct-ATA)
SeeAct-ATA adapts the SeeAct paradigm for end-to-end web application testing. It specializes in reading natural language test cases consisting of sequential steps and embedded assertions. The workflow involves:
- Prompting an LLM (e.g., GPT-4o) to reason about current UI state, propose atomic actions, and verify step-wise assertions.
- Executing actions via Playwright after grounding the selected element.
- Updating state (screenshots, DOM), and verifying assertions.
- Returning a pass/fail verdict and fault localization if a test fails (Chevrot et al., 2 Apr 2025).
Performance is benchmarked on a reproducible suite across multiple web applications (Classifieds, Postmill, OneStopShop):
- PinATA, a modular variant with an orchestrator-actor-assertor architecture and advanced “Set-of-Marks” grounding, outperforms the monolithic SeeAct-ATA in accuracy (0.71 vs 0.55, an absolute gain of 0.16, or roughly 29% in relative terms).
- Sensitivity and specificity reach up to 0.94 and 0.62, respectively, on specific apps (Chevrot et al., 2 Apr 2025).
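The test-agent workflow above (act, verify, return a verdict with fault localization) can be sketched in a few lines. This is an illustrative outline only: `Step`, `Verdict`, and `run_test_case` are hypothetical names, and the `act` and `check` callables stand in for the LLM-driven action execution and assertion verification that a real agent would perform.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Step:
    """One natural-language test step with an optional embedded assertion."""
    instruction: str
    assertion: Optional[str] = None


@dataclass
class Verdict:
    """Pass/fail result; on failure, failed_step localizes the fault."""
    passed: bool
    failed_step: Optional[int] = None  # 1-based index of the failing step
    reason: Optional[str] = None


def run_test_case(
    steps: Sequence[Step],
    act: Callable[[str], bool],    # hypothetical: perform the step, True if done
    check: Callable[[str], bool],  # hypothetical: verify assertion on new state
) -> Verdict:
    """Execute steps in order and stop at the first failure."""
    for number, step in enumerate(steps, start=1):
        if not act(step.instruction):
            return Verdict(False, number, "action could not be performed")
        if step.assertion is not None and not check(step.assertion):
            return Verdict(False, number, f"assertion failed: {step.assertion}")
    return Verdict(True)
```

Stopping at the first failure is a design choice made here for simplicity; it yields a single localized fault rather than a list of every violated assertion.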
5. Pipeline Decomposition and Failure Analysis
Fine-grained modular analysis reveals that SeeAct's pipeline naturally decomposes into three diagnosable stages: action prediction, element grounding, and action selection (Röder et al., 17 Sep 2025). This decomposition enables identification of bottlenecks:
- Planning errors (Stage 1) and action selection ambiguities (Stage 3) together account for more than 30% of task failures.
- Grounding errors often stem from ambiguous HTML, multiple visually/structurally identical elements, or insufficient pre-selection.
- Explicit global memory or shared context across parallel batches, hybrid visual-semantic grounding, and section-aware reasoning are identified as key to reducing redundant candidate selection and improving end-to-end accuracy.
A table summarizing modular pipeline performance on Mind2Web (Default pipeline, 90-task subset) is provided below (Röder et al., 17 Sep 2025):
| Model | Action Prediction Acc. | Grounding Acc. | End-to-end Acc. (FV) |
|---|---|---|---|
| GPT-4o | 70.17% | 62.87% | 48.78% |
| Gemini-1.5-pro | 56.18% | 52.94% | 35.71% |
| Claude 3.5 Sonnet | 56.58% | 49.27% | 24.08% |
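One way to use the three-stage decomposition for diagnosis is to attribute each failed step to the earliest stage that went wrong. The sketch below assumes per-step correctness flags for each stage are already available; the `StepRecord` fields and the attribution rule are illustrative assumptions, not the evaluation procedure of the cited study.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StepRecord:
    """Hypothetical per-step log: whether each pipeline stage was correct."""
    action_predicted: bool  # Stage 1: action prediction
    element_grounded: bool  # Stage 2: element grounding
    action_selected: bool   # Stage 3: action selection


def first_failing_stage(record: StepRecord) -> Optional[str]:
    """Name the earliest stage that failed, or None if the step succeeded."""
    if not record.action_predicted:
        return "action_prediction"
    if not record.element_grounded:
        return "element_grounding"
    if not record.action_selected:
        return "action_selection"
    return None


def failure_breakdown(records: Iterable[StepRecord]) -> Counter:
    """Count failed steps by the stage they are attributed to."""
    stages = (first_failing_stage(r) for r in records)
    return Counter(stage for stage in stages if stage is not None)
```

Attributing to the earliest failing stage avoids double counting, since an error in action prediction usually makes the later stages fail as well.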
6. Extensions: Multimodal and Egocentric Web Agent Benchmarks
Recent work extends SeeAct-style agents to multimodal and egocentric scenarios:
- Ego2Web introduces benchmarks that require perception over real-world, egocentric video streams, linking physical object identification (from wearables) with execution of web tasks (search, purchase, lookup). This highlights the importance of dense temporal visual input, modularized vision/planning, and error-feedback loops (Yu et al., 23 Mar 2026).
- Task performance in this context remains low (34.2% for SeeAct/GPT-4V, 58.6% for the best browser-use agent), revealing substantial headroom for improvement, particularly in visual-to-symbol grounding and dynamic UI adaptation.
Essential design implications include preserving raw video signals for grounding, explicit separation of vision and web-planning modules, domain-specific UI toolkits, and integrating hybrid visual, DOM, and accessibility cues for robust element identification.
7. Limitations and Future Research Directions
Limitations identified across multiple studies include:
- Grounding bottlenecks: Substantial error rates in element selection arise from similarity, missing DOM cues, and long-range task dependencies.
- Planning robustness: Agents often explore ahead of prescribed steps, struggle with multi-path equivalence in web flows, and suffer in longer-step scenarios due to cascading errors.
- UI observability: Non-standard widgets, dynamic overlays, or elements outside the DOM snapshot degrade reliability.
- Action/model capacity: Lack of support for complex browser operations (tab management, drag-and-drop) and fine-grained GUI input (autocomplete triggers).
- Assertion verifiability: Limitations of screenshot-only or static DOM verification in scenarios requiring layout or side-effect checks.
Promising research directions comprise:
- Enhanced hybrid visual/structural/action grounding combining all available modalities.
- Modular planning architectures with explicit orchestrators, shared memory, and multi-agent feedback.
- Benchmarking with multi-path and multivalent success definitions to reflect the non-uniqueness of web flows.
- Human-in-the-loop guidance for correction of persistent misgroundings and reliability improvement (Zheng et al., 2024, Chevrot et al., 2 Apr 2025, Röder et al., 17 Sep 2025, Yu et al., 23 Mar 2026).
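To make the idea of multi-path success definitions concrete, the sketch below scores a trajectory against several equally valid reference paths instead of a single gold sequence. It is a hypothetical scoring scheme written for illustration: the `ActionKey` tuple format, the exact-match rule, and the prefix-based partial credit are all assumptions, not part of any benchmark cited here.

```python
from typing import Sequence, Tuple

# Hypothetical action encoding: (element, operation, value).
ActionKey = Tuple[str, str, str]


def matches_any_path(
    trajectory: Sequence[ActionKey],
    reference_paths: Sequence[Sequence[ActionKey]],
) -> bool:
    """Success if the trajectory equals any one of the valid reference paths."""
    return any(list(trajectory) == list(path) for path in reference_paths)


def best_prefix_progress(
    trajectory: Sequence[ActionKey],
    reference_paths: Sequence[Sequence[ActionKey]],
) -> float:
    """Largest fraction of any reference path matched as a prefix."""
    best = 0.0
    for path in reference_paths:
        if not path:
            continue
        matched = 0
        for got, want in zip(trajectory, path):
            if got != want:
                break
            matched += 1
        best = max(best, matched / len(path))
    return best
```

For example, if a cart can be opened either through a menu or through a cart icon, both sequences appear in `reference_paths` and an agent following either one is counted as successful.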
Advancements along these axes position SeeAct and its derivatives as a foundation for increasingly capable, resilient, and generalist web agents in both automation and interactive assistance domains.