- The paper presents a unified Find–Guide–Hide framework that grounds LLM outputs with direct HTML DOM overlays for verifiable web assistance.
- The study demonstrates significant efficiency and accuracy gains, reducing task times and manual efforts across retrieval, guidance, and content suppression modes.
- The results validate that coupling LLM reasoning with real-time DOM mutation enhances transparency, user control, and mixed-initiative human-AI collaboration.
PageGuide: A Browser Extension for In-Situ LLM-Grounded Web Assistance
Motivation and Problem Scope
The proliferation of LLM-powered browser agents has enabled users to retrieve information, automate web tasks, and filter online content with unprecedented ease. However, these systems routinely decouple their outputs from the underlying webpage state. Current assistants such as ChatGPT Atlas, Gemini, and autonomous browser agents typically surface results in isolated sidebars, offer automation without verifiability, and leave users with the burden of manual cross-referencing. This paradigm is poorly suited to tasks demanding transparency, control, and trust: users must either accept opaque outputs on faith or spend significant effort verifying claims against cluttered interfaces. Moreover, static filtering tools (e.g., ad blockers) lack the semantic flexibility to support intent-driven content suppression.
PageGuide reframes this interaction model by anchoring LLM outputs directly in the HTML DOM through visual overlays, thereby optimizing for verifiable, mixed-initiative human-AI collaboration. This design targets three persistent user needs: rapid in-situ information finding, step-wise procedural guidance, and selective, reviewable content hiding.
System Design
Unified Find–Guide–Hide Framework
PageGuide is implemented as a Manifest v3 Chrome extension and operates via a pipeline that couples LLM reasoning with real-time HTML DOM mutation. Upon user query, a high-accuracy intent router assigns the request to one of three handlers:
- Find: For factual lookups, PageGuide generates natural language answers with inline citations that reference SoM-indexed HTML elements. Each claim is backed by in-page highlights, allowing immediate verification.
- Guide: For procedural tasks, the extension formulates an ordered plan, presenting instructions step-by-step with visual pulsing on the target DOM element, requiring explicit user confirmation at each stage. The DOM is re-read after every action to accommodate state transitions or user corrections.
- Hide: For content suppression, LLM-based semantic scoring identifies elements matching the user’s intent. Before suppression, a review dialog surfaces per-element justifications and snippets, supporting fine-grained user oversight.
Each mode invokes structured LLM prompts that enforce grounded evidence and explicit action specification, systematically reducing ambiguity and ensuring output inspectability.
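The Guide behavior described above (plan a step, highlight it, wait for confirmation, then re-read the DOM) can be sketched as a simple loop. This is a minimal illustration in Python rather than the extension's own code, which the paper describes as a Manifest v3 Chrome extension. The names `Step`, `run_guide`, `read_dom`, `plan_next`, and `confirm` are hypothetical, and the planner here is a stub standing in for the LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    instruction: str   # text shown to the user
    target_index: int  # index of the element to highlight (assumed SoM-style index)


def run_guide(
    read_dom: Callable[[], List[str]],
    plan_next: Callable[[List[str], List[Step]], Optional[Step]],
    confirm: Callable[[Step], bool],
    max_steps: int = 20,
) -> List[Step]:
    """Run a step-by-step guide until the planner reports completion."""
    done: List[Step] = []
    for _ in range(max_steps):
        dom = read_dom()             # re-read the page before every step
        step = plan_next(dom, done)  # hypothetical stand-in for the LLM planner
        if step is None:             # planner reports the task is finished
            break
        if not confirm(step):        # user declined: stop instead of proceeding
            break
        done.append(step)
    return done


# Stubbed demo: two page states, one step per state.
pages = [["[0] <a> Settings"], ["[0] <button> Export data"]]
state = {"page": 0}


def read_dom() -> List[str]:
    return pages[min(state["page"], len(pages) - 1)]


def plan_next(dom: List[str], done: List[Step]) -> Optional[Step]:
    if len(done) >= len(pages):
        return None
    return Step(instruction=f"Click: {dom[0]}", target_index=0)


def confirm(step: Step) -> bool:
    state["page"] += 1  # simulate the user performing the action
    return True


steps = run_guide(read_dom, plan_next, confirm)
for s in steps:
    print(s.instruction)
```

Because the page is re-read at the top of every iteration, the second instruction is derived from the page state after the first action, which is the property the Guide description emphasizes.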
DOM Representation and Routing
The system leverages SoM-style structured representations of the visible DOM (D), assigning indices and bounding boxes to each interactive or text-bearing element. This enables precise surface-level grounding without reliance on fragile heuristics or fixed CSS selectors. The LLM-based router, with ~98% accuracy across diverse task classes, dispatches queries with robust disambiguation between finding, guiding, and hiding intents.
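To make the indexed representation concrete, the following sketch builds an index-plus-bounding-box listing from a list of elements and serializes it as text for a prompt. It is an assumption-laden illustration, not the paper's implementation: the `Element` fields, the viewport-based visibility test, and the `[index] <tag> text @ bbox` line format are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    tag: str
    text: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height), assumed CSS pixels
    interactive: bool = False


def is_visible(el: Element, viewport_w: int, viewport_h: int) -> bool:
    """Assumed visibility rule: non-empty box that overlaps the viewport."""
    x, y, w, h = el.bbox
    return (w > 0 and h > 0
            and x < viewport_w and y < viewport_h
            and x + w > 0 and y + h > 0)


def build_som(elements: List[Element], viewport_w: int, viewport_h: int):
    """Index every visible element that is interactive or carries text."""
    kept = [
        e for e in elements
        if is_visible(e, viewport_w, viewport_h)
        and (e.interactive or e.text.strip())
    ]
    return list(enumerate(kept))


def serialize(som) -> str:
    """Render one line per indexed element (hypothetical prompt format)."""
    return "\n".join(f"[{i}] <{e.tag}> {e.text.strip()} @ {e.bbox}" for i, e in som)


elements = [
    Element("h1", "Pricing", (0, 0, 800, 40)),
    Element("div", "", (0, 40, 800, 10)),                   # no text, not interactive
    Element("button", "Buy now", (20, 60, 120, 32), True),
    Element("p", "Footer text", (0, 5000, 800, 20)),        # below the viewport
]
som = build_som(elements, viewport_w=1280, viewport_h=720)
print(serialize(som))
```

Indices assigned this way are only stable for one snapshot of the page, so any reference to an index has to be resolved against the same snapshot that produced it.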
Prompt Engineering and Model Integration
All interaction modes are implemented via modular LLM prompts that enforce direct DOM grounding. PageGuide uses Gemini-3-Flash as its preferred backbone, chosen for its stronger empirical performance in evidence retrieval, task guidance, and selective suppression relative to the earlier Gemini-2.5-Flash and to baselines including SeeAct and LED-base methods.
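One way to check that a generated answer is actually grounded is to parse its inline citations and confirm that each cited index exists in the current element listing. The sketch below assumes citations appear as bracketed integers such as `[3]`. That format, and the names `cited_indices` and `check_grounding`, are illustrative assumptions rather than details taken from the paper.

```python
import re
from typing import Dict, List

# Assumed citation syntax: a bracketed integer index, e.g. "[12]".
CITATION = re.compile(r"\[(\d+)\]")


def cited_indices(answer: str) -> List[int]:
    """Return every cited index in order of appearance."""
    return [int(m) for m in CITATION.findall(answer)]


def check_grounding(answer: str, som_text: Dict[int, str]) -> Dict[str, List[int]]:
    """Split citations into those that resolve to an element and those that do not."""
    cited = cited_indices(answer)
    valid = sorted({i for i in cited if i in som_text})
    dangling = sorted({i for i in cited if i not in som_text})
    return {"highlight": valid, "dangling": dangling}


som_text = {0: "Pricing", 1: "Pro plan: $12/month", 2: "Cancel anytime"}
answer = "The Pro plan costs $12 per month [1] and can be cancelled at any time [2][7]."
result = check_grounding(answer, som_text)
print(result)
```

Indices under `highlight` are safe to overlay on the page. Anything under `dangling` points at an element that does not exist in the snapshot and could be dropped or trigger a retry, depending on how strict the caller wants to be.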
Empirical Evaluation
User Study Design
A controlled within-subject laboratory study (N=94) systematically compared unaided browsing (control) with PageGuide-enabled conditions. Participants completed tasks across all three modes, covering a breadth of websites and task complexities, with relevant behavioral metrics, objective outcomes, and subjective feedback collected.
Key Results
- Accuracy: PageGuide showed significant gains in Hide (30%→56%, p<10^-5) and Guide (23%→53%, p<10^-8) modes; the improvement in Find (81%→86%, p=0.32) was smaller and not statistically significant. The largest effect sizes were observed in procedural and content-filtering scenarios.
- Efficiency: Task completion time fell in all modes: Hide by roughly 70% (104s→32s, p<10^-13), Guide by 29.1s, and Find by 12.4s (N=940 for Find). Manual-effort metrics such as Ctrl+F usage and scrolling dropped by 80% and 60%, respectively.
- Behavioral Analytics: All interaction signals (clicks, mouse distance, text selection) reduced with PageGuide, except in procedural navigation where increased page visits reflected targeted guidance rather than aimless exploration.
- Perceived Usability: Likert-scale survey results aligned with objective measures: 91% of users found Find more accurate, 77–89% reported tasks were subjectively easier across all modes, and nearly three-fourths of users reported difficulty completing tasks without the extension in Guide/Hide scenarios. Notably, Guide yielded a higher proportion of partial completions, suggesting enhanced user persistence and engagement.
- Model Comparison: Gemini-3-Flash outperformed Gemini-2.5-Flash and benchmarked baselines in all three modes, achieving higher evidence recall, answer correctness, F1 scores on QASPER/Natural Questions, and task success rates on Online-Mind2Web.
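As a quick arithmetic check on the figures above, the sketch below recomputes the relative time reduction for Hide and the absolute accuracy gains in percentage points from the rounded values reported. The function names are illustrative, and because the inputs are rounded, the results are approximate.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrank, relative to its starting value."""
    return (before - after) / before


def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


hide_time = relative_reduction(104, 32)  # Hide: 104s -> 32s
gains = {
    "Hide": point_gain(30, 56),
    "Guide": point_gain(23, 53),
    "Find": point_gain(81, 86),
}

print(f"Hide time reduction: {hide_time:.1%}")
for mode, gain in gains.items():
    print(f"{mode} accuracy gain: {gain} points")
```

From the rounded times, the Hide reduction comes to about 69%. The accuracy figures are absolute gains of 26, 30, and 5 percentage points for Hide, Guide, and Find, respectively.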
Practical and Theoretical Implications
PageGuide's results support the hypothesis that direct DOM grounding and mixed-initiative interaction improve efficiency, transparency, and user trust over opaque end-to-end automation or output-only grounding, with the clearest gains in the Guide and Hide modes. In contrast to existing web agents and filtering tools, PageGuide integrates explainability and user oversight as first-class primitives. The approach is directly extensible to multi-modal and cross-document contexts (PDFs, image QA) as demonstrated in supplementary functionalities.
From a theoretical perspective, this work reinforces the value of transparency and user interactivity in agentic web systems, echoing themes from explainable AI and interactive XAI research [10.1145/374(2413.37891)34, highlightedcot]. By coupling every LLM reasoning step to observable page changes, PageGuide operationalizes faithfulness and mitigates the plausibility-vs-faithfulness gap observed in prior LLM analysis (Agarwal et al., 2024). Additionally, the mixed-initiative design parallels recent developments in copilot frameworks [cowpilot2025], facilitating step-level correction and intervention.
Limitations and Future Directions
The paper identifies several limitations:
- Single-mode routing: The current router assigns each query to a single mode and cannot decompose composite queries; future work should develop multi-step planners capable of orchestrating sequential Find–Guide–Hide invocations.
- Cross-page persistence: DOM highlights are limited to current page contexts; session-level aggregation and evidence tracking remain open problems.
- Guide overhead: Per-step confirmation is efficient for novices but introduces workflow friction for expert users. Adaptive granularity and richer multi-turn correction mechanisms are needed.
- Hide memory: The lack of persistent suppression profiles limits user personalization. Integration of cross-session preference learning and on-the-fly reapplication is a logical extension.
These challenges point to the need for adaptive, user-centric copilot architectures that learn from and generalize user interaction patterns.
Conclusion
PageGuide establishes a rigorous, empirically backed framework for DOM-grounded web assistance, outperforming both unaided browsing and commercial web agents in accuracy, efficiency, and perceived usability. Its architectural emphasis on transparency, evidence, and user control articulates a clear technical path forward for trustworthy web agency. The modularity of the Find–Guide–Hide paradigm positions PageGuide as a foundation for future research in in-situ, interactive, multimodal web LLM systems, with implications for both robust deployment and explainable agent design in human-facing AI.
Reference: "PageGuide: Browser extension to assist users in navigating a webpage and locating information" (2604.23772)