GuideWeb Agent: In-App Guide Automation
- GuideWeb Agent is a system that automatically identifies interactive web elements and produces aligned in-app guidance using a data-driven method.
- It uses a "Shorter" module that prunes non-informative sections of the HTML input, cutting token length enough to keep full real-world pages tractable.
- The agent is benchmarked with metrics such as F1, BLEU, and ROUGE, outperforming zero-shot LLM baselines and providing a reference baseline for digital adoption platforms.
GuideWeb Agent is an agentic system introduced in conjunction with the GuideWeb benchmark to address the problem of automatic in-app guide generation for real-world web user interfaces. The system formulates and empirically studies the challenge of selecting which page elements should be annotated with operational guidance, and generating aligned, concise instructional text, all in an end-to-end, data-driven manner. It establishes a rigorous benchmark and serves as a baseline agent, evaluating the key steps, architectural features, and limitations in the broader context of web-based digital adoption platforms and in-app guidance automation (Gan et al., 2 Feb 2026).
1. Benchmark Formulation and Task Definition
The GuideWeb benchmark operationalizes the end-to-end process of guide creation as a structured, supervised prediction task over static HTML snapshots of popular website main pages. Each page is represented by its raw HTML, from which a DOM tree is parsed and a set of interactive elements is extracted. Each extracted element carries its attributes (e.g., tag, text content, and XPath).
The agent must output an annotation comprising:
- A binary flag indicating whether the page needs guidance at all.
- A set of per-element guide annotations for the selected target elements, each including:
  - an intent: natural-language text explaining the user's reason for interacting with the element,
  - an action type (categorical, e.g., "search", "login"),
  - concise guide text aligned to the user's intent,
  - the grounding fields needed to attach the guide overlay to the element.
This two-stage formulation—(1) guide target identification, (2) element-grounded generation—requires the agent to be selective (discriminating which elements merit annotation) and informative (generating coherent, context-sensitive text).
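The two-stage output can be illustrated with a minimal sketch in Python. The field names (`needs_guidance`, `guides`, `xpath`, `intent`, `action_type`, `guide_text`) are assumptions standing in for the benchmark's actual schema keys:

```python
# Illustrative annotation for one page; field names are assumptions,
# not the benchmark's exact schema.
annotation = {
    "needs_guidance": True,
    "guides": [
        {
            "xpath": "//form[@id='search']//input",
            "intent": "Find products quickly by keyword.",
            "action_type": "search",
            "guide_text": "Type a keyword here and press Enter to search.",
        }
    ],
}

REQUIRED_GUIDE_FIELDS = {"xpath", "intent", "action_type", "guide_text"}

def is_valid(ann: dict) -> bool:
    """Check the two-stage structure: page-level flag, then per-element guides."""
    if not isinstance(ann.get("needs_guidance"), bool):
        return False
    guides = ann.get("guides", [])
    # A page flagged as needing guidance must carry at least one guide.
    if ann["needs_guidance"] and not guides:
        return False
    return all(REQUIRED_GUIDE_FIELDS <= set(g) for g in guides)
```

The validator mirrors the selectivity requirement: a positive page flag without any per-element guides is rejected as inconsistent.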
2. System Architecture and Processing Pipeline
The GuideWeb Agent is based on a lightweight Transformer model fully fine-tuned on the custom-constructed GuideWeb dataset. Its inference pipeline comprises two principal modules:
a. Shorter Preprocessing
The "Shorter" module aggressively reduces input length in real-world HTML without sacrificing relevant content. It strips large non-informative blocks, prunes deep noninteractive subtrees, compresses headers, and retains a compact representation containing at most 2,000 interactive candidate elements with local context. This empirically reduces token length by about 60% and is essential for tractable processing without model degradation.
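The pruning idea can be sketched with a few regular-expression passes. This is an illustrative approximation, not the actual Shorter implementation (which also compresses headers, prunes deep non-interactive subtrees, and retains local context around each candidate):

```python
import re

# Sketch of "Shorter"-style pruning: strip non-informative blocks, then
# keep only interactive candidate elements, capped at a fixed budget.
INTERACTIVE = ("a", "button", "input", "select", "textarea", "form", "label")

def shorten(html: str, max_elements: int = 2000) -> str:
    # 1. Strip large non-informative blocks: scripts, styles, comments.
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # 2. Keep only opening tags of interactive candidates, with attributes.
    pattern = r"<(?:%s)\b[^>]*>" % "|".join(INTERACTIVE)
    tags = re.findall(pattern, html, flags=re.I)[:max_elements]
    # 3. Return a compact, newline-separated representation.
    return "\n".join(tags)
```

Even this crude pass shrinks typical pages dramatically, because scripts, styles, and decorative markup dominate raw HTML length.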
b. Unified JSON Prompt and Structured Output
The processed HTML-like sequence is concatenated with an instruction template that enforces the JSON output schema. During inference, in a single encoder-decoder pass:
- The model predicts the page-level guidance flag.
- If guidance is required, it enumerates the XPaths of the selected elements.
- For each element, it emits the intent, action type, and guide text, then closes the JSON structure.
All model parameters are updated on the GuideWeb train split for three epochs (NVIDIA GB10 GPU, mixed precision). The agent's design enforces both the input-data format and the JSON output schema needed for real DAP interoperability.
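A minimal sketch of this prompt-and-parse loop follows. The instruction wording is hypothetical; the source only specifies that the template enforces a JSON output schema:

```python
import json

# Hypothetical instruction template: the exact wording is an assumption;
# only its schema-enforcing role is described in the source.
TEMPLATE = (
    "You are given a pruned HTML snapshot of a web page.\n"
    'Return JSON: {"needs_guidance": bool, "guides": [{"xpath": str, '
    '"intent": str, "action_type": str, "guide_text": str}]}\n'
    "HTML:\n{html}"
)

def build_prompt(shortened_html: str) -> str:
    # str.replace avoids clashing with the literal braces in the schema text.
    return TEMPLATE.replace("{html}", shortened_html)

def parse_output(raw: str) -> dict:
    """Parse the model's single-pass JSON emission, failing closed."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return {"needs_guidance": False, "guides": []}
    out.setdefault("guides", [])
    return out
```

Failing closed on malformed output (no guidance, no guides) keeps downstream overlay attachment safe when decoding goes wrong.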
3. Evaluation Metrics and Formal Definitions
Performance assessment in GuideWeb comprises three principal metrics:
- Guide-target F1 (Accuracy): Each interactive element instance is scored as a binary guide/no-guide prediction, with precision P = TP / (TP + FP), recall R = TP / (TP + FN), and F1 = 2PR / (P + R); F1 over the positive ("guide") class is reported as the main metric (Agent: 30.79%).
- BLEU (Intent/Guide-Text): Corpus-level BLEU up to 4-grams with uniform weights w_n = 1/4, using modified (clipped) n-gram precisions p_n and brevity penalty BP: BLEU = BP · exp(Σ_{n=1}^{4} w_n log p_n). Agent scores: BLEU_intent 44.94, BLEU_guide 21.34.
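Corpus-level BLEU of this form can be computed in plain Python. The sketch below assumes one reference per candidate and omits the smoothing variants that real toolkits apply:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(references, candidates, max_n=4):
    """Corpus BLEU with clipped n-gram precisions and brevity penalty.

    references, candidates: parallel lists of token lists (one reference each).
    """
    clipped = [0] * max_n   # clipped n-gram matches, per order n
    total = [0] * max_n     # candidate n-gram counts, per order n
    ref_len = cand_len = 0
    for ref, cand in zip(references, candidates):
        ref_len += len(ref)
        cand_len += len(cand)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            clipped[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in cand_counts.items())
            total[n - 1] += max(len(cand) - n + 1, 0)
    # Unsmoothed BLEU is 0 whenever any n-gram precision is 0.
    if min(clipped) == 0 or min(total) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / max(cand_len, 1))
    return bp * math.exp(log_prec)
```

With identical candidate and reference, every clipped precision is 1 and the brevity penalty is 1, so the score is exactly 1.0.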
- ROUGE-L and exact-match F1: Supplementary metrics for coverage and structure (ROUGE-L for fluency/recall, F1 for grounding fields).
These standardized metrics support reproducible, fine-grained comparison with baselines and among subcomponents.
4. Experimental Setup and Baselines
- Data Construction: 1,000 domains sampled from the Cisco Umbrella Top-1M list, with crawling and filtering to ensure high interactivity. 996 pages were annotated via an LLM pipeline with human verification; 98.4% of pages require at least one guide (mean: 3.09 guides/page, capped at 5). Dataset split: 75% train, 25% test.
- Baselines: Zero-shot predictions from five major LLMs (OpenAI GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Qwen3-8B, LLaMA3.1-8B), under a unified prompt with context windows up to 130K tokens, and no further tuning.
- Ablation: Input reduction effect measured by disabling/enabling "Shorter" in Qwen3-8B, confirming ~4-point F1 improvement via input pruning.
This setup places the agent in a competitive context and isolates the contribution of each component.
5. Empirical Results and Error Analysis
Guide Target Selection:
- GuideWeb Agent achieves 30.79% F1 (≈ accuracy), outperforming baselines (4.42–25.16%), which systematically over-predict and suffer from low precision (<16%) despite higher recall (up to ~59%).
- Agent’s precision/recall: 29.99%/31.64%, demonstrating a balanced trade-off.
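As a quick sanity check, the reported precision/recall pair reproduces the headline F1 to two decimals:

```python
# Precision and recall as reported above; F1 is their harmonic mean.
precision, recall = 0.2999, 0.3164

f1 = 2 * precision * recall / (precision + recall)
print(round(100 * f1, 2))  # prints 30.79
```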
Text Generation:
- BLEU_intent: 44.94, ROUGE-L: 52.89 (intent); BLEU_guide: 21.34, ROUGE-L: 28.44 (guide text), all significantly exceeding baseline scores (e.g., BLEU_guide <5).
- Substantial absolute gains (∼30 BLEU points for intent).
Shorter Mechanism:
- Enabling input reduction raises recall from 12.33% to 29.67% and improves F1 by ~4 points.
Qualitative Failure Analysis:
- Common errors: over-annotating visually obvious controls (e.g., "Submit"), misidentifying decorative icons, or producing generic/unhelpful guide texts.
6. Limitations and Future Directions
Current limitations:
- Restriction to single-page, snapshot-based guidance—multi-step flows and dynamic, stateful interactions are out of scope.
- Dataset scale remains moderate (996 annotated pages).
- Supervised offline training regime—agent does not adapt to real-time user feedback or evolving web layouts at inference.
Proposed future work:
- Extension to multi-page, task-oriented guidance and stateful workflows.
- Integration of interactive user feedback for continual learning and adaptation.
- Online benchmarking within deployed DAP tools.
7. Broader Context and Related Systems
GuideWeb Agent establishes a foundational baseline for automatic guide generation in DAP scenarios, directly addressing the need for scalable, maintainable in-app guidance as website layouts evolve. Its joint selection and generation formalization, efficient input reduction mechanisms, and structured output schema collectively define a reproducible framework for future system development and comparative benchmarking. Further advances, particularly in grounded intent inference, context adaptation, and interactive feedback incorporation, remain critical for practical deployment and reliability in production digital adoption settings (Gan et al., 2 Feb 2026).