
HANSEL: Extracting Breadcrumbs from Web Agent Trajectories for Interactive Verification

Published 17 Jun 2026 in cs.HC | (2606.18671v1)

Abstract: AI web agents can perform complex, multi-step tasks such as searching for products, comparing options, and making purchases on behalf of users. However, verifying the correctness of an agent's output remains difficult. Existing transparency mechanisms, including full trajectory logs, source links, screenshots, and LLM-generated summaries, treat verification as a passive reading task, leaving users to sift through overwhelming logs or trust potentially unfaithful explanations. We present HANSEL (Highlighting Agent Navigation Steps as Evidence Links), a system that extracts interactive, verifiable evidence from web-agent trajectories. Given an agent trajectory, HANSEL extracts evidence pages and snippets and presents them as navigable, interactive views with relevant page state preserved (e.g., applied filters, search queries, and scroll positions), enabling users to verify how the agent arrived at its answer. When the agent's answer cannot be traced to any visited page, HANSEL explicitly flags this gap. A technical evaluation on 45 tasks from AssistantBench and Online-Mind2Web shows that HANSEL achieves 83.7% precision and 88.8% recall in identifying evidence pages, while reducing trajectory volume by 61.6%. In a controlled user study with 14 participants, HANSEL significantly reduced task completion time and perceived effort compared to a standard agent interface, while participants rated it significantly higher on usability, verification ease, and error identification. Our results demonstrate that reframing verification as an interactive activity, rather than passive consumption of agent explanations, leads to more efficient human oversight of AI agents.

Authors (2)

Summary

  • The paper presents a novel interactive system that extracts minimal and sufficient evidence from web agent trajectories to enhance verification.
  • It reconstructs web pages with precise state, allowing users to interactively verify agent reasoning and correct errors seamlessly.
  • A 14-participant user study shows significantly lower verification time and perceived effort, along with higher task accuracy (82.14% vs. 75.0%) and higher usability ratings on complex web tasks.

HANSEL: Interactive Evidence Extraction for Web Agent Trajectory Verification

Motivation and Limitations of Current Transparency Approaches

The rapid progress in LLM-based web agents has enabled delegation of complex, multi-step web tasks such as product search, option comparison, and transaction execution. However, verifying agent outputs remains challenging: web agents frequently ground their actions poorly, lose track of task objectives, or miss critical details, and can reach erroneous conclusions even when their reasoning traces look plausible. Existing transparency mechanisms (full trajectory logs, source links, screenshots, and LLM-generated summaries) are fundamentally limited: they require users to passively read overwhelming amounts of information or accept potentially unfaithful explanations, and they often fail to surface the page state or evidence needed to check an answer.

Consequently, effective human oversight is hampered by information overload and insufficient evidence traceability. Detailed logs are indiscriminately verbose, source links omit page state, screenshots are static, and LLM summaries can rationalize incorrect decisions as convincingly as correct ones. The verification burden falls on users, who are left without a way to interactively scrutinize or revise the agent's result.

HANSEL System Design and Evidence Extraction Pipeline

HANSEL (Highlighting Agent Navigation Steps as Evidence Links) reframes verification as an interactive process by extracting minimal, actionable evidence from web agent trajectories and presenting it as live, navigable views. The system pursues three primary design goals:

  • Minimal Sufficient Evidence Exposure: Only pages directly supporting the agent’s answer are surfaced, reducing cognitive load and information volume.
  • Direct Evidence Access with Page State Reconstruction: Evidence pages are reconstructed with preserved environmental state (applied filters, search queries, scroll positions) and embedded as interactive webviews, permitting users to probe, modify, and verify in situ.
  • Evidence Gap Visibility: HANSEL flags unverifiable claims or reasoning steps, facilitating error awareness and intervention.
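The third goal, evidence gap visibility, can be illustrated with a minimal sketch. It assumes the agent's answer has already been split into short claims and that evidence snippets are plain strings; the `EvidenceSnippet` type, the `find_evidence_gaps` helper, and the case-insensitive substring match are all hypothetical stand-ins for illustration, not HANSEL's actual matching logic.

```python
from dataclasses import dataclass


@dataclass
class EvidenceSnippet:
    page_url: str  # hypothetical field: page the snippet was taken from
    text: str      # hypothetical field: snippet content


def find_evidence_gaps(claims: list[str], snippets: list[EvidenceSnippet]) -> list[str]:
    """Return the claims that no snippet supports.

    Support is approximated by a case-insensitive substring match,
    which is only a placeholder for a real matching method.
    """
    gaps = []
    for claim in claims:
        needle = claim.lower()
        supported = any(needle in s.text.lower() for s in snippets)
        if not supported:
            gaps.append(claim)
    return gaps


snippets = [
    EvidenceSnippet("https://example.com/item/1", "Trail shoe, price $89.99, in stock"),
]
claims = ["$89.99", "free shipping"]
print(find_evidence_gaps(claims, snippets))  # ['free shipping']
```

Here the price can be traced to a visited page, while the shipping claim cannot and would be flagged for the user.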

The extraction pipeline standardizes heterogeneous agent logs by decomposing them into an ordered sequence of (observation, reasoning, action) tuples. An LLM is then prompted with the user query, agent answer, and standardized trajectory to identify evidence pages and corresponding evidence snippets. Abandoned or unsuccessful plans, blocked pages, and navigation-only steps are excluded. Evidence snippets localize task-relevant content within each evidence page, supporting granular verification.

Figure 1: Overview of HANSEL’s extraction and presentation of evidence pages with preserved state and highlighted snippets for interactive verification.
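The standardization and prompting steps of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: the raw log keys (`url`, `thought`, `action`), the `Step` type, and the prompt wording are hypothetical, and the LLM call itself is omitted because the summary does not specify a model or API.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One standardized trajectory step."""
    observation: str
    reasoning: str
    action: str


def standardize(raw_log: list[dict]) -> list[Step]:
    """Map a raw agent log onto ordered (observation, reasoning, action) tuples.

    The keys "url", "thought", and "action" are hypothetical; real agent
    logs differ, which is why a standardization step is needed.
    """
    return [
        Step(
            observation=entry.get("url", ""),
            reasoning=entry.get("thought", ""),
            action=entry.get("action", ""),
        )
        for entry in raw_log
    ]


def build_extraction_prompt(query: str, answer: str, steps: list[Step]) -> str:
    """Assemble the three inputs the LLM receives; the wording is illustrative."""
    lines = [f"User query: {query}", f"Agent answer: {answer}", "Trajectory:"]
    for i, step in enumerate(steps, start=1):
        lines.append(
            f"{i}. observation={step.observation} | "
            f"reasoning={step.reasoning} | action={step.action}"
        )
    lines.append(
        "List the steps whose pages support the answer, with a snippet for each. "
        "Skip abandoned plans, blocked pages, and navigation-only steps."
    )
    return "\n".join(lines)


raw_log = [
    {"url": "https://example.com/search?q=tent", "thought": "Search for tents", "action": "type 'tent'"},
    {"url": "https://example.com/item/7", "thought": "Price is listed here", "action": "read price"},
]
steps = standardize(raw_log)
prompt = build_extraction_prompt("Cheapest 2-person tent?", "$120", steps)
print(prompt)
```

Keeping the trajectory in one fixed shape means the same prompt template can be reused across agents whose native log formats differ.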

Empirical Trajectory Analysis

Manual annotation of 45 tasks from the AssistantBench and Online-Mind2Web benchmarks reveals acute information overload: agents execute an average of 13.16 steps and visit 6.02 pages per task, but only 3.33 steps (25.34%) and 2.18 pages (36.16%) directly contribute to answers. In 62.22% of cases, fewer than three pages suffice for verification. The analysis indicates that a small set of evidence pages is typically sufficient for verification, which motivates surfacing only those pages.

Technical Evaluation: Evidence Extraction Quality

On the 45 annotated tasks, HANSEL's extraction module achieves:

  • Evidence Page Extraction F1: 0.861 (Precision: 0.837, Recall: 0.888)
  • Snippet-level Precision: 0.887

The system reduces information volume by 61.6%, distilling 271 trajectory pages to 104 evidence pages, enabling users to focus only on salient reasoning steps and evidence.
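The relationship between these figures can be checked with a short calculation. The sketch below uses the standard harmonic-mean definition of F1 and a simple page-count ratio for volume reduction; the function names are illustrative. Recomputing F1 from the rounded precision and recall gives roughly 0.862, close to the reported 0.861. The small difference presumably comes from rounding or per-task averaging, which the summary does not specify.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def volume_reduction(total_pages: int, evidence_pages: int) -> float:
    """Fraction of trajectory pages the user no longer has to inspect."""
    return 1 - evidence_pages / total_pages


f1 = f1_score(0.837, 0.888)
reduction = volume_reduction(271, 104)
print(f"F1 = {f1:.4f}")                # F1 = 0.8617
print(f"reduction = {reduction:.1%}")  # reduction = 61.6%
```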

Interactive Evidence Interface and Reasoning Visualization

HANSEL’s interface embeds evidence pages as live webviews, reconstructing the agent’s navigational context via action replay and highlighting extracted snippets. Two complementary layouts are provided:

  • Grid View: Exposes all evidence pages in timestamp order, supporting holistic reasoning verification and rapid error detection.
  • Carousel View: Enables sequential inspection with page-level summaries and detailed descriptions.

Key agent actions—such as filter applications, sort changes, and constraint settings—are visually surfaced, allowing users to interact directly with evidence and correct minor agent errors without external navigation.
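Reconstructing navigational context via action replay can be pictured as folding the agent's recorded actions into a page state. The sketch below is a simplified model under stated assumptions: the `PageState` fields mirror the state named in the abstract (search queries, applied filters, scroll positions), while the action encoding as `(kind, value)` pairs and the `replay` helper are hypothetical. A real system would drive a browser rather than update a record.

```python
from dataclasses import dataclass, field


@dataclass
class PageState:
    url: str
    query: str = ""
    filters: dict[str, str] = field(default_factory=dict)
    scroll_y: int = 0


def replay(url: str, actions: list[tuple[str, str]]) -> PageState:
    """Apply recorded actions in order to rebuild the state the agent saw."""
    state = PageState(url=url)
    for kind, value in actions:
        if kind == "search":
            state.query = value
        elif kind == "filter":
            # Hypothetical encoding: "name=setting"
            name, _, setting = value.partition("=")
            state.filters[name] = setting
        elif kind == "scroll":
            state.scroll_y = int(value)
        # Unknown action kinds are ignored in this sketch.
    return state


actions = [
    ("search", "running shoes"),
    ("filter", "size=10"),
    ("filter", "sort=price_asc"),
    ("scroll", "1200"),
]
state = replay("https://example.com/shop", actions)
print(state)
```

Because later actions overwrite earlier ones of the same kind, the final state reflects what was on screen when the agent read its evidence, which is the view a user needs in order to check it.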

User Study: Impact on Verification Efficiency and Perceived Usability

A controlled user study with 14 participants demonstrates that HANSEL substantially improves verification efficiency and subjective experience compared to a standard conversational agent interface:

  • Task Completion Accuracy: 82.14% (HANSEL) vs. 75.0% (baseline)
  • Completion Time (correct answers only): 130.4s (HANSEL) vs. 167.7s (baseline); the reduction is statistically significant (p < .001)
  • Perceived Effort (Likert scale): Median 2.89 (HANSEL) vs. 4.39 (baseline); significant reduction (p = .012)
  • Unique Pages Visited per Task: HANSEL users consistently browsed fewer pages, relying on surfaced evidence rather than reconstructing agent trajectories manually.

Figure 2: Task completion time per interface, showing HANSEL’s significant reduction in verification duration for correct answers.

Figure 3: Unique pages visited during tasks, evidencing HANSEL’s reduction in verification navigation overhead.

All participants rated HANSEL higher on usability, verification ease, error identification, understanding agent reasoning, and correction capabilities, and strongly preferred it over the baseline for post-hoc agent oversight.

Figure 4: Post-study survey responses comparing HANSEL and the baseline, highlighting marked preference and perceived utility of HANSEL’s interactive features.

Qualitative Findings and Theoretical Implications

The study reveals divergent verification strategies: HANSEL users primarily engage with evidence pages directly, using highlights and state reconstruction to identify errors and revise responses efficiently, especially in unfamiliar domains. Baseline users often reconstruct agent workflows externally, increasing effort and the risk of missed errors. HANSEL’s interactive evidence presentation shifts the verification paradigm from passive consumption to active engagement, though overreliance on agent-surfaced evidence can induce miscalibrated trust if users do not critically scrutinize highlighted artifacts.

This finding aligns with existing literature on AI-assisted decision-making, demonstrating that evidence presentation substantially shapes user trust orientation and verification rigor. Interactive explanation interfaces, as instantiated in HANSEL, foster more effective oversight and error detection, but require further design to calibrate confidence and encourage critical engagement.

Limitations and Future Directions

HANSEL’s performance is contingent on trajectory quality and alignment with user mental models. Divergent agent strategies may surface suboptimal evidence, hindering verification. The study’s single-session design and platform-specific interaction constraints limit generalizability. Long-term adaptation, scalable deployment with robust page reconstruction, and adaptation for tasks lacking clear ground truths are open research challenges.

Potential future work includes:

  • Adaptive prompting to guide critical inspection of evidence, especially for claim types prone to agent overconfidence or error.
  • Automated uncertainty flagging and cognitive forcing functions to mitigate overreliance.
  • Extension of evidence artifacts beyond web tasks to preference-driven or open-ended scenarios, supporting iterative human-agent collaboration.

Conclusion

HANSEL introduces an interactive evidence extraction and verification paradigm for web agent oversight, operationalizing minimal, actionable evidence from agent trajectories and presenting it as live, navigable, and highlight-augmented views. The system achieves high evidence extraction quality and substantially reduces verification time and effort. These results substantiate that interactive verification interfaces enable more efficient, accurate, and usable agent output verification, motivating adoption and further research into robust, user-centered agent transparency mechanisms for real-world applications of LLM-based web agents (2606.18671).
