Generalizability of findings across websites and domains

Determine whether LLM-based web agents show the same dependence of task success on observation representation (HTML versus accessibility tree), model capability, and thinking token budget on websites and domains other than the ServiceNow-based WorkArena L1 benchmark.

Background

The study evaluates LLM-based web agents exclusively on WorkArena L1, a benchmark built on the ServiceNow platform, to analyze how observation representation (HTML vs. accessibility tree), model capability, and thinking token budget affect task success.

Because the experiments are confined to a single real-world site and domain, it is uncertain whether the observed performance patterns and design recommendations would hold for other web ecosystems or different application domains.
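One way to frame a generalization check is as a factorial grid over sites and the three factors, followed by a test of whether the ranking of conditions is the same on every site. The sketch below is illustrative only and is not the study's evaluation code. The site names other than WorkArena L1, the model labels, and the thinking budget values are hypothetical placeholders, and `build_grid` and `pattern_holds` are helper names introduced here.

```python
from itertools import product
from typing import Dict, Hashable, List, NamedTuple


class Condition(NamedTuple):
    site: str
    observation: str
    model: str
    thinking_budget: int


# All values below except "workarena_l1" are hypothetical placeholders.
SITES = ["workarena_l1", "other_site_a", "other_site_b"]
OBSERVATIONS = ["html", "accessibility_tree"]
MODELS = ["weaker_model", "stronger_model"]
THINKING_BUDGETS = [0, 1024, 8192]  # assumed token budgets, not from the paper


def build_grid(sites=SITES, observations=OBSERVATIONS,
               models=MODELS, budgets=THINKING_BUDGETS) -> List[Condition]:
    """Enumerate every (site, observation, model, budget) combination."""
    return [Condition(*combo)
            for combo in product(sites, observations, models, budgets)]


def ranking(scores: Dict[Hashable, float]) -> List[Hashable]:
    """Order condition labels from highest to lowest success rate."""
    return sorted(scores, key=lambda label: scores[label], reverse=True)


def pattern_holds(results_by_site: Dict[str, Dict[Hashable, float]]) -> bool:
    """Return True if every site ranks the conditions in the same order.

    Ties are broken by insertion order, so near-equal scores should be
    compared with a proper statistical test rather than this check.
    """
    rankings = [ranking(scores) for scores in results_by_site.values()]
    return all(r == rankings[0] for r in rankings)


if __name__ == "__main__":
    grid = build_grid()
    print(f"{len(grid)} conditions to evaluate")
```

Comparing rankings rather than raw success rates is a deliberate simplification: absolute success rates will differ between sites, so the question of interest is whether the direction of each effect is preserved.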

References

Nevertheless, whether the findings generalize to other websites and domains remains unverified.

Read More, Think More: Revisiting Observation Reduction for Web Agents (2604.01535 - Enomoto et al., 2 Apr 2026) in Limitation, Section: Generalizability across websites and domains