Generalizability of findings across websites and domains
Determine whether the dependence of task success on observation representation (HTML versus accessibility tree), model capability, and thinking-token budget, which LLM-based web agents exhibit on the ServiceNow-based WorkArena L1 benchmark, also holds on other websites and domains.
References
Nevertheless, whether the findings generalize to other websites and domains remains unverified.
— Read More, Think More: Revisiting Observation Reduction for Web Agents
(2604.01535 - Enomoto et al., 2 Apr 2026) in Limitation, Section: Generalizability across websites and domains