This paper introduces "Embodied Web Agents," a new paradigm for AI systems designed to integrate physical embodiment with web-scale knowledge access. The core problem addressed is the current siloing of AI agents: some excel at digital information retrieval and reasoning (web agents), while others interact with the physical world (embodied agents), but rarely do both effectively. This separation limits their ability to perform tasks requiring integrated physical and digital intelligence, such as cooking from an online recipe, navigating using real-time map data, or researching a physical landmark online.
To operationalize this concept, the authors make several key contributions:
- Embodied Web Agents Task Environments: They developed a unified simulation platform that combines:
- Indoor environments: Using AI2-THOR for realistic 3D kitchen scenes where agents can manipulate objects (e.g., slice, cook) based on online recipes. Object states (e.g., `isCooked`) are tracked.
- Outdoor environments: Leveraging Google Street View and Google Earth APIs for real-world street-level navigation in cities such as New York, Boston, Philadelphia, and Pittsburgh. This environment is represented as a navigation graph whose nodes are GPS coordinates with associated visual observations.
- Web environments: A set of functional websites built with React.js (frontend) and FastAPI (backend). These include a homepage, a custom recipe website, a custom online shopping site, and adapted versions of OpenStreetMap and Wikipedia from the WebArena benchmark.
The environments are formalized as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T})$, where $\mathcal{S}$ is the combined physical-digital state space, $\mathcal{A}$ is the action space spanning both domains (e.g., `MoveAhead` in the physical environment, `click [id]` on the web, and `switch_environment` to move between them), $\mathcal{O}$ is the observation space (embodied observations $\mathcal{O}_E$ and web observations $\mathcal{O}_W$), and $\mathcal{T}$ is the transition function.
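To make the formalization concrete, here is a minimal sketch of what the combined environment interface could look like, assuming a Gym-style `step`/`observe` loop. The class and method names (`EmbodiedWebEnv`, `Observation`, `observe`) are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Observation:
    """Joint observation o = (o_embodied, o_web)."""
    embodied: Any  # e.g., first-person frame or scene graph (AI2-THOR / Street View)
    web: Any       # e.g., accessibility tree of the currently open page


class EmbodiedWebEnv:
    """Illustrative wrapper over the combined physical-digital state space.

    One action string is issued per step; it is routed to whichever domain is
    active, and `switch_environment [msg]` toggles between the two.
    """

    def __init__(self, physical_env, web_env):
        self.physical_env = physical_env
        self.web_env = web_env
        self.active = "physical"  # domain that receives the next action

    def step(self, action: str) -> Observation:
        if action.startswith("switch_environment"):
            self.active = "web" if self.active == "physical" else "physical"
        elif self.active == "physical":
            self.physical_env.step(action)  # e.g., MoveAhead, SliceObject [obj]
        else:
            self.web_env.step(action)       # e.g., click [id], goto [url]
        # The transition function yields the next joint observation.
        return Observation(
            embodied=self.physical_env.observe(),
            web=self.web_env.observe(),
        )
```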
Embodied Web Agents Benchmark: Building on this platform, they constructed a benchmark with approximately 1.5k tasks across five domains:
- Cooking (911 tasks): Agents match physical ingredients in AI2-THOR with online recipes from the custom website, potentially shopping online for missing items. Recipes are refined by Claude to be executable in AI2-THOR, and confounders (e.g., different recipes with the same name but varying difficulty or ingredients) are introduced to increase complexity.
- Navigation (144 tasks): Agents use OpenStreetMap to get directions and then navigate in the outdoor environment. Start/end points are generated with GPT-4o-mini, and ground-truth paths are computed with Dijkstra's algorithm over the navigation graph (see the sketch after this list).
- Shopping (216 tasks): Agents compare prices/locations on the custom shopping website, place orders, and navigate to physical store locations (simulated in Manhattan) for pickup using OpenStreetMap.
- Traveling (110 tasks): Agents navigate to landmarks in the outdoor environment and use Wikipedia to gather information about them, requiring grounding of web descriptions to physical observations.
- Geolocation (142 tasks): Agents explore the outdoor environment and use web search (Wikipedia) to determine their current geographic coordinates, moving beyond single-image prediction.
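For reference, the ground-truth routes in the navigation domain can be reproduced with a standard Dijkstra search over the outdoor graph. The sketch below is a generic implementation under the assumption that nodes carry GPS coordinates and edges are weighted by haversine distance; it is not the benchmark's actual code.

```python
import heapq
from math import asin, cos, radians, sin, sqrt


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs."""
    (lat1, lon1), (lat2, lon2) = [(radians(lat), radians(lon)) for lat, lon in (a, b)]
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def shortest_path(graph, coords, start, goal):
    """Dijkstra over a navigation graph.

    graph:  {node_id: [neighbor_id, ...]}  adjacency list of panorama nodes
    coords: {node_id: (lat, lon)}          GPS coordinate of each node
    Returns the list of node ids on the shortest path, or None if unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:  # reconstruct the path by walking predecessors
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nb in graph[node]:
            nd = d + haversine_m(coords[node], coords[nb])
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                prev[nb] = node
                heapq.heappush(heap, (nd, nb))
    return None
```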
Implementation Details and Action Spaces:
The paper provides a table of actions available to the agent across the different environments:
| Environment | Action Category | Action Examples |
|---|---|---|
| Indoor (AI2-THOR) | Agent Movement | `Teleport [obj]`, `MoveAhead/Back/Left/Right` |
| | Object Interaction | `PickupObject` / `PutObject [obj]` |
| | Object State Changes | `OpenObject` / `CloseObject [obj]`, `SliceObject [obj]`, `CookObject [obj]` |
| | Environment Switching | `switch_environment [msg]` |
| Outdoor (Google API) | Movement | `Forward` / `Left` / `Right` |
| Web (Custom/Adapted) | Page Operation | `click [id]`, `type [id] [content] [pr]`, `scroll [direction]`, `hover [id]`, `press [key_comb]` |
| | Tab/URL Navigation | `new_tab`, `close_tab`, `tab_focus`, `goto [url]`, `go_back` / `go_forward` |
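Because every action is emitted as a plain string in the formats above, an evaluation harness has to parse and validate it before dispatching to the right environment. The snippet below is a hypothetical parser for a handful of those formats; the regular expressions and the `parse_action` helper are assumptions for illustration, not the benchmark's code.

```python
import re

# Regexes for a few of the string-formatted actions listed above (illustrative only).
ACTION_PATTERNS = {
    "click":              re.compile(r"^click \[(?P<id>\w+)\]$"),
    "type":               re.compile(r"^type \[(?P<id>\w+)\] \[(?P<content>.*)\] \[(?P<pr>[01])\]$"),
    "goto":               re.compile(r"^goto \[(?P<url>\S+)\]$"),
    "switch_environment": re.compile(r"^switch_environment \[(?P<msg>.*)\]$"),
    "SliceObject":        re.compile(r"^SliceObject \[(?P<obj>.+)\]$"),
}


def parse_action(raw: str):
    """Return (action_name, arguments) for a raw action string, or None if malformed."""
    raw = raw.strip()
    for name, pattern in ACTION_PATTERNS.items():
        match = pattern.match(raw)
        if match:
            return name, match.groupdict()
    # Bare actions with no arguments, e.g. MoveAhead, Forward, new_tab, go_back.
    if re.fullmatch(r"[A-Za-z_/]+", raw):
        return raw, {}
    return None


# Example: parse_action("type [23] [chicken soup] [1]")
# -> ("type", {"id": "23", "content": "chicken soup", "pr": "1"})
```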
Experiments and Results:
The benchmark was tested with state-of-the-art LLM agents: GPT-4o, Gemini 2.0 Flash, Qwen-VL-Plus, and InternVL2.5-latest. Human performance was also measured as a baseline.
Key findings include:
- Significant Performance Gaps: Current LLM agents perform substantially worse than humans. For instance, in cooking, the strongest agent (text-based GPT-4o) achieved only 6.4% overall accuracy, compared to 77.08% for humans. In outdoor navigation, GPT-4o achieved 34.72% accuracy versus 90.28% for humans.
- Web vs. Embodied Performance: Models generally perform better on web-only sub-tasks than embodied-only sub-tasks, indicating stronger digital reasoning than physical interaction and grounding. For example, in cooking, GPT-4o (text) had 57.08% web accuracy but only 10.5% embodied accuracy.
- Cross-Domain Challenges: Error analysis, particularly for cooking tasks with GPT-4o, revealed that cross-domain errors (66.6%) are the dominant cause of failures. These include agents getting "trapped" in one domain (e.g., repeatedly acting in the physical world without consulting the web, or vice-versa) and misalignment between web instructions and embodied actions.
- Task Complexity: More complex tasks like shopping and traveling, which involve longer interaction sequences and richer cross-domain interplay, showed lower overall accuracies than relatively simpler navigation.
- Geolocation Improvement: Embodied web agents capable of active exploration and web querying significantly outperformed passive VLM baselines (such as FairLocator) on geolocation tasks, highlighting the benefit of integrated intelligence (a sketch of such an explore-then-query loop follows this list).
- Modality in Cooking: For cooking, text-based agents (using scene graphs from AI2-THOR) outperformed vision-based agents (using first-person screenshots), suggesting challenges in visual grounding for current models in complex interactive tasks.
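As a rough illustration of the geolocation finding, the loop below sketches how an agent that explores and queries the web differs from single-image prediction. Every helper here (`env`, `vlm`, `web_search`, the confidence threshold) is a hypothetical stand-in rather than the paper's pipeline.

```python
def geolocate(env, vlm, web_search, max_steps: int = 20):
    """Hypothetical active-geolocation loop: move, collect clues, query the web,
    and stop once the coordinate estimate is confident."""
    clues = []
    guess = None
    for _ in range(max_steps):
        view = env.observe()                    # current street-level panorama
        clues.append(vlm.describe(view))        # e.g., "street sign: 'Walnut St'"
        evidence = web_search(" ".join(clues))  # e.g., Wikipedia lookup of spotted landmarks
        guess, confidence = vlm.estimate_coordinates(clues, evidence)
        if confidence > 0.9:                    # stop early when the estimate stabilizes
            return guess
        env.step("Forward")                     # otherwise keep exploring the graph
    return guess
```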
Practical Implications:
The research highlights that building truly integrated AI agents requires more than just combining existing web and embodied systems. The key challenges lie at the intersection of these domains.
- Perceptual Grounding: Linking abstract digital instructions (e.g., recipe steps) to high-dimensional physical world observations (e.g., visual state of food) is crucial.
- Cross-Domain Planning: Agents need to intelligently decide when to switch between physical actions and digital information retrieval, especially when information from one domain might contradict or supplement the other.
- Coherent Representation: Maintaining a persistent representation that bridges physical and digital contexts is necessary.
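As a toy illustration of these three requirements, the loop below keeps one persistent memory across domains and lets the model decide when to issue `switch_environment`. It assumes an interface like the earlier `EmbodiedWebEnv` sketch, and `llm.generate` is a hypothetical call, not a method proposed in the paper.

```python
def run_task(task: str, env, llm, initial_obs, max_steps: int = 50):
    """Toy cross-domain control loop with a single persistent memory.

    Facts from both domains (recipe steps read on the web, object states seen
    in the kitchen) accumulate in the same memory, and the LLM picks the next
    action -- including `switch_environment [msg]` -- from that shared context.
    """
    memory = [f"Task: {task}"]
    obs = initial_obs
    for _ in range(max_steps):
        prompt = "\n".join(memory + [
            f"Active domain: {env.active}",
            f"Embodied observation: {obs.embodied}",
            f"Web observation: {obs.web}",
            "Next action (physical, web, switch_environment [msg], or stop):",
        ])
        action = llm.generate(prompt).strip()  # hypothetical LLM call
        if action == "stop":
            break
        obs = env.step(action)
        memory.append(f"Did: {action}")        # persistent record bridging both domains
    return memory
```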
The publicly available datasets, code, and environments aim to foster research in this direction. The benchmark provides a systematic way to assess and drive progress in AI systems that can fluidly operate across physical and digital realms. The findings suggest that future work should focus on improving cross-domain reasoning, planning, and grounding capabilities of AI agents. A limitation acknowledged is the reliance on simulated environments, which may not fully capture real-world complexities.