This paper introduces "Embodied Web Agents," a new paradigm for AI systems designed to integrate physical embodiment with web-scale knowledge access. The core problem addressed is the current siloing of AI agents: some excel at digital information retrieval and reasoning (web agents), while others interact with the physical world (embodied agents), but rarely do both effectively. This separation limits their ability to perform tasks requiring integrated physical and digital intelligence, such as cooking from an online recipe, navigating using real-time map data, or researching a physical landmark online.
To operationalize this concept, the authors make several key contributions:
- Embodied Web Agents Task Environments: They developed a unified simulation platform that combines:
- Indoor environments: Using AI2-THOR for realistic 3D kitchen scenes where agents can manipulate objects (e.g., slice, cook) based on online recipes. Object states (e.g., `isCooked`) are tracked.
- Outdoor environments: Leveraging Google Street View and Google Earth APIs for real-world street-level navigation in cities such as New York, Boston, Philadelphia, and Pittsburgh. This environment is represented as a navigation graph whose nodes are GPS coordinates with associated visual observations.
- Web environments: A set of functional websites built with React.js (frontend) and FastAPI (backend). These include a homepage, a custom recipe website, a custom online shopping site, and adapted versions of OpenStreetMap and Wikipedia from the WebArena benchmark.
The environments are formalized as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T})$, where $\mathcal{S}$ is the combined physical-digital state space, $\mathcal{A}$ is the action space spanning both domains (e.g., `MoveAhead` in the physical environment, `click [id]` on the web, and `switch_environment` to move between them), $\mathcal{O}$ is the observation space (embodied observations $\mathcal{O}_E$ and web observations $\mathcal{O}_W$), and $\mathcal{T}$ is the transition function.
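To make the formalization concrete, here is a minimal sketch of what the combined environment interface could look like, assuming a Gym-style `step`/`observe` loop. The class and method names (`EmbodiedWebEnv`, `Observation`, `observe`) are illustrative assumptions, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Observation:
    """Joint observation o = (o_embodied, o_web)."""
    embodied: Any  # e.g., first-person frame or scene graph (AI2-THOR / Street View)
    web: Any       # e.g., accessibility tree of the currently open page


class EmbodiedWebEnv:
    """Illustrative wrapper over the combined physical-digital state space.

    One action string is issued per step; it is routed to whichever domain is
    active, and `switch_environment [msg]` toggles between the two.
    """

    def __init__(self, physical_env, web_env):
        self.physical_env = physical_env
        self.web_env = web_env
        self.active = "physical"  # domain that receives the next action

    def step(self, action: str) -> Observation:
        if action.startswith("switch_environment"):
            self.active = "web" if self.active == "physical" else "physical"
        elif self.active == "physical":
            self.physical_env.step(action)  # e.g., MoveAhead, SliceObject [obj]
        else:
            self.web_env.step(action)       # e.g., click [id], goto [url]
        # The transition function yields the next joint observation.
        return Observation(
            embodied=self.physical_env.observe(),
            web=self.web_env.observe(),
        )
```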
Embodied Web Agents Benchmark: Building on this platform, they constructed a benchmark with approximately 1.5k tasks across five domains:
- Cooking (911 tasks): Agents match physical ingredients in AI2-THOR with online recipes from the custom website, potentially shopping online for missing items. Recipes are refined by Claude to be executable in AI2-THOR, and confounders (e.g., different recipes with the same name but varying difficulty or ingredients) are introduced to increase complexity.
- Navigation (144 tasks): Agents use OpenStreetMap to get directions and then navigate in the outdoor environment. Start/end points are generated with GPT-4o-mini, and ground-truth paths are computed with Dijkstra's algorithm over the navigation graph (see the sketch after this list).
- Shopping (216 tasks): Agents compare prices/locations on the custom shopping website, place orders, and navigate to physical store locations (simulated in Manhattan) for pickup using OpenStreetMap.
- Traveling (110 tasks): Agents navigate to landmarks in the outdoor environment and use Wikipedia to gather information about them, requiring grounding of web descriptions to physical observations.
- Geolocation (142 tasks): Agents explore the outdoor environment and use web search (Wikipedia) to determine their current geographic coordinates, moving beyond single-image prediction.
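For reference, the ground-truth routes in the navigation domain can be reproduced with a standard Dijkstra search over the outdoor graph. The sketch below is a generic implementation under the assumption that nodes carry GPS coordinates and edges are weighted by haversine distance; it is not the benchmark's actual code.

```python
import heapq
from math import asin, cos, radians, sin, sqrt


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs."""
    (lat1, lon1), (lat2, lon2) = [(radians(lat), radians(lon)) for lat, lon in (a, b)]
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def shortest_path(graph, coords, start, goal):
    """Dijkstra over a navigation graph.

    graph:  {node_id: [neighbor_id, ...]}  adjacency list of panorama nodes
    coords: {node_id: (lat, lon)}          GPS coordinate of each node
    Returns the list of node ids on the shortest path, or None if unreachable.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:  # reconstruct the path by walking predecessors
            path = [node]
            while node in prev:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nb in graph[node]:
            nd = d + haversine_m(coords[node], coords[nb])
            if nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                prev[nb] = node
                heapq.heappush(heap, (nd, nb))
    return None
```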
Implementation Details and Action Spaces:
The paper provides a table of actions available to the agent across the different environments:
| Environment | Action Category | Action Examples |
|---|---|---|
| Indoor (AI2-THOR) | Agent Movement | `Teleport [obj]`, `MoveAhead/Back/Left/Right` |
| | Object Interaction | `PickupObject` / `PutObject [obj]` |
| | Object State Changes | `OpenObject` / `CloseObject [obj]`, `SliceObject [obj]`, `CookObject [obj]` |
| | Environment Switching | `switch_environment [msg]` |
| Outdoor (Google API) | Movement | `Forward` / `Left` / `Right` |
| Web (Custom/Adapted) | Page Operation | `click [id]`, `type [id] [content] [pr]`, `scroll [direction]`, `hover [id]`, `press [key_comb]` |
| | Tab/URL Navigation | `new_tab`, `close_tab`, `tab_focus`, `goto [url]`, `go_back` / `go_forward` |
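Because every action is emitted as a plain string in the formats above, an evaluation harness has to parse and validate it before dispatching to the right environment. The snippet below is a hypothetical parser for a handful of those formats; the regular expressions and the `parse_action` helper are assumptions for illustration, not the benchmark's code.

```python
import re

# Regexes for a few of the string-formatted actions listed above (illustrative only).
ACTION_PATTERNS = {
    "click":              re.compile(r"^click \[(?P<id>\w+)\]$"),
    "type":               re.compile(r"^type \[(?P<id>\w+)\] \[(?P<content>.*)\] \[(?P<pr>[01])\]$"),
    "goto":               re.compile(r"^goto \[(?P<url>\S+)\]$"),
    "switch_environment": re.compile(r"^switch_environment \[(?P<msg>.*)\]$"),
    "SliceObject":        re.compile(r"^SliceObject \[(?P<obj>.+)\]$"),
}


def parse_action(raw: str):
    """Return (action_name, arguments) for a raw action string, or None if malformed."""
    raw = raw.strip()
    for name, pattern in ACTION_PATTERNS.items():
        match = pattern.match(raw)
        if match:
            return name, match.groupdict()
    # Bare actions with no arguments, e.g. MoveAhead, Forward, new_tab, go_back.
    if re.fullmatch(r"[A-Za-z_/]+", raw):
        return raw, {}
    return None


# Example: parse_action("type [23] [chicken soup] [1]")
# -> ("type", {"id": "23", "content": "chicken soup", "pr": "1"})
```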
Experiments and Results:
The benchmark was tested with state-of-the-art LLM agents: GPT-4o, Gemini 2.0 Flash, Qwen-VL-Plus, and InternVL2.5-latest. Human performance was also measured as a baseline.
Key findings include:
- Significant Performance Gaps: Current LLM agents perform substantially worse than humans. For instance, in cooking, the strongest agent (text-based GPT-4o) achieved only 6.4% overall accuracy, compared to 77.08% for humans. In outdoor navigation, GPT-4o achieved 34.72% accuracy versus 90.28% for humans.
- Web vs. Embodied Performance: Models generally perform better on web-only sub-tasks than embodied-only sub-tasks, indicating stronger digital reasoning than physical interaction and grounding. For example, in cooking, GPT-4o (text) had 57.08% web accuracy but only 10.5% embodied accuracy.
- Cross-Domain Challenges: Error analysis, particularly for cooking tasks with GPT-4o, revealed that cross-domain errors (66.6%) are the dominant cause of failures. These include agents getting "trapped" in one domain (e.g., repeatedly acting in the physical world without consulting the web, or vice-versa) and misalignment between web instructions and embodied actions.
- Task Complexity: More complex tasks like shopping and traveling, which involve longer interaction sequences and richer cross-domain interplay, showed lower overall accuracies than relatively simpler navigation.
- Geolocation Improvement: Embodied web agents capable of active exploration and web querying significantly outperformed passive VLM baselines (such as FairLocator) on geolocation tasks, highlighting the benefit of integrated intelligence (a sketch of such an explore-then-query loop follows this list).
- Modality in Cooking: For cooking, text-based agents (using scene graphs from AI2-THOR) outperformed vision-based agents (using first-person screenshots), suggesting challenges in visual grounding for current models in complex interactive tasks.
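As a rough illustration of the geolocation finding, the loop below sketches how an agent that explores and queries the web differs from single-image prediction. Every helper here (`env`, `vlm`, `web_search`, the confidence threshold) is a hypothetical stand-in rather than the paper's pipeline.

```python
def geolocate(env, vlm, web_search, max_steps: int = 20):
    """Hypothetical active-geolocation loop: move, collect clues, query the web,
    and stop once the coordinate estimate is confident."""
    clues = []
    guess = None
    for _ in range(max_steps):
        view = env.observe()                    # current street-level panorama
        clues.append(vlm.describe(view))        # e.g., "street sign: 'Walnut St'"
        evidence = web_search(" ".join(clues))  # e.g., Wikipedia lookup of spotted landmarks
        guess, confidence = vlm.estimate_coordinates(clues, evidence)
        if confidence > 0.9:                    # stop early when the estimate stabilizes
            return guess
        env.step("Forward")                     # otherwise keep exploring the graph
    return guess
```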
Practical Implications:
The research highlights that building truly integrated AI agents requires more than just combining existing web and embodied systems. The key challenges lie at the intersection of these domains.
- Perceptual Grounding: Linking abstract digital instructions (e.g., recipe steps) to high-dimensional physical world observations (e.g., visual state of food) is crucial.
- Cross-Domain Planning: Agents need to intelligently decide when to switch between physical actions and digital information retrieval, especially when information from one domain might contradict or supplement the other.
- Coherent Representation: Maintaining a persistent representation that bridges physical and digital contexts is necessary.
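As a toy illustration of these three requirements, the loop below keeps one persistent memory across domains and lets the model decide when to issue `switch_environment`. It assumes an interface like the earlier `EmbodiedWebEnv` sketch, and `llm.generate` is a hypothetical call, not a method proposed in the paper.

```python
def run_task(task: str, env, llm, initial_obs, max_steps: int = 50):
    """Toy cross-domain control loop with a single persistent memory.

    Facts from both domains (recipe steps read on the web, object states seen
    in the kitchen) accumulate in the same memory, and the LLM picks the next
    action -- including `switch_environment [msg]` -- from that shared context.
    """
    memory = [f"Task: {task}"]
    obs = initial_obs
    for _ in range(max_steps):
        prompt = "\n".join(memory + [
            f"Active domain: {env.active}",
            f"Embodied observation: {obs.embodied}",
            f"Web observation: {obs.web}",
            "Next action (physical, web, switch_environment [msg], or stop):",
        ])
        action = llm.generate(prompt).strip()  # hypothetical LLM call
        if action == "stop":
            break
        obs = env.step(action)
        memory.append(f"Did: {action}")        # persistent record bridging both domains
    return memory
```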
The publicly available datasets, code, and environments aim to foster research in this direction. The benchmark provides a systematic way to assess and drive progress in AI systems that can fluidly operate across physical and digital realms. The findings suggest that future work should focus on improving cross-domain reasoning, planning, and grounding capabilities of AI agents. A limitation acknowledged is the reliance on simulated environments, which may not fully capture real-world complexities.