AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

Published 5 Jun 2026 in cs.CV | (2606.07326v1)

Abstract: Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. It allows the model to observe the agent's full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.


Authors (11)

Summary

The paper presents a unified framework that integrates egocentric video synthesis conditioned on full-body 3D actions and anchor-view customization.
The methodology employs hybrid-view training and projection-based spatial grounding to achieve state-of-the-art metrics in scene consistency and camera accuracy.
The results illustrate improved dynamic evolution and robust performance across both simulated and real-world environments.

Motivation and Problem Formulation

AnchorWorld addresses a longstanding challenge in interactive egocentric world modeling: achieving spatially precise, human-action-conditioned visual synthesis while supporting flexible, local world-state customization and dynamic evolution. Unlike prior work that relies on coarse control signals (e.g., camera trajectories, text prompts) or limited bodily action cues, AnchorWorld formalizes the task of world-customizable embodied egocentric simulation, which involves two distinct control streams:

Human Motion Control: Egocentric navigation and interaction are conditioned on full-body 3D actions, enabling nuanced correspondence between the visual output and the embodied agent's behavior, even for body parts outside the camera's field of view.
Anchor-View World Customization: World states are defined locally via anchor views, each comprising an appearance image, a spatial pose, and an evolution prompt, affording explicit, spatially grounded specification of scene content, including regions initially out-of-sight.

AnchorWorld handles both streams in a unified framework: first-person and third-person viewpoints are jointly leveraged for robust action-driven synthesis, and anchor views enable temporally controlled, spatially coherent world evolution.

Figure 1: AnchorWorld synthesizes egocentric videos conditioned on human action and initial ego-view frame, and supports explicit world customization via anchor views.

Methodology: Unified Action and World-State Control

Hybrid-View Action Conditioning

AnchorWorld incorporates hybrid-view training: third-person videos provide full visibility of body motion and interactions for auxiliary supervision, while first-person videos deliver the natural egocentric perspective. Action conditioning is projection-based, pairing SMPL-X human motion with camera pose, fusing root global navigation and local limb interactions.

Spatial pose attention injects the motion encoding and camera trajectory into the latent video representation, ensuring frame-wise alignment and robust spatial grounding. This approach circumvents the view truncation inherent to egocentric data and benefits from rich third-person-view (TPV) supervision.
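
To make the idea of projection-based conditioning concrete, the sketch below projects world-space 3D joints (for example, joints taken from an SMPL-X body) into a camera using a standard pinhole model. It is an illustration only, not the paper's implementation: the function name `project_joints`, the world-to-camera convention `p_cam = R p_world + t`, and the intrinsics `fx, fy, cx, cy` are all assumptions introduced here.

```python
from typing import List, Optional, Sequence, Tuple

# (u, v, depth, in_view); u and v are None when the joint is behind the camera.
Projection = Tuple[Optional[float], Optional[float], float, bool]


def project_joints(
    joints_world: Sequence[Sequence[float]],
    R_wc: Sequence[Sequence[float]],  # 3x3 world-to-camera rotation (assumed convention)
    t_wc: Sequence[float],            # world-to-camera translation (assumed convention)
    fx: float, fy: float, cx: float, cy: float,  # hypothetical pinhole intrinsics
    width: int, height: int,
) -> List[Projection]:
    """Project 3D joints into pixel coordinates with a pinhole camera model."""
    results: List[Projection] = []
    for p in joints_world:
        # Transform the joint from world coordinates into the camera frame.
        x, y, z = (
            sum(R_wc[i][k] * p[k] for k in range(3)) + t_wc[i]
            for i in range(3)
        )
        if z <= 1e-6:
            # Behind (or on) the image plane: no valid pixel location.
            results.append((None, None, z, False))
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        in_view = 0 <= u < width and 0 <= v < height
        results.append((u, v, z, in_view))
    return results
```

A joint that lands outside the image bounds still receives well-defined coordinates and a depth, which is one plausible way truncated or out-of-view body parts can remain part of the conditioning signal.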

Figure 2: Egocentric video synthesis conditioned on spatial pose-aware action streams and anchor views; anchor view attributes (RGB, pose, evolution prompt) are injected for world customization.

Anchor-View World Customization

World customization is realized through anchor views. Each anchor includes:

RGB Image: Specifies local appearance.
6-DoF Pose: Explicit spatial grounding within a unified world coordinate system.
Evolution Prompt: Semantic description governing local state changes.
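
The three attributes listed above map naturally onto a small record type. The following is a minimal sketch of such a container; the class name `AnchorView`, the field names, and the choice of translation plus roll/pitch/yaw for the 6-DoF pose are illustrative assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnchorView:
    """One anchor view: appearance, world-frame pose, and evolution prompt."""
    image_path: str                              # RGB image giving local appearance
    position: Tuple[float, float, float]         # translation in the shared world frame
    rotation_rpy: Tuple[float, float, float]     # roll, pitch, yaw in radians (assumed parameterization)
    evolution_prompt: str                        # text describing how the local scene changes

    def pose_6dof(self) -> Tuple[float, ...]:
        """Return the pose as a flat 6-vector (x, y, z, roll, pitch, yaw)."""
        return self.position + self.rotation_rpy


# Hypothetical example: an anchor placed behind the agent's starting view.
fireplace = AnchorView(
    image_path="anchors/fireplace.png",
    position=(2.0, 0.0, -1.5),
    rotation_rpy=(0.0, 0.0, 3.14159),
    evolution_prompt="the fire slowly dies down to embers",
)
```

Because every anchor pose is expressed in the same world frame, an anchor can describe a region that is not visible in the initial ego view.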

Anchor views are injected as context by concatenating their image latents with spatial pose embeddings, and a rotary position embedding (RoPE) is used to disambiguate the views. Evolution prompts are incorporated through masked cross-attention, which restricts each prompt's semantic influence to the relevant visual tokens so that the resulting dynamics stay localized.
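
The effect of masking in cross-attention can be shown with a small, dependency-free sketch. It is a generic masked scaled dot-product attention, not the paper's layer: the function name, the list-of-lists tensors, and the convention that a visual token with no permitted text token receives a zero vector are assumptions made for illustration.

```python
import math
from typing import List


def masked_cross_attention(
    queries: List[List[float]],   # visual tokens, shape [Nq][d]
    keys: List[List[float]],      # text tokens, shape [Nk][d]
    values: List[List[float]],    # text values, shape [Nk][dv]
    mask: List[List[bool]],       # mask[i][j] is True if query i may attend to key j
) -> List[List[float]]:
    """Scaled dot-product cross-attention restricted by a boolean mask."""
    d = len(queries[0])
    dv = len(values[0])
    output = []
    for q, allowed_row in zip(queries, mask):
        # Score only the permitted keys; masked keys get no weight at all.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if allowed else None
            for k, allowed in zip(keys, allowed_row)
        ]
        valid = [s for s in scores if s is not None]
        if not valid:
            # Assumed convention: fully masked tokens are left unaffected by text.
            output.append([0.0] * dv)
            continue
        top = max(valid)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) if s is not None else 0.0 for s in scores]
        total = sum(weights)
        output.append(
            [sum(w * v[c] for w, v in zip(weights, values)) / total for c in range(dv)]
        )
    return output
```

With such a mask, visual tokens belonging to one anchor attend only to that anchor's evolution prompt, so a prompt written for one region cannot alter another.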

Progressive Multi-Stage Training

AnchorWorld employs progressive, multi-stage training:

Stage I: Pre-train on TPV action data for projection-based spatial grounding.
Stage II: Adapt to first-person-view (FPV) action using a head-aligned camera trajectory.
Stage III: Static anchor-view customization for pose-aware consistency.
Stage IV: Dynamic evolution with anchor-specific prompts for temporally evolving world states.

Figure 3: Multi-stage training progression, sequentially equipping action control and anchor-view customization.
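
The four stages can be written down as a simple schedule in which each stage adds one conditioning signal on top of the earlier ones. This is only a sketch of the progression described above: the `Stage` type, the signal labels, and the assumption that signals accumulate across stages are introduced here for illustration and do not reflect the authors' training configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    adds: str  # conditioning signal introduced at this stage (illustrative label)


SCHEDULE: List[Stage] = [
    Stage("I", "tpv_action"),          # projection-based grounding on TPV data
    Stage("II", "fpv_action"),         # head-aligned camera trajectory
    Stage("III", "anchor_image_pose"), # static anchor-view customization
    Stage("IV", "evolution_prompt"),   # anchor-specific dynamic evolution
]


def signals_after(stage_name: str) -> List[str]:
    """Return the signals available once the named stage has finished,
    assuming each stage keeps what earlier stages introduced."""
    signals = []
    for stage in SCHEDULE:
        signals.append(stage.adds)
        if stage.name == stage_name:
            return signals
    raise ValueError(f"unknown stage: {stage_name}")
```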

Experimental Results

Quantitative Evaluation

AnchorWorld reports the best or on-par results on the evaluated metrics:

Scene Consistency: Highest GIM-based matched pixel ratio ($4493.4$ for static ego scenes), CLIP-V semantic similarity ($0.885$), PSNR ($16.06$), SSIM ($0.578$), and lowest LPIPS ($0.470$).
Camera Accuracy: Lowest Absolute Translation Error ($0.112$), Relative Translation Error ($0.029$), and Relative Rotation Error ($3.145$).
Dynamic Evolution: Outperforms baselines on VideoAlign text alignment ($0.717$).
Video Quality: Matches or exceeds baselines on VBench composite measures ($0.748$ for static, $0.885$0 for dynamic scenes).
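
For readers unfamiliar with the camera-accuracy metrics, the sketch below shows one common way to compute absolute and relative translation errors between a predicted and a ground-truth camera trajectory. The exact protocol behind the reported numbers (trajectory alignment, scale handling, units) is not given in this summary, so the definitions and function names here are assumptions.

```python
import math
from typing import List, Sequence


def absolute_translation_error(
    pred: Sequence[Sequence[float]], gt: Sequence[Sequence[float]]
) -> float:
    """RMSE between corresponding camera positions.
    Assumes the two trajectories are already expressed in the same frame."""
    squared = [sum((p - g) ** 2 for p, g in zip(a, b)) for a, b in zip(pred, gt)]
    return math.sqrt(sum(squared) / len(squared))


def relative_translation_error(
    pred: Sequence[Sequence[float]], gt: Sequence[Sequence[float]]
) -> float:
    """Mean error of frame-to-frame displacements, which ignores constant offsets."""
    errors: List[float] = []
    for i in range(1, len(pred)):
        step_pred = [pred[i][k] - pred[i - 1][k] for k in range(3)]
        step_gt = [gt[i][k] - gt[i - 1][k] for k in range(3)]
        errors.append(math.dist(step_pred, step_gt))
    return sum(errors) / len(errors)


# A trajectory shifted sideways by a constant offset: the absolute error
# reflects the offset while the relative error does not.
gt_traj = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred_traj = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
ate = absolute_translation_error(pred_traj, gt_traj)
rte = relative_translation_error(pred_traj, gt_traj)
```

The two metrics are complementary: absolute error captures accumulated drift over the whole clip, while relative error measures how faithfully each step of motion is followed.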

Strong generalization is observed on out-of-distribution Unreal Engine (UE) renders and real-world scenes, particularly in cases with non-overlapping viewpoints and anchor-view manipulation.

Qualitative Analysis

AnchorWorld achieves robust egocentric action control and scene consistency under large viewpoint changes, and superior dynamic scene evolution driven by anchor-specific evolution prompts.

Figure 4: Qualitative comparison: improved egocentric action control, scene consistency, and dynamic evolution relative to baselines.

Figure 5: AnchorWorld generalizes to rendered UE scenes and real-world capture, maintaining spatial integrity across diverse settings.

The figures show precise spatial pose awareness, effective evolution of out-of-sight regions, and anchor-specific control whether or not the anchor's local appearance overlaps with the current ego view.

Figure 6: Model infers state changes in dynamic elements beyond observed views, reasoning about scene evolution.

Figure 7: AnchorWorld maintains spatial pose awareness: pose flipping experiments demonstrate overlap-based detail retrieval.

Figure 8: Evolution prompt manipulation drives local scene transitions, confirming prompt-controllable dynamic evolution.

Ablation Studies and Analyses

Ablations on training stage composition, anchor-view pose encoding, projection-based control, and RoPE embedding affirm the necessity of each design, with marked deterioration in action accuracy and scene consistency upon removal of key components.

Increasing the number of anchor views further improves scene consistency metrics, substantiating the model's ability to integrate multi-view spatial constraints.

Limitations and Future Directions

AnchorWorld is constrained by the base model capacity, yielding inconsistent fine-grained details in locally complex regions and blurry artifacts when training data contains rapid viewpoint changes. Future directions include:

Long-horizon exploration: Extending to real-time, autoregressive, interactive scenarios by emphasizing persistent memory and updating world state in response to actions.
Open-world data: Incorporating broader environments and more diverse activities for robust generalization.
Multi-anchor dynamic controls: Enabling anchor-specific evolution prompts and modeling natural dynamic world evolution across arbitrary anchor views.

Figure 9: Failure cases: inconsistent texture under complex local structure and blur artifacts arising from rapid egocentric motion.

Conclusion

AnchorWorld establishes a scalable framework for embodied egocentric simulation, integrating hybrid-view action supervision with spatially explicit, anchor-driven world customization, further empowering text-driven local dynamic control. Extensive empirical validation underscores its superiority in action-conditioned synthesis, spatial consistency under viewpoint transitions, and flexible dynamic world evolution. This work paves the way for future advancements in embodied AI, interactive virtual environments, and customizable real-time simulation, with implications for robotics, AR/VR, and human-centric generative visual modeling.
