Shared Embodied View in Collaborative Systems
- Shared Embodied View is a paradigm where agents align sensorimotor inputs and conceptual schemas to create synchronized, immersive representations.
- Implementation strategies range from VR head-pose anchoring to multimodal transformer architectures that fuse vision, language, and proprioception.
- Empirical studies reveal enhanced embodiment and agency, reducing transformation costs while improving coordination in dynamic, multi-agent environments.
A shared embodied view is a state or system architecture in which two or more agents—human or artificial—maintain mutually consistent, bodily situated, and dynamically aligned representations of the world, actions, and experiences. This concept is instantiated in diverse domains including collaborative VR/AR teleoperation, multimodal AI, reference resolution in language and gesture, joint task performance, and computational modeling of emotion. Technically, achieving a shared embodied view involves bidirectional sharing or transformation of viewpoints, control, perceptual feedback, and semantic grounding so that the physical, cognitive, and affective boundaries between participants become fluid or unified, thereby enhancing coordination, agency, and alignment. A shared embodied view can be realized via direct sensorimotor alignment, perspective synchronization, multimodal representation fusion, or symbolically coordinated conceptual structures.
1. Formal Definitions and Theoretical Foundations
The shared embodied view (SEV) paradigm in VR teleoperation is defined as a perspective in which a guest user's virtual camera is rigidly anchored to the host’s head position and body pose, while camera orientation is independently controlled by the guest’s HMD (Zhou et al., 31 Jan 2026). Theoretical motivation comes from hand–eye coordination and telepresence literature, showing that sharing an operator’s first-person view minimizes transformation costs between proprioception and visual feedback. SEV supports a strong sense of ownership and agency (“self-embodiment”) by simplifying mapping between limb control and avatar response.
In embodied AI and social cognition, shared embodied view is extended to computational architectures where multiple agents align their sensorimotor representations, affordances, and conceptual schemas for robust common grounding (Incao et al., 2024, Olivier et al., 31 Mar 2025). In collaborative multi-agent settings, meaning arises only when symbols are grounded “in a shared history of bodily interactions, temporally structured expectations, and social negotiation” (Incao et al., 2024). Shared embodiment is instantiated at three levels: active bodily system (sensorimotor anchoring and translation), temporally structured experience (coherent predictive processing across agents), and social skills for common ground (joint attention, perspective taking, referential negotiation).
2. Technical Architectures and Implementation Strategies
Teleoperation and Avatar Sharing
The canonical SEV setup employs two HMDs (e.g. Meta Quest 3), where:
- Guest’s camera position is locked to the host’s head (world coordinates).
- Camera orientation derives from the guest’s HMD rotation.
- Rendering provides a full-screen egocentric avatar view for the guest (Zhou et al., 31 Jan 2026).
Synchronization is managed via low-latency networking (e.g., Photon Unity Networking, RTT < 5 ms). Guest and host roles can be alternated, and the system logs head kinematics and virtual supernumerary limb (VSL) inputs.
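The anchoring rule above can be sketched as a per-frame pose update. The `Pose` dataclass, the `guest_camera_pose` helper, and the Euler-angle convention are illustrative assumptions for this sketch, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Pose:
    position: tuple        # (x, y, z) in world coordinates
    yaw_pitch_roll: tuple  # orientation as Euler angles, degrees (assumed convention)

def guest_camera_pose(host_head: Pose, guest_hmd: Pose) -> Pose:
    # SEV rule: position is rigidly locked to the host's head, while
    # orientation follows the guest's own HMD rotation.
    return Pose(position=host_head.position,
                yaw_pitch_roll=guest_hmd.yaw_pitch_roll)

# Per-frame update: the guest looks around freely from the host's viewpoint.
host = Pose((1.2, 1.6, 0.0), (90.0, 0.0, 0.0))
guest = Pose((0.0, 1.7, 2.0), (-30.0, 10.0, 0.0))
cam = guest_camera_pose(host, guest)
```

Because only position is inherited, the guest retains independent gaze control, which is what preserves the sense of agency described above.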
Multi-User Collaboration and Spatial Alignment
SPARC architecture for multi-user VR collaboration implements shared embodied view by:
- Rotating each user's workspace into a common shared perspective, mapping all participants’ actions into a canonical frame.
- Dynamically stretching only remote avatar limbs (not bodies) for deictic gestures, minimizing occlusion while maintaining avatar separability at the table (Simões et al., 2024).
Coordinate transformations and dynamic limb distortion are mathematically formalized, enabling seamless, orientation-invariant reference and gestural communication.
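The rotation into a common frame can be illustrated with a planar sketch. The seat-angle convention and the `to_shared_frame` helper are hypothetical simplifications, not SPARC's actual formalization:

```python
import math

def to_shared_frame(point_xy, user_seat_angle_deg, table_center=(0.0, 0.0)):
    """Rotate a user's local action coordinates about the table center by
    their seat angle, mapping them into the canonical frame shared by all
    participants (2-D sketch of the published coordinate transformation)."""
    theta = math.radians(user_seat_angle_deg)
    cx, cy = table_center
    x, y = point_xy[0] - cx, point_xy[1] - cy
    xr = x * math.cos(theta) - y * math.sin(theta)
    yr = x * math.sin(theta) + y * math.cos(theta)
    return (xr + cx, yr + cy)

# Two users seated 90 degrees apart pointing at the same object in their
# own local frames land on identical canonical coordinates.
a = to_shared_frame((0.0, 0.5), user_seat_angle_deg=0)
b = to_shared_frame((0.5, 0.0), user_seat_angle_deg=90)
```

The same inverse rotation applied to avatars is what makes references orientation-invariant: a deictic gesture resolves to the same canonical target regardless of where the gesturer sits.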
Multimodal Representation Learning
Arcadia’s embodied learning framework achieves a shared embodied view by fusing vision, language, and proprioceptive modalities into a joint Transformer-based representation (Qwen2.5-VL backbone), with task-specific heads for navigation and manipulation (Gao et al., 25 Nov 2025). All input modalities are projected into a unified token sequence, processed by the shared encoder, and split only at the last layer—enforcing cross-task grounding and transfer.
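The unified token sequence can be sketched as follows; the `project` and `build_token_sequence` helpers and the tiny projection matrices are hypothetical illustrations of the fusion pattern, not the Arcadia codebase:

```python
def project(vecs, weight):
    """Per-modality linear projection into the shared embedding width."""
    return [[sum(v[i] * weight[i][j] for i in range(len(v)))
             for j in range(len(weight[0]))] for v in vecs]

def build_token_sequence(vision, language, proprio, projections):
    """Concatenate all modalities into one token sequence, tagging each
    token with its modality so a single shared encoder can attend across
    vision, language, and proprioception before task-specific heads split."""
    seq = []
    for name, feats in (("vision", vision), ("language", language),
                        ("proprio", proprio)):
        for emb in project(feats, projections[name]):
            seq.append({"modality": name, "embedding": emb})
    return seq

# Toy projection matrices mapping each modality into a shared width of 2.
projections = {
    "vision":   [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],              # 3 -> 2
    "language": [[1.0, 0.0], [0.0, 1.0]],                          # 2 -> 2
    "proprio":  [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]],  # 4 -> 2
}
seq = build_token_sequence(
    vision=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    language=[[1.0, 0.0]],
    proprio=[[2.0, 2.0, 0.0, 0.0]],
    projections=projections,
)
```

Keeping one sequence through the shared encoder and splitting only at the last layer is what forces the cross-task grounding described above.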
Mirror-neuron–inspired architectures directly align observed and executed action representations via contrastive learning in a shared latent space, maximizing mutual information and cross-modal synergy (Zhu et al., 25 Sep 2025).
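The bidirectional contrastive objective can be sketched as a symmetric InfoNCE loss over an observed-vs-executed similarity matrix; this minimal pure-Python version is an illustration of the technique, not the paper's training code:

```python
import math

def info_nce(sim, tau=0.1):
    """Symmetric InfoNCE over a similarity matrix sim[i][j] between
    observed-action embedding i and executed-action embedding j.
    Matched pairs lie on the diagonal; the loss pulls them together in the
    shared latent space while pushing mismatched pairs apart."""
    n = len(sim)
    def nll(row, i):  # negative log softmax probability of the matched column
        logits = [row[j] / tau for j in range(n)]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_z - logits[i]
    obs_to_exec = sum(nll(sim[i], i) for i in range(n)) / n
    exec_to_obs = sum(nll([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (obs_to_exec + exec_to_obs)

# A well-aligned similarity matrix (high diagonal) yields a small loss.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss maximizes a lower bound on the mutual information between the observation and execution views, which is the alignment pressure the architecture relies on.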
Cross-Agent Perspective and Symbolic Grounding
The REP framework for Embodied Reference Understanding builds a shared embodied view by transforming the receiver’s first-person image into the sender’s world-centered coordinate frame, estimating sender orientation, and fusing linguistic/gestural cues via attention and multimodal transformers. This geometric and semantic realignment is essential for resolving referential expressions dependent on viewpoint change (Shi et al., 2023).
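The geometric realignment step can be illustrated with a planar change of frame; the `receiver_to_sender_frame` helper and its yaw convention are hypothetical simplifications of the published transformation:

```python
import math

def receiver_to_sender_frame(p_world, sender_pos, sender_yaw_deg):
    """Re-express a point observed in the receiver's (world) frame in the
    sender's egocentric frame: translate to the sender's position, then
    rotate by the estimated sender orientation so references like
    'to my left' resolve correctly after the viewpoint change."""
    yaw = math.radians(sender_yaw_deg)
    dx = p_world[0] - sender_pos[0]
    dy = p_world[1] - sender_pos[1]
    # Rotate by -yaw so the sender's facing direction becomes the +y axis.
    xs = dx * math.cos(-yaw) - dy * math.sin(-yaw)
    ys = dx * math.sin(-yaw) + dy * math.cos(-yaw)
    return (xs, ys)

# A point one unit ahead of an unrotated sender stays directly ahead.
p = receiver_to_sender_frame((1.0, 1.0), sender_pos=(1.0, 0.0), sender_yaw_deg=0.0)
```

In REP the estimated orientation comes from the sender's pose, and the realigned geometry is then fused with linguistic and gestural cues by the multimodal transformer.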
Neurosymbolic approaches use LLM-augmented parsing and formal spatial logics (Quantified Equilibrium Logic, Declarative Spatial Reasoning) to ensure that human and AI representations of space, narrative, and force-dynamics are structurally equivalent at the conceptual level (Olivier et al., 31 Mar 2025).
3. Empirical Effects: Embodiment, Agency, and Coordination
Quantitative and qualitative results consistently reveal the following patterns:
- Embodiment: SEV maximizes Avatar Embodiment Questionnaire (AEQ) measures of agency, ownership, presence, and self-location (all p<.001, effect size d≈0.8–1.0) compared to stabilized or third-person views (Zhou et al., 31 Jan 2026). Participants report the strongest “being the avatar” sensations under SEV.
- Performance-Coordination Trade-offs: Out-of-body or stabilized views improve navigation efficiency (OOB ~30% faster), reduce errors (~20% lower for OOB), and decrease physiological stress (HRV up by ~7 ms, p<.05) but at the expense of reduced embodiment (Zhou et al., 31 Jan 2026).
- Task and Role Dependency: Coordination breaks down during sustained host or partner motion in SEV, inducing guest discomfort and elevated error rates. Hosts may hesitate to move to avoid disrupting guests, indicating friction between co-embodiment and independent action.
- Co-presence and Social Bonding: In co-viewing or body-swapping paradigms, visually embodied expressions increase perceived community, emotional contagion, and shared expressive norms, but also introduce tension between individual immersion and social accommodation (Ohara et al., 26 May 2025, He et al., 11 Sep 2025).
4. Methodologies for Building and Evaluating Shared Embodied View
| Approach | Key Methodological Elements | Relevant Citation |
|---|---|---|
| VR Teleoperation | Head-pose anchoring, camera orientation fusion, AEQ/HRV | (Zhou et al., 31 Jan 2026) |
| Collaboration (SPARC) | Shared perspective rotation, dynamic limb distortion, NASA-TLX, movement economy | (Simões et al., 2024) |
| Multimodal AI (Arcadia) | Transformer-based cross-modal fusion, navigation/manipulation task heads, loss ablation | (Gao et al., 25 Nov 2025) |
| Mirror-Neuron Alignment | Bidirectional InfoNCE, shared latent action space alignment | (Zhu et al., 25 Sep 2025) |
| Reference Understanding | View-rotation, sender–receiver relation, multimodal reasoning, IoU benchmarking | (Shi et al., 2023) |
| Neurosymbolic Schema | LLM-to-logic translation, ASP-based simulation, conceptual transfer | (Olivier et al., 31 Mar 2025) |
Experimental setups employ behavioral measures (task time, error, motion synchrony), physiological metrics (HR, HRV, fatigue indices), and subjective questionnaires (agency, ownership, co-presence, IOS). Statistical analysis typically involves mixed-effects models, repeated-measures ANOVA, or permutation testing for spatial and temporal dependencies.
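The permutation testing mentioned above can be sketched with a two-sample test on a behavioral measure; the synthetic data are illustrative, not results from any cited study:

```python
import random

def permutation_test(cond_a, cond_b, n_perm=10000, seed=0):
    """Two-sample permutation test on a behavioral measure (e.g. task time
    under SEV vs. a stabilized view). Shuffling condition labels builds the
    null distribution of the absolute mean difference; the p-value is the
    fraction of permuted differences at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(sum(cond_a)/len(cond_a) - sum(cond_b)/len(cond_b))
    pooled = list(cond_a) + list(cond_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(cond_a)], pooled[len(cond_a):]
        if abs(sum(a)/len(a) - sum(b)/len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Clearly separated synthetic task times vs. indistinguishable ones.
p_sep = permutation_test([10, 11, 12, 10, 11], [15, 16, 17, 15, 16])
p_same = permutation_test([10, 11, 12, 10, 11], [10, 12, 11, 11, 10])
```

Because it makes no distributional assumptions, the permutation approach is a common fallback when HRV or motion-synchrony measures violate ANOVA normality assumptions.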
5. Limitations, Trade-Offs, and Design Guidelines
SEV-enabled architectures exhibit role-dependent trade-offs:
- Strengths: Maximal embodiment, intuitive sensorimotor mapping, unambiguous hand–eye coordination for stationary or brief movements.
- Weaknesses: Susceptibility to cybersickness and disorientation during sustained or abrupt locomotion, reduced performance on extended navigational tasks, coordination friction between host and guest (Zhou et al., 31 Jan 2026).
- Guidelines:
- Use SEV as the default for tasks requiring high embodiment.
- Limit continuous host-driven camera motion to <10 s bursts.
- Trigger perspective-switching cues and recommendations dynamically.
- Combine SEV with perceptual indicators (haptics, visual depth cues) to support alignment without rigid coupling.
- Support flexible role switching and adaptive embodiment strength depending on task phase and user preference.
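The burst-limiting and cue-triggering guidelines can be enforced with a simple guard; `MotionBurstGuard` is a hypothetical sketch of the policy, not a component of the cited system:

```python
class MotionBurstGuard:
    """Tracks continuous host-driven camera motion and flags when a burst
    exceeds a budget (the <10 s guideline), so the system can trigger a
    perspective-switch cue for the guest. Time is passed in explicitly to
    keep the policy deterministic and easy to test."""
    def __init__(self, max_burst_s=10.0):
        self.max_burst_s = max_burst_s
        self.burst_start = None  # timestamp when continuous motion began

    def update(self, t, host_moving):
        """Call once per frame; returns True when a cue should fire."""
        if not host_moving:
            self.burst_start = None  # any pause resets the burst budget
            return False
        if self.burst_start is None:
            self.burst_start = t
        return (t - self.burst_start) >= self.max_burst_s

guard = MotionBurstGuard()
fired = [guard.update(t, moving) for t, moving in
         [(0, True), (5, True), (10, True), (11, False), (12, True)]]
```

A real deployment would pair the cue with the recommended fallback (e.g., offering a stabilized or out-of-body view) rather than forcibly switching perspective.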
Collaborative systems (e.g. SPARC) should spatially align user perspectives, apply minimal but sufficient visual distortions, and tune occlusion handling parameters to preserve both social and functional nonverbal communication (Simões et al., 2024).
6. Generalization, Transfer, and Cognitive Implications
Mirror-neuron–inspired and Transformer-fusion architectures facilitate transfer between action understanding and execution, as well as cross-task generalization, by enforcing representational alignment in a shared space (Zhu et al., 25 Sep 2025, Gao et al., 25 Nov 2025). Symbolic approaches reveal that sharing high-level conceptual primitives makes human–AI interaction more transparent and debuggable, aligning internal reasoning traces with intuitive human conceptualizations (Olivier et al., 31 Mar 2025).
Cognitively, reciprocal body-swapping in MR modulates the balance between self–other distinction and social closeness: increased behavioral interdependence and co-presence coincide with decreased spatial action interference (Joint Simon Effect) and reduced ownership, indicating distinct mechanisms for sensorimotor coupling and affective alignment (He et al., 11 Sep 2025). Emotional sharing architectures (e.g., embodied expressive avatars) amplify emotional contagion and collaborative norms, but also introduce individual–group tension (Ohara et al., 26 May 2025).
7. Open Problems and Future Directions
Advancing shared embodied view research entails:
- Holistic, truly multimodal sensorimotor grounding and mutual calibration across agents (Incao et al., 2024).
- Embodiment-agnostic schemas for universal transfer across diverse morphologies.
- Social-cognitive scaffolding of shared experience: joint attention, incremental theory of mind, repair in breakdowns.
- Formal evaluation metrics for alignment of internal models, prediction error minimization across agents, and resilience under communication noise.
- Real-time, scalable rendering and dynamic adjustment for real-world deployment in AR/VR, robotics, and human–AI teams (Wang et al., 24 Jun 2025).
Integration of neural, symbolic, and sensorimotor components is seen as essential for robust, interpretable, and adaptive shared embodiment across human–human and human–AI collectives.