Multisensory Extended Reality (XR)
- Multisensory XR is a technology that integrates visual, auditory, haptic, and physiological signals to create immersive digital environments across VR, AR, and MR.
- It employs advanced sensor fusion and low-latency real-time architectures, using methods like Bayesian inference and Kalman filtering to synchronize diverse sensory inputs.
- Multisensory XR drives innovations in clinical, industrial, and social domains by delivering adaptive feedback and collaborative environments that enhance cognitive and perceptual tasks.
Multisensory Extended Reality (XR) encompasses immersive digital environments and interfaces that synthesize and coordinate multiple sensory channels—visual, auditory, haptic, and, increasingly, physiological and neural signals—to create adaptive, interactive experiences that more accurately reflect and augment real-world perception and action. This field covers the full Reality–Virtuality continuum, including Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR), and extends towards shared, collaborative multisensory environments (e.g., XV) that integrate real, virtual, and social domains. Multisensory XR systems enable richer evaluation of cognitive, social, and sensorimotor states, facilitate biomedical and industrial applications, and demand advanced multimodal signal processing, sensor fusion, streaming, and feedback architectures.
1. Theoretical Foundations and Taxonomies
Extended Reality (XR) is formally defined as the set-theoretic union of all “X-Realities”: XR = ∪_{X ∈ ℝeality} X-Reality, where X ranges over physical, virtual, augmented, and other modalities (Mann et al., 2022). XR technologies traverse Milgram and Kishino’s Reality–Virtuality continuum, where VR (α=0, β=1), AR (α>0, β>0), and Diminished Reality (DR) occupy distinct but connected loci in the α (“atoms”/physical) and β (“bits”/virtual) plane.
The XV (“eXtended meta/omni/uni/Verse”) framework further introduces a third axis γ (“genes”/sociality), positing XV as a three-dimensional taxonomy (α, β, γ) encoding physicality, virtuality, and social collaboration. This construct permits precise characterization of shared/multisensory systems, including those that extend human perception to otherwise imperceptible phenomena (infrared, RF, ultrasound) and enable collaborative overlays in industrial and medical contexts (Mann et al., 2022).
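This taxonomy lends itself to a direct computational encoding. The sketch below is a minimal, hypothetical illustration of classifying systems by their (α, β, γ) coordinates; the class labels and threshold rules are assumptions for illustration, not definitions from Mann et al. (2022):

```python
from dataclasses import dataclass

@dataclass
class XRSystem:
    """A point in the (alpha, beta, gamma) taxonomy space."""
    alpha: float  # physicality ("atoms"), 0..1
    beta: float   # virtuality ("bits"), 0..1
    gamma: float  # sociality ("genes"), 0..1

    def classify(self) -> str:
        # Illustrative decision rules only; Mann et al. define the axes,
        # not these exact thresholds.
        if self.gamma > 0:
            return "XV (shared/collaborative)"
        if self.alpha == 0 and self.beta == 1:
            return "VR"
        if self.alpha > 0 and self.beta > 0:
            return "AR/MR"
        return "physical reality"

print(XRSystem(alpha=0.0, beta=1.0, gamma=0.0).classify())  # VR
print(XRSystem(alpha=0.7, beta=0.4, gamma=0.5).classify())  # XV (shared/collaborative)
```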
Multisensory XR explicitly models and exploits the integration of discrete sensory modalities. In the clinical and human-factors literature, multisensory integration is often abstracted as a weighted sum over sensory channel signals, S = Σ_{m ∈ {v,a,h,p}} w_m s_m, where v, a, h, p denote the visual, auditory, haptic, and proprioceptive channels, s_m the per-channel signals, and w_m individualized weights (Bauer et al., 2021).
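Concretely, the abstraction reduces to a few lines of code; the channel values and per-user weights below are invented for the example:

```python
# Hypothetical per-channel evidence values (normalized to 0..1) and
# individualized weights for one user; all numbers are illustrative only.
signals = {"v": 0.8, "a": 0.6, "h": 0.3, "p": 0.5}
weights = {"v": 0.5, "a": 0.2, "h": 0.2, "p": 0.1}  # sums to 1

# Weighted sum over modalities: S = sum_m w_m * s_m
integrated = sum(weights[m] * signals[m] for m in signals)
print(f"integrated percept estimate: {integrated:.2f}")  # 0.63
```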
2. Sensory Modalities, Devices, and Measured Signals
Multisensory XR systems incorporate parallel hardware and software for simultaneous acquisition, rendering, and closed-loop adaptation of diverse sensory signals (González-Erena et al., 14 Jan 2025, Wang et al., 27 Mar 2025, Xu et al., 14 Nov 2025, Krieger et al., 2023). Primary modalities include:
- Visual: High-resolution stereoscopic HMDs (e.g., Apple Vision Pro, HTC Vive Pro 2), projectors, wide-FOV AR displays; metrics include frame rate, resolution, and luminance (Wang et al., 27 Mar 2025).
- Auditory: Binaural/spatial audio, real-time scene-aware rendering (see Section 3), head tracking for dynamic spatialization.
- Haptic: Wearable vibrotactile actuators, exoskeletons for force feedback, sensorized gloves (e.g., SenseGlove Nova) for precision object manipulation (Krieger et al., 2023).
- Proprioceptive and Body Tracking: 6-DoF tracking with optical or inertial systems, full-body kinematics (González-Erena et al., 14 Jan 2025, Krieger et al., 2023).
- Physiological/Neural: EEG (0.5–100 Hz), galvanic skin response (GSR, 10–100 Hz), eye tracking (30–1200 Hz), measuring cognitive load, attention, arousal (González-Erena et al., 14 Jan 2025).
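A typical first step in handling these streams is band-limiting each channel to its cited frequency range. The sketch below band-passes a simulated EEG trace to the 0.5–100 Hz band using SciPy; the sampling rate, filter order, and synthetic signal are assumptions for illustration:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 512.0  # assumed EEG sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)
# Simulated channel: 10 Hz alpha rhythm + slow drift + broadband noise
eeg = (np.sin(2 * np.pi * 10 * t)
       + 0.5 * np.sin(2 * np.pi * 0.1 * t)
       + 0.2 * np.random.randn(t.size))

# 4th-order Butterworth band-pass matching the 0.5-100 Hz band of interest
b, a = butter(4, [0.5, 100.0], btype="bandpass", fs=fs)
filtered = filtfilt(b, a, eeg)  # zero-phase filtering preserves timing
print(filtered.shape)  # (1024,)
```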
Emerging modalities, such as olfactory, gustatory, and temperature feedback, are identified as future extensions to deepen immersion (Wang et al., 27 Mar 2025, Xu et al., 14 Nov 2025).
3. Signal Processing, Multimodal Fusion, and Real-Time Architectures
Multisensory XR workflows require robust sensor fusion, low-latency synchronization, and adaptive feedback architectures. Key methodologies include the following (a combined fusion-and-adaptation sketch follows the list):
- Fusion Strategies: Early fusion (feature concatenation), late fusion (classifier output aggregation), Bayesian inference (P(C | O) ∝ P(C)∏ᵢ P(oᵢ|C)), Kalman filtering for continuous state estimation, weighted averaging with confidence-based adaptation, and methods designed to accommodate heterogeneous latency and sampling rates (González-Erena et al., 14 Jan 2025).
- Real-Time System Architecture: Typical pipelines comprise:
- Sensor/Actuator Layer: Synchronized hardware for multi-modal input/output.
- Acquisition/Preprocessing: Time-stamping, filtering, artifact rejection, normalization.
- Feature Extraction: Modality-specific signal processing (EEG band powers, GSR SCRs, eye-tracking saccades, body kinematics).
- Multimodal Fusion/State Estimation: Bayesian or Kalman-based engines yield estimates of attention, cognitive load, or engagement.
- XR Engine/Task Adaptation: Real-time adjustment of rendered content based on fused user state.
- Logging/Feedback: Storage for offline analysis, summary of physiological and behavioral metrics (González-Erena et al., 14 Jan 2025).
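As a concrete, deliberately simplified illustration of the fusion and adaptation layers, the sketch below applies the naive-Bayes rule P(C | O) ∝ P(C)∏ᵢ P(oᵢ|C) to discretized multimodal features and feeds the posterior into a difficulty controller. The states, likelihood tables, and adaptation thresholds are all invented; a deployed system would learn them per user:

```python
# --- Multimodal Fusion/State Estimation layer --------------------------------
STATES = ["low_load", "high_load"]
PRIOR = {"low_load": 0.5, "high_load": 0.5}

# P(observation | state) per modality for discretized features
# (hypothetical values; real tables would be calibrated per user).
LIKELIHOOD = {
    "eeg_theta": {"low":  {"low_load": 0.7, "high_load": 0.3},
                  "high": {"low_load": 0.3, "high_load": 0.7}},
    "gsr_scr":   {"low":  {"low_load": 0.8, "high_load": 0.4},
                  "high": {"low_load": 0.2, "high_load": 0.6}},
    "pupil_dia": {"low":  {"low_load": 0.6, "high_load": 0.2},
                  "high": {"low_load": 0.4, "high_load": 0.8}},
}

def fuse(observations: dict) -> dict:
    """Late Bayesian fusion: posterior over cognitive states given one
    discretized observation per modality."""
    post = dict(PRIOR)
    for modality, value in observations.items():
        for state in STATES:
            post[state] *= LIKELIHOOD[modality][value][state]
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

# --- XR Engine/Task Adaptation layer ------------------------------------------
def adapt(posterior: dict, difficulty: float) -> float:
    """Ease the task when estimated load is high, harden it when low
    (an illustrative rule, not a validated clinical protocol)."""
    p_high = posterior["high_load"]
    if p_high > 0.6:
        difficulty -= 0.1
    elif p_high < 0.4:
        difficulty += 0.1
    return max(0.0, min(1.0, difficulty))

obs = {"eeg_theta": "high", "gsr_scr": "high", "pupil_dia": "high"}
posterior = fuse(obs)
print(posterior)              # {'low_load': ~0.07, 'high_load': ~0.93}
print(adapt(posterior, 0.5))  # difficulty eased to 0.4
```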
For streaming, fundamental data and latency requirements are dictated by the end-to-end thresholds of the relevant perceptual reflexes (e.g., τ_end-to-end ≤ 7 ms for vestibulo-ocular stability), with bandwidth requirements for full-FOV, high-PPD streaming on the order of Tbps in naive uncompressed scenarios (Wang et al., 27 Mar 2025).
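A back-of-the-envelope calculation shows why the naive uncompressed case lands in the Tbps range; the field of view, pixel density, frame rate, and bit depth below are assumed illustrative values rather than figures from Wang et al.:

```python
# Raw bandwidth estimate for full-FOV, high-PPD streaming.
# All parameters are illustrative assumptions.
fov_h_deg, fov_v_deg = 360, 180   # full spherical field of view
ppd = 60                          # pixels per degree (~retinal acuity)
bits_per_pixel = 24               # 8-bit RGB, no compression
fps = 90                          # common XR refresh rate
eyes = 2                          # stereoscopic rendering

pixels = (fov_h_deg * ppd) * (fov_v_deg * ppd)   # 21600 x 10800 per eye
bps = pixels * bits_per_pixel * fps * eyes
print(f"{bps / 1e12:.2f} Tbps raw")              # ~1.01 Tbps
```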
4. Advanced Multisensory Rendering: Auditory, Haptic, and Crossmodal
Auditory XR: The SAMOSA system employs multimodal scene representation—integrating room geometry (shoebox estimation from SLAM), material segmentation (MobileNetV2/DeepLabv3+ pipeline on RGB), and semantic acoustic context detection—to synthesize physically and perceptually plausible room impulse responses (RIRs) for real-time, scene-aware auditory rendering (Xu et al., 14 Nov 2025). The RIR comprises direct sound, modeled early reflections (via the Image Source Method), and a hybrid late reverberation whose decay time follows Eyring’s equation, T₆₀ = 0.161V / (−S ln(1 − ᾱ)) for room volume V, total surface area S, and mean absorption coefficient ᾱ, with a learned perceptual correction. An advantage of SAMOSA is its on-device efficiency: <3.5 MB runtime footprint, <3% CPU, and statistically significant improvements in perceived naturalness and externalization (p < 0.05) over non-adaptive or geometry-only baselines (N=12 expert user study).
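For reference, the Eyring decay time for the late-reverberation tail can be computed directly from the shoebox estimate. In the sketch below, the room dimensions and absorption coefficient are invented, and the learned perceptual correction is represented only by a hypothetical multiplicative factor:

```python
import math

def eyring_t60(volume_m3: float, surface_m2: float, mean_absorption: float,
               perceptual_correction: float = 1.0) -> float:
    """Eyring reverberation time T60 = 0.161 V / (-S ln(1 - a_mean)).
    `perceptual_correction` stands in for SAMOSA's learned adjustment
    (a placeholder here, not the actual model)."""
    t60 = 0.161 * volume_m3 / (-surface_m2 * math.log(1.0 - mean_absorption))
    return t60 * perceptual_correction

# Illustrative 6 x 4 x 3 m shoebox room with mean absorption 0.3
V = 6 * 4 * 3                      # 72 m^3
S = 2 * (6*4 + 6*3 + 4*3)          # 108 m^2
print(f"T60 = {eyring_t60(V, S, 0.3):.2f} s")   # ~0.30 s
```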
Audio-Visual Source Separation: MoXaRt builds on cascaded, visually guided sound separation for object-centric auditory interaction (Xu et al., 11 Mar 2026). Coarse audio-only separation is refined by object detection on the video stream (YOLOv8-Face, DeepLabv3+), with teacher-student distillation used to align separated stems to objects (faces/instruments). This architecture enables real-time, multi-object separation with significant increases in listening comprehension (36.2% gain, p<0.01) and subjective metrics (113% higher ratings for clarity, absence of interference, and immersion; N=22), at ~0.8–2 s processing latency, approaching interactive thresholds.
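The cascaded design can be caricatured in a few lines: coarse audio-only stems are re-weighted by per-object visual detection confidences. Everything below (the gating rule, array shapes, confidence values) is a simplified hypothetical stand-in for MoXaRt’s learned refinement stage, not its actual architecture:

```python
import numpy as np

def refine_stems(coarse_stems: np.ndarray, visual_conf: np.ndarray) -> np.ndarray:
    """Toy visually guided refinement: attenuate coarse stems whose
    associated object is weakly detected in the video stream. A real
    system (e.g., MoXaRt's teacher-student stage) learns this mapping
    instead of applying a fixed gate.

    coarse_stems: (num_objects, num_samples) audio-only separated stems
    visual_conf:  (num_objects,) detection confidences in [0, 1]
    """
    return coarse_stems * visual_conf[:, None]

stems = np.random.randn(3, 16000)        # three coarse stems, 1 s at 16 kHz
conf = np.array([0.9, 0.1, 0.8])         # two objects clearly visible, one not
refined = refine_stems(stems, conf)
print(refined.shape)                     # (3, 16000)
```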
Haptic XR: Intuitive stereoptic and haptic exploration (the ISH3DE framework) for volumetric biomedical imaging couples VR headsets (HTC Vive Pro 2), sensorized gloves (SenseGlove Nova), and physically based force-feedback solvers to allow direct manual manipulation and palpation of organ models (Krieger et al., 2023). A stiffness-based model computes the per-fingertip force F(x) = k δ(x) n(x), where k is the contact stiffness, δ(x) the penetration depth, and n(x) the local surface normal. Usability studies (N=24) showed that hands-based haptic XR is perceived as more intuitive and beneficial for data exploration, with qualitative feedback emphasizing improved perception and engagement.
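A minimal numeric sketch of this contact model follows; the sphere proxy geometry and stiffness constant are assumptions for illustration:

```python
import numpy as np

def fingertip_force(fingertip: np.ndarray, center: np.ndarray,
                    radius: float, stiffness: float) -> np.ndarray:
    """Stiffness-based contact force F(x) = k * delta(x) * n(x) against a
    spherical proxy surface; returns the zero vector when not in contact."""
    offset = fingertip - center
    dist = np.linalg.norm(offset)
    depth = radius - dist              # penetration depth delta(x)
    if depth <= 0.0 or dist == 0.0:
        return np.zeros(3)
    normal = offset / dist             # outward surface normal n(x)
    return stiffness * depth * normal  # force pushes the fingertip back out

# Fingertip 5 mm inside a 50 mm organ-model sphere, k = 300 N/m (assumed)
f = fingertip_force(np.array([0.0, 0.0, 0.045]),
                    np.array([0.0, 0.0, 0.0]), 0.05, 300.0)
print(f)   # [0. 0. 1.5]  -> 1.5 N along +z
```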
5. Multisensory XR in Cognitive, Clinical, and Social Contexts
XR enables closed-loop assessment and training across memory, executive function, attention, spatial cognition, and social-emotional skills (González-Erena et al., 14 Jan 2025, Bauer et al., 2021). Key elements include:
- Real-time physiological and behavioral state monitoring (GSR, EEG, eye tracking, hand/body tracking).
- Adaptive task engines that adjust content or difficulty in response to detected user state, maximizing engagement and ecological validity (virtual tasks, context-sensitive prompts, AR overlays).
- Clinical applications and trials report measurable improvements in memory recall (20–30%), reductions in attentional lapses (15–25%), and gains in social confidence/engagement in target populations (MCI, ASD, ADHD), compared to traditional or single-modal interventions.
- Autism interventions leverage multisensory environments for mediation, sensory habituation, and agency, with practitioner guidelines emphasizing individualization of sensory profiles, structured predictability, clear presentation, and support for collaborative/co-present interaction. Quantitative and qualitative measures (e.g., test gains, anxiety index, behavioral logs, structured observations) inform outcome scoring (Bauer et al., 2021).
6. Challenges, Technical Limitations, and Evaluation Metrics
Significant open challenges persist:
- Latency and Bandwidth: Maintaining end-to-end system delays below perceptual thresholds (<7 ms for critical sensorimotor feedback) at extreme bandwidths for full-resolution, multi-modal content (Wang et al., 27 Mar 2025).
- Sensor Integration and Calibration: Heterogeneous sampling rates, synchronization, and noise/artifact control across modalities (EEG, GSR, eye tracking) are nontrivial, requiring robust preprocessing and calibration routines (González-Erena et al., 14 Jan 2025).
- User Comfort and Accessibility: Cybersickness remains a barrier (SSQ, VRSQ scoring), particularly in cases of visual–vestibular mismatch or suboptimal haptic feedback. Usability scales (SUS, NASA-TLX), presence inventories (PQ, IPQ), and ecological validity correlations are routinely employed for quantification (González-Erena et al., 14 Jan 2025, Krieger et al., 2023).
- Manual Interaction Fidelity: Current haptic device limitations (glove jitter, excessive resistance, incomplete kinematic calibration) are the primary bottleneck in tactile XR, although users strongly prefer direct, intuitive grasping over controller-based operation (Krieger et al., 2023).
- Multimodal Fusion and Underutilization: Most commercial XR platforms under-leverage full multisensory integration; opportunities exist for richer closed-loop adaptation and cross-modal feedback (González-Erena et al., 14 Jan 2025, Xu et al., 14 Nov 2025).
7. Future Directions and Prospects
Research trajectories in multisensory XR are converging on several fronts:
- Richer Sensor and Feedback Integration: Incorporating the full range of modalities (visual, acoustic, haptic, olfactory, physiological) for truly multisensory, adaptive interaction and communication, potentially structured as unified architectures parameterizable by user intent and context (Xu et al., 14 Nov 2025, González-Erena et al., 14 Jan 2025).
- Edge and Cloud-Assisted Computing: Leveraging 5G/6G, network slicing, and real-time offloading for scalable low-latency, high-bandwidth XR streaming and collaborative “XV” environments (Wang et al., 27 Mar 2025, Mann et al., 2022).
- Standardization and Protocols: Defining protocol stacks and APIs for cross-device, cross-modal streaming, and authoring; integrating AI-driven content generation, metadata tagging, and quality assessment (GANs, LLMs) (Wang et al., 27 Mar 2025).
- Clinical and Educational Validation: Large-scale, multi-center randomized controlled trials to establish efficacy, normative data, and standard protocols for diagnosis, rehabilitation, and training (González-Erena et al., 14 Jan 2025).
- Scalability and Ethics in Social XR: Addressing real-time collaborative overlays (γ>0 in XV), privacy, consent, and the social implications of filtering or augmenting shared sensory experiences (Mann et al., 2022, Xu et al., 11 Mar 2026).
Multisensory XR defines a rapidly advancing intersection of perceptual augmentation, real-time multimodal computation, and human-centric adaptation. It underpins emerging applications in medicine, education, social interaction, industrial inspection, and cognitive science, while presenting a technical and theoretical agenda centered on full-spectrum, user-aware, adaptive multisensory immersion.