INTERACT: An AI-Driven Extended Reality Framework for Accesible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition

Published 7 Apr 2026 in cs.CE, cs.AI, cs.CL, cs.CV, and cs.ET | (2604.05605v1)

Abstract: Video conferencing has become central to professional collaboration, yet most platforms offer limited support for deaf, hard-of-hearing, and multilingual users. The World Health Organisation estimates that over 430 million people worldwide require rehabilitation for disabling hearing loss, a figure projected to exceed 700 million by 2050. Conventional accessibility measures remain constrained by high costs, limited availability, and logistical barriers, while Extended Reality (XR) technologies open new possibilities for immersive and inclusive communication. This paper presents INTERACT (Inclusive Networking for Translation and Embodied Real-Time Augmented Communication Tool), an AI-driven XR platform that integrates real-time speech-to-text conversion, International Sign Language (ISL) rendering through 3D avatars, multilingual translation, and emotion recognition within an immersive virtual environment. Built on the CORTEX2 framework and deployed on Meta Quest 3 headsets, INTERACT combines Whisper for speech recognition, NLLB for multilingual translation, RoBERTa for emotion classification, and Google MediaPipe for gesture extraction. Pilot evaluations were conducted in two phases, first with technical experts from academia and industry, and subsequently with members of the deaf community. The trials reported 92% user satisfaction, transcription accuracy above 85%, and 90% emotion-detection precision, with a mean overall experience rating of 4.6 out of 5.0 and 90% of participants willing to take part in further testing. The results highlight strong potential for advancing accessibility across educational, cultural, and professional settings. An extended version of this work, including full pilot data and implementation details, has been published as an Open Research Europe article [Tantaroudas et al., 2026a].

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper presents a comprehensive AI framework integrating Whisper ASR (mean latency 1.2 s, >85% accuracy), NLLB translation, and real-time 3D ISL avatar rendering.
It employs a modular architecture with scalable AI modules deployed on Meta Quest 3, sustaining up to 1,000 users and handling 10,000 concurrent requests in stress tests.
User evaluations reported a 92% satisfaction rate and high technical performance, demonstrating effective and inclusive communication for diverse settings.

INTERACT: An AI-Driven Extended Reality Framework for Accessible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition

Introduction and Motivation

The paper "INTERACT: An AI-Driven Extended Reality Framework for Accessible Communication Featuring Real-Time Sign Language Interpretation and Emotion Recognition" (2604.05605) systematically addresses accessibility deficits persisting in contemporary video conferencing platforms, particularly for deaf, hard-of-hearing, and multilingual users. Conventional solutions rely heavily on costly human interpreters and offer only piecemeal technological support, frequently lacking cohesive, real-time integration of speech-to-text, sign language, translation, and affective cues within a unified environment.

INTERACT is introduced as an end-to-end, AI-driven XR architecture, designed to collapse these silos and deliver comprehensive accessibility. By integrating modules for ASR (Whisper), multilingual NMT (NLLB), ISL avatar rendering, and emotion analysis (DistilRoBERTa), deployed on hardware such as Meta Quest 3 headsets, the platform aims to set a technical and practical benchmark for immersive, inclusive communication systems.

Figure 1: INTERACT system architecture showing modular integration across CORTEX2 and AI service components from audio input to XR output.

System Architecture and Technical Realization

The INTERACT system is architected within the CORTEX2 framework, utilizing modular, independently scalable AI services for each core accessibility channel. The pipeline comprises:

Speech-to-Text Conversion: Chunked real-time ASR utilizing Whisper Large-v2, implemented with streaming via WebSockets and optimized overlap-window chunking for transcript latency and resilience (mean latency 1.2 s, accuracy >85%).
Figure 2: Speech-to-Text processing pipeline using chunking, Whisper inference, and text assembly to yield live transcriptions.
Multilingual Translation: The NLLB distilled model (600M parameters) performs language detection and translation, achieving mean BLEU-aligned quality with an 0.8 s mean latency for English–French, engineered for extensibility to further TMs.
Figure 3: Multilingual translation workflow from source recognition to NLLB inference and distribution of translated sequence to downstream modules.
ISL Gesture Extraction and 3D Avatar Animation: ISL video corpora are processed via MediaPipe Holistic for pose estimation, extracting 3D skeletal trajectories, which are then mapped to Unity avatar rigs for signing synthesis (vocabulary coverage: 750 signs). Avatar behavior is evaluated in terms of spatial fidelity and communicative efficiency.
Figure 4: ISL gesture extraction pipeline showing the transformation from source video frames to 3D trajectory for avatar animation.
Real-Time Avatar Rendering: 3D ISL avatars are visualized within the XR meeting environment, tracked from multiple perspectives to ensure sign intelligibility and alignment with the communicative scene.
Figure 5: 3D avatar performing ISL signs from multiple angles within the XR environment, demonstrating practical deployment for sign language dissemination.
Emotion Analysis: The emotion pipeline applies DistilRoBERTa to transcribed text, classifying utterances into six emotional states, with outputs affecting avatar facial rigs and optional emoji overlays, realized under 200 ms latency with >90% precision.
Figure 6: Emotion detection integration where textual transcriptions are annotated for affect and reflected in avatar facial modulation.
Summarisation: Accumulated transcripts are processed with a fine-tuned BART-Large on the SAMSum dataset, producing structured meeting records (decisions, action items) accessible asynchronously or for post-event review.
Figure 7: Meeting summarisation: BART-driven process yielding concise highlights and key points for archival and accessibility purposes.
XR Environment and Conferencing Integration: The Unity/Meta XR stack supports ergonomic placement of avatars and content overlays, integrated with the Rainbow SDK for audio and participant management.
Figure 8: Full virtual office setting illustrating participant spatialization and signing avatar integration for maximal accessibility.

Figure 9: SDK and AI module integration within Unity, visualizing data flow among conferencing, AI, and rendering subsystems.
End-to-End Communication: Real-time ISL avatar animation, synchronized with emotion detection, is rendered as participants communicate, leveraging the streaming and modular architecture.
Figure 10: VR scene demonstrating ISL avatar animation and emotion feedback in response to live spoken interactions.

Evaluation and Empirical Results

A two-stage pilot evaluation protocol was adopted: technical experts assessed system reliability and modifiability; deaf community members evaluated linguistic accuracy, avatar expressiveness, and accessibility value. Notably:

User Satisfaction: Achieved 92% satisfaction rate; mean experience score 4.6/5.0.
Transcription: Consistently above 85% accuracy in controlled conversational settings.
Emotion Recognition: Validated at 90%+ precision against annotator consensus for principal categories.
Scalability: System sustained up to 1,000 simultaneous users without latency inflation; stress testing demonstrated 10,000 concurrent requests at 900–1,000 requests/s with zero failures.
Figure 11: Load testing graphs confirming robust scalability under increasing user demand with negligible errors or latency rise.

Participant feedback strongly emphasized the requirement for further enhancement of avatar facial expression for authentic ISL grammatical and affective communication, as well as expedited expansion of language pairs and vocabulary customizability.

Limitations

Primary limitations concern ISL vocabulary coverage (currently 750 items) and insufficient facial expression integration in avatar signing. The pipeline’s fallback on fingerspelling for out-of-vocabulary terms and limited support for languages outside English–French reduce the system’s universality. Evaluations, while methodologically rigorous, were based on modest sample sizes, warranting expanded future studies.

Implications and Prospective Developments

The INTERACT framework establishes a practical pathway toward genuinely accessible XR communication for sensory and linguistic minorities in professional and educational domains. Practically, the platform enables scalable deployment across a spectrum of collaborative contexts, contingent on further extension of sign language corpora (including advanced facial/grin markers), incremental addition of TM directions, and broader integration with conferencing platforms.

Theoretical implications underscore the benefit of modular, interface-driven AI pipeline architectures for accessibility: each component can be replaced or extended as new SOTA models emerge (e.g., neural sign generation supplanting dictionary- or template-based methods). End-user involvement is demonstrated to be central for evaluating not only performance but authenticity and communicative success.

Development trajectories will likely include adoption of S2S neural models tailored for ISL sequence generation, dynamic adaptation of signing pace to context/user preference, incorporation of lip-sync and ISL grammatical features, and multimodal input/output channels. The architecture offers a testbed for real-world evaluation of affective computing and ASR/NMT paradigms in multi-agent, immersive settings.

Conclusion

INTERACT provides a rigorously validated, modular, and scalable blueprint for XR-based, AI-driven accessible communication platforms, substantiated by high user satisfaction and technical performance metrics. Its open-source design, strong empirical validation with diverse stakeholders, and extensible architecture position it as a reference for future applied research and large-scale deployment in accessible XR communication systems (2604.05605).

Markdown Report Issue