AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering

Published 7 Apr 2026 in cs.CE, cs.AI, cs.CL, cs.CY, and cs.ET | (2604.05591v1)

Abstract: This work introduces a modular platform that brings together six AI services: automatic speech recognition via OpenAI Whisper, multilingual translation through Meta NLLB, speech synthesis using AWS Polly, emotion classification with RoBERTa, dialogue summarisation via flan-t5-base-samsum, and International Sign (IS) rendering through Google MediaPipe. A corpus of IS gesture recordings was processed to derive hand landmark coordinates, which were subsequently mapped onto three-dimensional avatar animations inside a virtual reality (VR) environment. Validation comprised technical benchmarking of each AI component, including comparative assessments of speech synthesis providers and multilingual translation models (NLLB-200 and EuroLLM 1.7B variants). Technical evaluations confirmed the suitability of the platform for real-time XR deployment. Speech synthesis benchmarking established that AWS Polly delivers the lowest latency at a competitive price point. The EuroLLM 1.7B Instruct variant attained a higher BLEU score than NLLB. These findings establish the viability of orchestrating cross-modal AI services within XR settings for accessible, multilingual language instruction. The modular design permits independent scaling and adaptation to varied educational contexts, providing a foundation for equitable learning solutions aligned with European Union digital accessibility goals.



Summary

The paper demonstrates a novel architecture integrating six modular AI microservices to support real-time, multimodal language education for hearing and deaf users.
It reports a BLEU score of 84.34 for EuroLLM Instruct translation and International Sign rendering in under 300 ms per sign, alongside Whisper speech recognition and AWS Polly speech synthesis.
The platform’s scalable, service-oriented design in a Unity XR environment on Meta Quest 3 paves the way for future innovations in accessible education.

AI-Driven Modular Multilingual Education in XR: Integrative Accessible Services for Hearing and Deaf Learners

System Overview and Architectural Innovations

The paper "AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering" (2604.05591) presents an extensible, service-oriented architecture composed of six modular AI microservices deployed in a Unity-based XR environment. The platform targets critical accessibility gaps in language education, particularly for deaf and hard-of-hearing users, via bidirectional language interaction among speech, text, and sign modalities.

The architecture is deployed on AWS infrastructure and is designed for real-time, low-latency interaction for both hearing and deaf participants. Each AI service (automatic speech recognition (ASR), multilingual text translation, text-to-speech (TTS), sentiment analysis, dialogue summarisation, and International Sign (IS) gesture rendering) is implemented independently, which promotes maintainability, modular upgrade paths, and independent scaling.

Figure 1: High-level overview of the XR platform architecture, depicting AI microservices accessed by 3D avatars and supporting text, speech, and sign communication.

The pipeline from speech input to avatar output integrates Whisper ASR for transcription, Meta NLLB or EuroLLM for translation, AWS Polly for TTS synthesis, a RoBERTa-based sentiment classifier, Flan-T5 for dialogue summarisation, and a MediaPipe/Unity-based IS animation stack. RESTful APIs serve as the primary interface across all modules.
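
The flow from speech input to avatar output can be sketched as a chain of service calls. The following is a minimal illustration, not the authors' implementation: the `Services` container, the callable signatures, and the stub behaviours are all assumptions, and each callable stands in for one REST request. Summarisation is left out because it operates on a whole session rather than a single utterance.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AvatarOutput:
    transcript: str      # ASR output in the source language
    translation: str     # translated text shown to the learner
    audio: bytes         # synthesised speech used for lip-sync
    sentiment: str       # label that drives emoticon feedback
    sign_glosses: list   # IS gesture identifiers to animate


@dataclass
class Services:
    # Each field stands in for one REST call; the signatures are assumptions.
    asr: Callable[[bytes], str]
    translate: Callable[[str, str], str]
    tts: Callable[[str, str], bytes]
    sentiment: Callable[[str], str]
    to_signs: Callable[[str], list]


def run_pipeline(audio_in: bytes, target_lang: str, svc: Services) -> AvatarOutput:
    """Chain the services for one utterance: ASR, translation, then outputs."""
    transcript = svc.asr(audio_in)
    translation = svc.translate(transcript, target_lang)
    return AvatarOutput(
        transcript=transcript,
        translation=translation,
        audio=svc.tts(translation, target_lang),
        sentiment=svc.sentiment(transcript),
        sign_glosses=svc.to_signs(translation),
    )


# Hypothetical stub services so the sketch runs without network access.
stub = Services(
    asr=lambda audio: audio.decode("utf-8"),
    translate=lambda text, lang: f"[{lang}] {text}",
    tts=lambda text, lang: text.encode("utf-8"),
    sentiment=lambda text: "positive" if "thank" in text.lower() else "neutral",
    to_signs=lambda text: [w.upper() for w in text.split() if w.isalpha()],
)

result = run_pipeline(b"thank you", "el", stub)
```

Because every service sits behind the same callable interface, a translation backend could in principle be swapped (for example NLLB for EuroLLM) without touching the rest of the chain, which is the property the modular design aims for.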

Speech and Translation Modules: Evaluation and Findings

Speech-to-text transcription is powered by Whisper, delivering robust multilingual ASR with real-time throughput. Benchmarking demonstrates high transcription accuracy even in Greek (Figure 2), affirming Whisper's capacity to generalize to diverse phonetic domains in XR contexts.

Figure 2: Whisper ASR demonstrates high-fidelity speech-to-text conversion, illustrated with Greek input.

For text translation, Meta NLLB-200 and EuroLLM 1.7B (base and Instruct) were evaluated comparatively. The EuroLLM Instruct variant produced a BLEU score of 84.34, outperforming NLLB's 79.25, with faster inference and acceptable resource profiles on consumer GPUs. The base EuroLLM yielded a BLEU of only 27.58, indicating that instruction tuning is necessary for this model to perform sentence-level translation well.
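
For readers unfamiliar with the metric, the sketch below shows how a corpus-level BLEU score on the 0 to 100 scale is computed from clipped n-gram precisions and a brevity penalty. It is a simplified illustration under stated assumptions (whitespace tokenisation, one reference per hypothesis, no smoothing); the summary does not say which BLEU tooling or tokenisation the authors used, so scores from this function are not directly comparable to the reported ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # whitespace tokenisation (assumption)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
exact = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on a mat"], reference)
```

An exact match scores 100, while a single substituted word already lowers every n-gram precision. On this scale, the reported gap between EuroLLM Instruct and NLLB is 84.34 − 79.25 = 5.09 BLEU points.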

Figure 3: NLLB-based translation workflow is demonstrated; the EuroLLM Instruct variant offers superior BLEU and inference speed.

AWS Polly was chosen for TTS production, exhibiting sub-100 ms latency at a competitive cost, with superior mean opinion scores (3.5–3.8) and robust service features for multilingual settings.

Sign Language Rendering, Sentiment, and Summarisation in XR

A distinctive contribution is scalable IS rendering on 3D avatars. The authors curated a dataset of 750 IS gesture videos, extracting 21-point 3D hand landmarks via MediaPipe, normalizing per-wrist, and mapping gesture sequences onto Unity-based avatars with sub-300 ms end-to-end latency. This enables dynamic, real-time IS communication integrated with spoken or textual content.
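
One plausible reading of the per-wrist normalisation step is sketched below: each frame's 21 landmarks are translated so that the wrist becomes the origin, then scaled by a reference bone length. The landmark indices follow the MediaPipe Hands convention (0 is the wrist, 9 is the middle-finger MCP joint); the choice of scale reference and the function name `normalize_hand` are assumptions, since the summary does not describe the exact procedure.

```python
import math

WRIST = 0        # MediaPipe Hands index of the wrist landmark
MIDDLE_MCP = 9   # MediaPipe Hands index of the middle-finger MCP joint


def normalize_hand(landmarks):
    """Translate 21 (x, y, z) points so the wrist is the origin, then scale
    by the wrist-to-middle-MCP distance (the scale reference is an assumption)."""
    if len(landmarks) != 21:
        raise ValueError("expected 21 hand landmarks")
    wx, wy, wz = landmarks[WRIST]
    centred = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]
    scale = math.dist((0.0, 0.0, 0.0), centred[MIDDLE_MCP])
    if scale == 0.0:
        return centred  # degenerate frame: leave it unscaled
    return [(x / scale, y / scale, z / scale) for x, y, z in centred]


# Synthetic frame for illustration only; real input would come from MediaPipe.
frame = [(0.5 + 0.01 * i, 0.5 - 0.02 * i, 0.0) for i in range(21)]
normalized = normalize_hand(frame)
```

Normalising in this way makes a gesture independent of where the hand appeared in the video frame and how large it was, which is convenient before retargeting the pose onto an avatar rig.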

Figure 4: The IS gesture pipeline visualised, including MediaPipe-based landmark extraction and translation to avatar animation.

To enhance avatar social presence, sentiment analysis is mapped to emoticon displays (Figure 5), driven by a RoBERTa-based classifier that returns multi-class confidence scores. The Flan-T5-based summarisation service produces abstractive dialogue summaries that support session review for both hearing and deaf users.
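
The mapping from classifier output to an emoticon can be as simple as taking the highest-confidence label and falling back to a neutral display when the model is unsure. The sketch below illustrates this idea; the three-label set, the ASCII emoticons, and the 0.5 confidence threshold are assumptions rather than details from the paper.

```python
# Hypothetical label set and emoticons; the paper's actual classes may differ.
EMOTICONS = {"positive": ":)", "neutral": ":|", "negative": ":("}


def pick_emoticon(scores, min_confidence=0.5):
    """Return the emoticon for the top-scoring label, or the neutral one
    when the top score falls below min_confidence."""
    label = max(scores, key=scores.get)
    if scores[label] < min_confidence:
        return EMOTICONS["neutral"]
    return EMOTICONS.get(label, EMOTICONS["neutral"])


confident = pick_emoticon({"positive": 0.80, "neutral": 0.15, "negative": 0.05})
uncertain = pick_emoticon({"positive": 0.40, "neutral": 0.35, "negative": 0.25})
```

The fallback keeps the avatar from displaying a strong emotion on a weak prediction.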

Figure 5: Emoticon-based feedback linked to RoBERTa sentiment analysis enhances XR avatar expressiveness.

XR Learning Environment: User Experience Synthesis

The immersive classroom is implemented with Unity and deployed on Meta Quest 3, supporting multimodal interaction. Avatars blend lip-sync (for TTS) with IS gesture sequences (for sign output) and emotive cues, creating a unified communicative interface. The system supports bidirectional language and sign exchange: speech and text inputs are transcribed, translated, and delivered as both spoken and signed outputs (Figure 6).

Figure 6: MVP demonstration of multilingual, multimodal avatar interaction, with signed, spoken, and text translations plus sentiment feedback.

Figure 7: Immersive classroom in XR, visualizing a 3D avatar delivering cross-modal content.

Numerical Performance, Benchmarking, and Scalability

Key reported metrics include:

AWS Polly TTS: Consistent first-byte latency of 50–100 ms; MOS up to 3.8.
Translation BLEU: EuroLLM Instruct variant at 84.34 vs. NLLB at 79.25.
IS Animation Latency: < 300 ms per sign, enabling natural conversational pacing.
Scalability: 1,000 concurrent simulated users yielded average API response times under 800 ms with no critical service failures.
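
A concurrency check of the kind summarised in the scalability figure can be sketched with `asyncio`. The example is illustrative only: `fake_service_call` replaces a real HTTP request with a fixed sleep, the 10 ms delay is arbitrary, and the 800 ms budget is taken from the reported average response time. It shows the shape of such a measurement and does not reproduce the authors' test setup.

```python
import asyncio
import statistics
import time


async def fake_service_call(delay_s):
    # Stand-in for one REST request; a real test would issue an HTTP call here.
    await asyncio.sleep(delay_s)


async def timed_call(delay_s):
    """Measure the wall-clock latency of one simulated request in ms."""
    start = time.perf_counter()
    await fake_service_call(delay_s)
    return (time.perf_counter() - start) * 1000.0


async def simulate_users(n_users, delay_s=0.01):
    """Launch n_users simulated requests concurrently and collect latencies."""
    return await asyncio.gather(*(timed_call(delay_s) for _ in range(n_users)))


def summarise(latencies_ms, budget_ms=800.0):
    """Aggregate latencies and compare the mean against a budget."""
    mean = statistics.fmean(latencies_ms)
    return {
        "users": len(latencies_ms),
        "mean_ms": mean,
        "max_ms": max(latencies_ms),
        "within_budget": mean < budget_ms,
    }


report = summarise(asyncio.run(simulate_users(1000)))
```

Reporting the maximum (or a high percentile) next to the mean is useful because an average under 800 ms can still hide slow outliers that a learner would notice.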

These findings indicate that the platform meets the latency and throughput demands of real-time XR applications.

Theoretical Implications and Future Directions

The integration of multimodal (text, speech, sign) AI microservices within a single scalable XR architecture directly addresses the persistent lack of accessibility for deaf and hard-of-hearing users in language learning systems. Use of International Sign enables cross-border, cross-linguistic communication, aligning with European policy mandates on digital inclusion and accessibility.

The demonstration that instruction-tuned generalist LLMs (EuroLLM Instruct) can surpass purpose-built translation models (NLLB) on both quality and inference speed for European language pairs motivates further exploration of LLM transfer and fusion for low-resource multimodal translation. The architectural modularity enables rapid updating as stronger models emerge.

Current limitations are the restricted scope of the IS vocabulary (750 gestures), the absence of non-manual marker synthesis, and the lack of published user studies or formal usability evaluation. Future work should expand the IS corpus, develop avatar facial/action units for nuanced signing, and evaluate learning outcomes with standardised instruments. Multi-user interaction and formal educational pilots represent critical next steps.

Conclusion

This research provides a compelling case for service-oriented, modular AI integration of speech, translation, and sign language in immersive XR environments, validated by strong latency benchmarks and translation quality. The platform substantially advances the technical and practical feasibility of accessible, multilingual education for both hearing and deaf users via real-time, cross-modal avatars. Anticipated future work includes expansion of gesture corpora, richer avatar expressivity, formal efficacy assessment, and deployment across diverse learning and business contexts.
