Language-free Experience at Expo 2025 Osaka

Published 1 May 2026 in cs.CL | (2605.00373v1)

Abstract: In line with the Global Communication Plan 2025, we have pursued the development of multilingual translation technologies to realize a language-barrier-free experience at Expo 2025 Osaka. Our work includes the advancement of simultaneous interpretation systems emphasizing high translation quality and low latency. Key achievements include chunk-based input segmentation, context-aware translation, and multi-engine machine translation technologies. Through demonstration deployments and collaboration with private companies, our technologies have led to real-world applications, with several services and systems showcased at Expo 2025 Osaka.


Authors (5)

Summary

The paper introduces a multilingual simultaneous interpretation system that integrates adaptive neural segmentation, context tagging, and multi-engine translation for low-latency performance.
The system achieves notable BLEU gains through context-aware translation, and chunk-based segmentation shortens translation units by 20–60% relative to full sentences across languages, which reduces latency.
The deployment at Expo 2025 demonstrates practical real-world applications, including real-time subtitles, avatar-based booths, and multilingual guided tours.

Language-Free Communication at Expo 2025 Osaka: System Overview and Evaluation

System Architecture and Core Technologies

The described multilingual simultaneous interpretation system, developed at NICT, targets the elimination of language barriers at Expo 2025 Osaka via speech-to-speech and speech-to-text translation covering 15 languages. Its architecture is optimized for high translation quality and low latency and integrates several advanced components:

Automatic Speech Recognition (ASR): Converts input speech to text in real time.
Online Input Segmentation: Segments ASR output into semantic chunks using recursive neural architectures that operate online; the lookahead window (parameter $N$) is set adaptively by optimization on a development set.
Context-Aware Neural Machine Translation (NMT): Implements both chunk-based and sentence-based translation, supported by context tagging and adaptation layers; context tags encode dialogue metadata such as scene, speaker, subject, and gender. Extended context modeling draws from prior utterance history to disambiguate and improve translation.
Multi-Engine MT Framework: Runs several translation engines simultaneously (general-purpose, domain-adapted, universal NMT, and an LLM-based engine, RWKV). The output is selected automatically online by back-translating each candidate and comparing it with the source by cosine similarity.
Parallel Sentence/Chunk Translation: Low-latency chunk-based outputs are promptly delivered, while full-sentence NMT provides retrospective high-accuracy retranslation when chunk and sentence boundaries align.

Language Resources and Corpus Construction

The underlying corpora are central to system performance, encompassing:

Basic Multilingual Conversation Corpus: Simulated multi-domain dialog (e.g., medical, disaster prevention, tourism), initially authored in Japanese and human-translated to other target languages, with speaker and domain annotations for model adaptation.
Simultaneous Interpretation Corpus: Human interpreters split original speech and conversation transcripts into semantic chunks, and the chunk translations are concatenated in sequence. Directions include Japanese-to-other (14 targets) and several other-to-Japanese.

This design allows for both high-coverage bilingual data and alignment with the peculiarities of simultaneous interpretation: short, semantically complete chunks instead of full sentences, reflecting naturally occurring word order divergences.

Input Segmentation and Contextual Adaptation

A notable technical advancement is the end-to-end segmentation-driven translation pipeline. Recursive neural networks trained on the interpretation corpus deliver adaptive segmentation, considering both immediate and future tokens based on language-specific optimization. This online segmentation closes the gap between the unstructured ASR output and the requirements for effective NMT.
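The role of the lookahead window can be illustrated with a minimal sketch. It assumes a boundary scorer that sees the current chunk and the next N tokens; the names `segment_online` and `toy_scorer`, the threshold, and the rule-based scorer are hypothetical stand-ins for the trained neural model, which the source does not specify in detail.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical scorer signature: (current chunk, lookahead tokens) -> boundary score in [0, 1].
BoundaryScorer = Callable[[List[str], List[str]], float]


def toy_scorer(chunk: List[str], lookahead: List[str]) -> float:
    """Rule-based stand-in for the neural model (an assumption, not the paper's model)."""
    if chunk and chunk[-1].endswith((",", ".", "?")):
        return 1.0  # punctuation closes a chunk
    if lookahead and lookahead[0] in {"and", "but", "because"}:
        return 0.9  # an upcoming conjunction suggests a boundary here
    return 0.0


def segment_online(
    tokens: Iterable[str],
    scorer: BoundaryScorer,
    lookahead_n: int = 2,
    threshold: float = 0.5,
) -> Iterator[List[str]]:
    """Yield chunks from a token stream, deciding each boundary with N tokens of lookahead."""
    buffer: List[str] = []  # tokens received but not yet assigned to a chunk
    chunk: List[str] = []
    for tok in tokens:
        buffer.append(tok)
        # A token is committed only once N later tokens are available.
        while len(buffer) > lookahead_n:
            chunk.append(buffer.pop(0))
            if scorer(chunk, buffer[:lookahead_n]) >= threshold:
                yield chunk
                chunk = []
    # End of stream: commit the remaining tokens with a shrinking lookahead.
    while buffer:
        chunk.append(buffer.pop(0))
        if scorer(chunk, buffer[:lookahead_n]) >= threshold:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


stream = "we arrived late but the tour had not started".split()
for n in (0, 1):
    print(n, list(segment_online(stream, toy_scorer, lookahead_n=n)))
```

With N = 0 the toy scorer never sees the upcoming "but" and emits a single long chunk; with N = 1 it cuts before the conjunction. A larger N gives the scorer more evidence but delays every decision by N tokens, which is why tuning N on a development set is a latency and quality trade-off.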

Context-aware NMT leverages both intra- and extra-sentential information by formulating context tags and passing conversation history. Integration of scene and speaker tags demonstrably improves the alignment between system outputs and conversational intent, especially in ambiguous cases.
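One common way to pass such context to an NMT model is to prepend tags and recent utterances to the source string. The sketch below assumes that approach; the tag syntax (`<scene:...>`), the `<sep>` separator, and the names `DialogueContext` and `build_nmt_input` are illustrative assumptions, not the format used by the NICT system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DialogueContext:
    """Dialogue metadata named in the text: scene, speaker, subject, gender, plus history."""
    scene: Optional[str] = None
    speaker: Optional[str] = None
    subject: Optional[str] = None
    gender: Optional[str] = None
    history: List[str] = field(default_factory=list)  # prior source utterances, oldest first


def build_nmt_input(
    source: str,
    ctx: DialogueContext,
    max_history: int = 2,
    sep: str = " <sep> ",
) -> str:
    """Prefix the source with context tags and up to `max_history` previous utterances."""
    fields = (
        ("scene", ctx.scene),
        ("speaker", ctx.speaker),
        ("subject", ctx.subject),
        ("gender", ctx.gender),
    )
    # Only tags that are actually set are emitted (hypothetical tag syntax).
    tags = [f"<{name}:{value}>" for name, value in fields if value]
    previous = ctx.history[-max_history:] if max_history > 0 else []
    return sep.join(previous + [" ".join(tags + [source])])


ctx = DialogueContext(
    scene="medical",
    speaker="doctor",
    history=["Hello.", "Please sit down.", "Where does it hurt?"],
)
print(build_nmt_input("How long has it lasted?", ctx))
```

In this toy example the pronoun "it" in the final utterance is only resolvable from the preceding question, which is the kind of ambiguity that history and scene tags are meant to reduce.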

Multi-Engine Selection and Evaluation

The simultaneous interpretation system runs a multi-engine translation infrastructure in which outputs from GPMT, TSEG, UNIV, and an LLM-based engine (RWKV) are ranked by an online back-translation metric. Each candidate is translated back into the source language, and the candidate whose back-translation vector has the highest cosine similarity to the source vector is selected for that utterance, so engine selection requires no human intervention.
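The selection step can be sketched as follows. The source does not say how the vectors are computed, so this example assumes simple bag-of-words counts; the function names and the toy engines are hypothetical, and a real system would likely use sentence embeddings and actual MT engines.

```python
import math
from collections import Counter
from typing import Callable, Dict, Tuple

Translator = Callable[[str], str]


def embed(text: str) -> Counter:
    """Bag-of-words vector (an assumption; the vector type is not given in the source)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_engine(
    source: str,
    engines: Dict[str, Translator],
    back_translate: Translator,
) -> Tuple[str, str, float]:
    """Return (engine name, translation, score) for the best back-translation match."""
    if not engines:
        raise ValueError("at least one engine is required")
    src_vec = embed(source)
    scored = []
    for name, translate in engines.items():
        hypothesis = translate(source)
        score = cosine(src_vec, embed(back_translate(hypothesis)))
        scored.append((name, hypothesis, score))
    # Highest similarity wins; ties go to the engine listed first.
    return max(scored, key=lambda item: item[2])


# Toy engines and back-translator, hard-coded for illustration only.
engines = {
    "GPMT": lambda s: "eki wa doko desu ka",
    "UNIV": lambda s: "eki desu",
}
back = {"eki wa doko desu ka": "where is the station", "eki desu": "it is the station"}
print(select_engine("where is the station", engines, back.__getitem__))
```

Here the first candidate round-trips to the source exactly (similarity 1.0) while the second loses the question word (similarity 0.75), so the first is selected. Round-trip similarity is only a proxy for quality: an engine that copies the source unchanged would also score perfectly, so a practical system would need safeguards beyond this sketch.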

BLEU-based evaluations show that the TSEG-S engine (sentence-level segmentation, context-aware) outperforms both fixed-length segmentation (BASE) and context-free NMT (GPMT), with average BLEU gains of 3.03 points over GPMT and 10.84 points over BASE; the gains are especially large for Asian languages (e.g., Khmer +6.78). Chunk-based translation (TSEG-C) incurs an average BLEU penalty of 3.68 points relative to sentence-level TSEG-S, but the latency reduction is substantial: chunks average 39% shorter than sentence-level segments for Japanese, with reductions of 20–60% across languages.
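The three reported gaps are all measured against TSEG-S, so the remaining pairwise gaps follow by subtraction. The short calculation below assumes that all three averages were taken over the same language set and test data, which this summary does not state; the helper name `implied_gap` is illustrative.

```python
# Average BLEU gaps reported in the text, each measured relative to TSEG-S.
GAP_TO_TSEG_S = {"GPMT": 3.03, "BASE": 10.84, "TSEG-C": 3.68}


def implied_gap(better: str, worse: str) -> float:
    """Average BLEU advantage of `better` over `worse`, implied by their gaps to TSEG-S.

    Valid only if the averages share one language set (an assumption).
    """
    return round(GAP_TO_TSEG_S[worse] - GAP_TO_TSEG_S[better], 2)


print(implied_gap("TSEG-C", "BASE"))  # 7.16
print(implied_gap("GPMT", "TSEG-C"))  # 0.65
print(implied_gap("GPMT", "BASE"))    # 7.81
```

Under that assumption, chunk-based TSEG-C still sits about 7 BLEU points above the fixed-length baseline and within about 0.65 points of context-free GPMT, which suggests that context modeling recovers most of the quality lost to chunking.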

Deployment and Usability

System components underpin real-world applications deployed at Expo 2025, including "EXPO Honyaku" (multilingual chat), "EXPO Honyaku Remote" (multilingual guided tours), real-time seminar subtitles, and immersive avatar-based interpretation booths. These use cases demonstrate both scalability and cross-modal integration (speech-to-speech, speech-to-text, avatar interfaces).

Chunk-based translation with sentence-level retranslation achieves a pragmatic balance between real-time responsiveness and semantic fidelity, minimizing user-perceived delays while maintaining interpretive quality.
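A minimal sketch of this two-track output is shown below. It assumes that each chunk is emitted as a provisional translation immediately, and that a sentence-level retranslation replaces the provisional chunks once a sentence boundary is reached. The `Output` record, the `interpret` function, and the toy translators are hypothetical; the summary does not describe how revisions are delivered to the display.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Output:
    kind: str      # "chunk" = provisional low-latency output, "sentence" = revision
    text: str
    replaces: int  # how many preceding provisional chunk outputs this revision supersedes


def interpret(
    chunks: Iterable[str],
    translate_chunk: Callable[[str], str],
    translate_sentence: Callable[[str], str],
    is_sentence_end: Callable[[str], bool],
) -> Iterator[Output]:
    """Emit a translation per chunk, then a full-sentence retranslation at each boundary."""
    pending: List[str] = []  # source chunks of the sentence still in progress
    for chunk in chunks:
        pending.append(chunk)
        yield Output("chunk", translate_chunk(chunk), 0)
        if is_sentence_end(chunk):
            # Chunk and sentence boundaries align: retranslate the whole sentence.
            yield Output("sentence", translate_sentence(" ".join(pending)), len(pending))
            pending = []


# Toy stand-ins for the chunk-level and sentence-level engines.
chunks = ["we arrived late,", "but the tour had not started.", "please follow me."]
for out in interpret(
    chunks,
    translate_chunk=str.upper,
    translate_sentence=lambda s: f"[{s}]",
    is_sentence_end=lambda c: c.endswith((".", "?", "!")),
):
    print(out)
```

The user sees each chunk translation as soon as it is available, and the display can later swap the last `replaces` chunk outputs for the revised sentence, so perceived delay is bounded by chunk length rather than sentence length.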

Implications and Future Prospects

Practically, this multilingual simultaneous interpretation platform represents a significant step toward ubiquitous, high-quality machine-mediated cross-linguistic communication in dynamic, real-world environments characterized by fast turn-taking and topic shifts.

Theoretically, the results support the effectiveness of context modeling, neural segmentation, and multi-engine frameworks for low-latency translation, which is a promising direction for future advances in real-time MT and dialogue systems. The architecture is amenable to further improvements via large-scale LLMs, advanced context tracking (dialogue history), and adaptation to additional languages.

Expected future work includes:

Scaling to broader language coverage and diverse modal inputs
Integration with multi-modal AI agents (vision, audio-visual cues)
Enhanced personalized adaptation using richer user and context profiles
Tighter fusion with LLMs for generalization and flexibility in open-domain conversation

Conclusion

The NICT system for Expo 2025 presents a robust, extensible solution to real-time, language-independent communication via simultaneous interpretation. Its modular design, grounded in neural segmentation, context adaptation, and multi-engine MT, achieves notable BLEU improvements and practical latency gains across multiple language pairs. These outcomes have direct implications for deployments in multilingual events, digital public services, and future society showcases, establishing a foundation for further evolution of AI-driven multilingual communication systems (2605.00373).
