Towards a Japanese Full-duplex Spoken Dialogue System (2506.02979v1)

Published 3 Jun 2025 in cs.CL and eess.AS

Abstract: Full-duplex spoken dialogue systems, which can model simultaneous bidirectional features of human conversations such as speech overlaps and backchannels, have attracted significant attention recently. However, research on full-duplex spoken dialogue systems for the Japanese language remains scarce. In this paper, we present the first publicly available full-duplex spoken dialogue model in Japanese, which is built upon Moshi, a full-duplex dialogue model in English. Our model is trained through a two-stage process: pre-training on large-scale spoken dialogue data in Japanese, followed by fine-tuning on high-quality stereo spoken dialogue data. We further enhance the model's performance by incorporating synthetic dialogue data generated by a multi-stream text-to-speech system. Evaluation experiments demonstrate that the trained model outperforms Japanese baseline models in both naturalness and meaningfulness.

Summary

  • The paper presents the first publicly available Japanese full-duplex dialogue model, J-Moshi, adapted from the English Moshi system.
  • It employs vocabulary adaptation, extensive pre-training on 69K hours of pseudo-stereo data, and fine-tuning on 344 hours of real dialogue to capture Japanese conversational traits.
  • Evaluations show that J-Moshi models outperform Japanese dGSLM baselines in naturalness and meaningfulness, though further enhancements in speech synthesis remain needed.

Full-duplex spoken dialogue systems, which are capable of handling simultaneous speech features common in human conversation like speech overlaps and backchannels, have become a focus of research. While significant progress has been made, particularly with models like Moshi (2410.00037) in English, there has been a lack of research and publicly available systems for other languages, including Japanese. This paper addresses this gap by presenting the first publicly available Japanese full-duplex spoken dialogue model, named J-Moshi, adapted from the English Moshi model.

The core architecture of J-Moshi is based on Moshi, which consists of two main components: Mimi [(2009.02095), 2021 soundstream, 2023 high], a neural audio codec that encodes speech waveforms into discrete audio tokens, and the RQ-Transformer (2410.00037), a Transformer that autoregressively models the resulting sequences of text and audio tokens. Mimi processes both the user's input speech and the system's output speech, converting between waveforms and discrete tokens (semantic and acoustic tokens across multiple codebook layers). The RQ-Transformer, built on an LLM backbone, handles the temporal dynamics of the conversation, predicting the next tokens for both the user's and the system's speech and text.
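
For intuition, here is a minimal sketch, not taken from the authors' code, of the token layout this architecture implies; the codec API, the codebook count, and the single shared text stream are assumptions made for illustration.

```python
# Minimal, illustrative sketch of the multi-stream token layout described above.
# Assumptions: a Mimi-like codec object exposing encode(waveform) -> (n_codebooks,
# n_frames) arrays, 8 codebooks per audio channel, and one shared text stream.

import numpy as np

N_CODEBOOKS = 8  # assumed: 1 semantic + 7 acoustic codebook levels per channel


def encode_stereo_dialogue(system_wav, user_wav, codec):
    """Encode the system and user channels into equally long token grids."""
    sys_tokens = codec.encode(system_wav)   # shape (N_CODEBOOKS, T)
    usr_tokens = codec.encode(user_wav)     # shape (N_CODEBOOKS, T)
    assert sys_tokens.shape == usr_tokens.shape
    return sys_tokens, usr_tokens


def stack_streams(text_tokens, sys_tokens, usr_tokens):
    """Stack text + system audio + user audio into one (1 + 2 * N_CODEBOOKS, T) grid.

    Roughly speaking, the RQ-Transformer models this grid autoregressively: a
    temporal Transformer steps over frames, and a smaller depth Transformer
    predicts the individual streams within each frame.
    """
    return np.concatenate([np.asarray(text_tokens)[None, :], sys_tokens, usr_tokens], axis=0)
```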

To adapt Moshi to Japanese, the authors combined several key modifications with a two-stage training process (pre-training followed by fine-tuning), plus synthetic data augmentation:

  1. Japanese Text Vocabulary Adaptation: The original English SentencePiece tokenizer was replaced with a Japanese SentencePiece model taken from a Japanese GPT-2 model [https://huggingface.co/rinna/japanese-gpt2-medium], and the RQ-Transformer weights tied to text token embeddings were re-initialized to accommodate the new vocabulary (a sketch of this step appears after the list).
  2. Pre-training: The model was pre-trained on J-CHAT (2407.15828), a large-scale Japanese spoken dialogue corpus containing approximately 69,000 hours of monophonic audio. Since Moshi expects stereo input (separate channels for the system and the user), speaker diarization [Bredin23] was applied to the monophonic data to create pseudo-stereo samples. Transcriptions were generated with an ASR system [https://huggingface.co/reazon-research/reazonspeech-espnet-v2], and WhisperX (2303.00747) was used to obtain token-level timestamps for aligning text tokens with audio tokens, inserting PAD tokens where no text is spoken (see the alignment sketch after the list). This stage was intended to give the model foundational Japanese spoken dialogue capabilities. Mimi's parameters were frozen during adaptation, under the assumption that the codec could already handle Japanese audio; the evaluation results later supported this assumption.
  3. Fine-tuning: The model was fine-tuned on 344 hours of high-quality stereo Japanese spoken dialogue data collected from various corpora (Japanese Callhome [LDC96S37], CSJ [maekawa2003corpus], Travel Agency Dialogue Corpus [inaba2024travel], and two in-house corpora). This stage focused on enabling the model to learn natural full-duplex turn-taking patterns present in real conversations.
  4. Data Augmentation: To further enhance dialogue capabilities, 602 hours of synthetic stereo spoken dialogue were generated with a multi-stream TTS system, mirroring a technique used in the original Moshi training. Text dialogues from existing Japanese text corpora were rewritten by an LLM [https://huggingface.co/google/gemma-2-27b-it] to better reflect spoken language, the TTS system synthesized speech for both speakers, and samples were filtered based on ASR accuracy (see the filtering sketch after the list). Adding this synthetic data to the fine-tuning set produced the J-Moshi-ext model.
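
The vocabulary adaptation in step 1 boils down to swapping the tokenizer and re-initializing the text-specific weights. The following PyTorch sketch illustrates the idea under assumed attribute names; it is not the released training code.

```python
# Sketch of the Japanese vocabulary adaptation (step 1 above); not the authors'
# released code. The attribute names `text_emb` / `text_head` and the
# initialization scheme are assumptions; only the idea of swapping in the
# rinna/japanese-gpt2-medium SentencePiece tokenizer and re-initializing the
# text-token embedding and prediction head comes from the paper.

import torch.nn as nn
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-medium", use_fast=False)
new_vocab_size = len(tokenizer)


def reinit_text_vocab(model: nn.Module, d_model: int, vocab_size: int) -> None:
    """Re-initialize only the text-token embedding table and text prediction head.

    All Transformer weights and the audio-token embeddings are kept from the
    English Moshi checkpoint.
    """
    model.text_emb = nn.Embedding(vocab_size, d_model)            # assumed attribute name
    model.text_head = nn.Linear(d_model, vocab_size, bias=False)  # assumed attribute name
    nn.init.normal_(model.text_emb.weight, std=0.02)
    nn.init.normal_(model.text_head.weight, std=0.02)
```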
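
Step 2's alignment of text tokens to the audio-token grid with PAD filling can be sketched as follows, with the frame rate, PAD id, and helper names assumed for illustration.

```python
# Illustrative sketch of the text-audio alignment used in pre-training (step 2
# above): timestamps from the aligner (e.g. WhisperX) are mapped onto the
# audio-token frame grid, and frames that carry no text are filled with PAD.
# The 12.5 Hz frame rate and PAD id are assumptions, not the paper's settings.

FRAME_RATE_HZ = 12.5  # assumed audio-token frame rate
PAD_ID = 0            # assumed id of the PAD text token


def align_text_to_frames(timed_tokens, n_frames):
    """Place each text token at the frame where it starts; PAD everywhere else.

    `timed_tokens` is a list of (token_id, start_sec) pairs with token-level
    timestamps. Returns a text-token stream of length `n_frames`.
    """
    frames = [PAD_ID] * n_frames
    cursor = 0
    for token_id, start_sec in timed_tokens:
        pos = max(cursor, int(start_sec * FRAME_RATE_HZ))
        if pos >= n_frames:
            break
        frames[pos] = token_id
        cursor = pos + 1  # keep the stream monotonic when tokens are densely packed
    return frames
```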
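
The ASR-accuracy filter in step 4 can be sketched as a simple character-error-rate check; the threshold and the jiwer-based implementation are assumptions, not the paper's reported setup.

```python
# Sketch of the ASR-based filtering of synthetic dialogues (step 4 above): each
# synthesized channel is transcribed again and the sample is kept only if the
# transcript stays close to the source text. The `asr` callable, the use of
# character error rate via jiwer, and the 0.2 threshold are illustrative assumptions.

import jiwer


def keep_synthetic_sample(source_texts, synth_wavs, asr, max_cer=0.2):
    """Return True if every synthesized channel is transcribed accurately enough."""
    for text, wav in zip(source_texts, synth_wavs):
        hypothesis = asr(wav)  # ASR transcript of the synthesized speech
        if jiwer.cer(text, hypothesis) > max_cer:
            return False
    return True
```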

The performance of J-Moshi and J-Moshi-ext was evaluated with a prompted dialogue continuation task, in which the models generated dialogue continuations from a few seconds of prompt audio. Evaluation combined an automatic metric (Perplexity of an external LLM on ASR transcripts of the generated speech) and human evaluation (Naturalness and Meaningfulness on a 5-point scale). Comparisons were made against a similarly trained Japanese dGSLM [nguyen-etal-2023-generative] baseline, as well as Re-synthesis (ground-truth audio re-synthesized by Mimi) and Ground-truth audio.
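
As a rough illustration of the automatic metric, the generated continuations can be transcribed with ASR and scored by an external Japanese language model's perplexity; the LM named below is a placeholder rather than the one used in the paper.

```python
# Hedged sketch of the automatic metric: transcribe each generated continuation
# with ASR, then compute the perplexity an external Japanese causal LM assigns to
# the transcript. The model name is a placeholder assumption.

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def transcript_perplexity(transcript: str, lm_name: str = "rinna/japanese-gpt2-medium") -> float:
    """Perplexity assigned to an ASR transcript by an external causal LM."""
    tok = AutoTokenizer.from_pretrained(lm_name, use_fast=False)
    lm = AutoModelForCausalLM.from_pretrained(lm_name).eval()
    ids = tok(transcript, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return math.exp(loss.item())
```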

Results showed that J-Moshi and J-Moshi-ext significantly outperformed the dGSLM baseline in both the automatic evaluation (lower Perplexity) and the human evaluation (higher Naturalness and Meaningfulness scores), indicating successful adaptation of the Moshi architecture to Japanese. J-Moshi-ext showed a slight improvement in Meaningfulness over J-Moshi, suggesting the synthetic data was beneficial. However, both models still scored considerably lower than Re-synthesis and Ground-truth in the human evaluation, highlighting the need for further improvement in speech generation quality and dialogue coherence. The strong performance of Re-synthesis indicated that Mimi handles Japanese audio reasonably well, supporting the decision to keep its parameters frozen.

A comparative analysis of turn-taking statistics between the Japanese J-Moshi and the English Moshi revealed that J-Moshi generated dialogues with more Inter-Pausal Units (IPUs) and overlaps, consistent with known characteristics of Japanese conversation, which tends to contain more overlaps and backchannels than English [hayashi1988simultaneous, STUBBE1998257]. The authors also noted a higher ratio of PAD tokens in the Japanese training data than in the English data, potentially due to language-specific tokenization characteristics, and suggested that optimizing training objectives for such sparsity could be a direction for future work.
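
For concreteness, IPU and overlap statistics of this kind can be computed from per-channel voice-activity segments roughly as follows; the 200 ms pause threshold for merging segments into IPUs is a common convention rather than a value reported here.

```python
# Illustrative computation of turn-taking statistics from per-channel
# voice-activity segments. The 0.2 s pause threshold is an assumed convention.

def merge_into_ipus(segments, pause_thresh=0.2):
    """Merge (start_sec, end_sec) speech segments of one speaker into IPUs."""
    ipus = []
    for start, end in sorted(segments):
        if ipus and start - ipus[-1][1] < pause_thresh:
            ipus[-1] = (ipus[-1][0], max(ipus[-1][1], end))
        else:
            ipus.append((start, end))
    return ipus


def overlap_duration(ipus_a, ipus_b):
    """Total time during which both speakers hold an IPU simultaneously."""
    total = 0.0
    for a_start, a_end in ipus_a:
        for b_start, b_end in ipus_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total
```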

In conclusion, this paper successfully developed J-Moshi, the first publicly available full-duplex spoken dialogue model for Japanese. The adaptation process involved text vocabulary replacement, multi-stage training on large-scale monophonic and stereo data, and data augmentation with synthetic speech. J-Moshi demonstrated improved performance over a Japanese baseline and captured Japanese-specific conversational traits. The work provides valuable insights and a baseline for future research and development of full-duplex spoken dialogue systems in Japanese and potentially other languages. The training code, fine-tuned models, and speech samples are publicly available at https://nu-dialogue.github.io/j-moshi.
