J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling (2407.15828v1)
Abstract: Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure high-quality speech generation, the data must be spontaneous, like in-the-wild data, and acoustically clean, with noise removed. Despite this critical need, no open-source corpus meeting all of these criteria has been available. This study addresses the gap by constructing and publicly releasing a large-scale spoken dialogue corpus, named Japanese Corpus for Human-AI Talks (J-CHAT). Furthermore, this paper presents a language-independent method for corpus construction and reports experiments on dialogue generation using SLMs trained on J-CHAT. Experimental results indicate that the data our method collects from multiple domains improve the naturalness and meaningfulness of generated dialogue.
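SLMs in the generative spoken language modeling line are trained not on text but on discrete "pseudo-text" units clustered from self-supervised speech representations such as HuBERT, and in-the-wild dialogue audio must first be separated into speaker turns. The sketch below illustrates that general recipe in Python; the checkpoints (a pyannote diarization pipeline, a base HuBERT model) and the 100-cluster k-means codebook are illustrative assumptions, not the exact configuration behind J-CHAT.

```python
# Minimal sketch: turn an in-the-wild recording into diarized speaker turns
# and a discrete HuBERT-unit stream of the kind SLMs are trained on.
# All checkpoints below are illustrative assumptions, not the J-CHAT setup.
import torch
import torchaudio
from pyannote.audio import Pipeline
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

# 1) Speaker diarization: who speaks when (needs a Hugging Face token).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_...")
for turn, _, speaker in diarizer("episode.wav").itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

# 2) HuBERT frame features for the (downmixed, 16 kHz) recording.
wav, sr = torchaudio.load("episode.wav")
wav = torchaudio.functional.resample(wav.mean(0), sr, 16_000)
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
with torch.no_grad():
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    feats = hubert(inputs.input_values).last_hidden_state.squeeze(0)  # (T, 768)

# 3) Quantize frames into discrete units ("pseudo-text"). A real pipeline
# fits the k-means codebook once on a large corpus, then reuses it.
units = KMeans(n_clusters=100, n_init=10).fit_predict(feats.numpy())
print(units[:20])  # token stream a spoken LM would consume
```

In practice each diarized turn would typically also be denoised or source-separated before unit extraction, and the resulting unit streams resynthesized to speech with a unit-based vocoder after generation.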
- Wataru Nakata
- Kentaro Seki
- Hitomi Yanaka
- Yuki Saito
- Shinnosuke Takamichi
- Hiroshi Saruwatari