J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling (2407.15828v1)

Published 22 Jul 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure high-quality speech generation, the data must be spontaneous, like in-the-wild data, and must be acoustically clean, with noise removed. Despite this critical need, no open-source corpus meeting all these criteria has been available. This study addresses the gap by constructing and releasing a large-scale spoken dialogue corpus, named Japanese Corpus for Human-AI Talks (J-CHAT), which is publicly accessible. Furthermore, this paper presents a language-independent method for corpus construction and describes experiments on dialogue generation using SLMs trained on J-CHAT. Experimental results indicate that the data collected from multiple domains by our method improve the naturalness and meaningfulness of generated dialogue.
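The abstract's requirement that in-the-wild recordings be "acoustically clean with noise removed" implies some form of quality filtering during corpus construction. As a rough illustration only, the Python sketch below keeps crawled clips whose crude signal-to-noise estimate exceeds a threshold; the energy-percentile SNR heuristic, the 15 dB cutoff, and the file layout are all assumptions for illustration, not the authors' actual pipeline, which the paper itself describes.

```python
# Hypothetical sketch of an acoustic-cleanliness filter for crawled clips.
# NOT the J-CHAT pipeline: the threshold and the percentile-based SNR
# heuristic are illustrative assumptions.
from pathlib import Path

import numpy as np
import soundfile as sf  # pip install soundfile

SNR_THRESHOLD_DB = 15.0  # assumed cutoff, chosen only for illustration


def estimate_snr_db(wav: np.ndarray, frame: int = 2048) -> float:
    """Crude SNR estimate: treat the quietest frames as the noise floor."""
    n = len(wav) // frame
    if n == 0:
        return float("-inf")
    energies = np.array(
        [np.mean(wav[i * frame : (i + 1) * frame] ** 2) for i in range(n)]
    )
    noise = np.percentile(energies, 10) + 1e-12   # quietest 10% ~ noise
    signal = np.percentile(energies, 90) + 1e-12  # loudest 10% ~ speech
    return 10.0 * np.log10(signal / noise)


def filter_clips(in_dir: str) -> list[Path]:
    """Return paths of clips in in_dir that pass the SNR threshold."""
    kept = []
    for path in sorted(Path(in_dir).glob("*.wav")):
        wav, _sr = sf.read(path)
        if wav.ndim > 1:  # downmix stereo to mono
            wav = wav.mean(axis=1)
        if estimate_snr_db(wav) >= SNR_THRESHOLD_DB:
            kept.append(path)
    return kept


if __name__ == "__main__":
    kept = filter_clips("crawled_clips")  # assumed directory name
    print(f"kept {len(kept)} clips")
```

In practice, a production pipeline would replace this heuristic with learned denoising and speaker diarization before segmenting dialogue turns; the sketch only conveys the filtering idea the abstract gestures at.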

Authors (6)
  1. Wataru Nakata (9 papers)
  2. Kentaro Seki (9 papers)
  3. Hitomi Yanaka (30 papers)
  4. Yuki Saito (47 papers)
  5. Shinnosuke Takamichi (70 papers)
  6. Hiroshi Saruwatari (100 papers)
