Style Modeling for Multi-Speaker Articulation-to-Speech
Abstract: In this paper, we propose a neural articulation-to-speech (ATS) framework that synthesizes high-quality speech from articulatory signals in a multi-speaker setting. Most conventional ATS approaches focus only on modeling the contextual information of speech from a single speaker's articulatory features. To explicitly represent each speaker's speaking style in addition to the contextual information, our proposed model estimates style embeddings guided by essential speech style attributes such as pitch and energy. We adopt convolutional layers and transformer-based attention layers to exploit both local and global information in the articulatory signals, which are measured by electromagnetic articulography (EMA). Our model significantly improves the quality of synthesized speech over the baseline on the Haskins dataset, in terms of both objective and subjective measurements.
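The pairing of convolutional layers (local information) with attention layers (global information) can be illustrated with a minimal NumPy sketch. This is not the authors' model: the sequence length, the 12 EMA channels, the moving-average kernel, and the weight-free single-head attention are all assumptions chosen only to show how the two operations differ in the context they see.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D filtering over time with 'same' output length (edge padding).

    x: (T, D) feature sequence; kernel: (K,) with K odd, shared by all channels.
    Like deep-learning "convolutions", the kernel is applied without flipping.
    """
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    # Each output frame mixes only its K neighbouring frames (local context).
    return np.stack(
        [(xp[t:t + k] * kernel[:, None]).sum(axis=0) for t in range(x.shape[0])]
    )

def self_attention(x):
    """Single-head scaled dot-product self-attention with no learned weights."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each output frame is a weighted sum over ALL frames (global context).
    return weights @ x, weights

rng = np.random.default_rng(0)
ema = rng.standard_normal((50, 12))        # hypothetical: 50 frames, 12 EMA channels
local = conv1d_same(ema, np.ones(5) / 5)   # assumed 5-frame moving-average kernel
context, attn = self_attention(local)
```

In a trained model the kernel and the attention projections would be learned; here they are fixed so that the sketch stays self-contained.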