Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion (2405.08021v1)
Abstract: Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals recorded during silent articulation. ETS models usually consist of an EMG encoder, which converts EMG signals into acoustic speech features, and a vocoder, which then synthesises the speech signal. Because the available training data are scarce and EMG signals are noisy, the synthesised speech often lacks naturalness. In this work, we propose Diff-ETS, an ETS model that uses a score-based diffusion probabilistic model to enhance the naturalness of the synthesised speech. The diffusion model refines the acoustic features predicted by the EMG encoder. In our experiments, we evaluated two training schemes: fine-tuning the diffusion model on the predictions of a pre-trained EMG encoder, and training both models end-to-end. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicate that Diff-ETS significantly improves speech naturalness over the baseline.
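The pipeline the abstract outlines (EMG encoder → coarse acoustic features → diffusion refinement → vocoder) can be illustrated with a minimal sketch. The sketch below uses a DDPM-style noise-prediction objective as a stand-in for the paper's score-based formulation; every name, shape, and hyperparameter here (EMGEncoder, DiffusionRefiner, n_mels=80, 100 diffusion steps) is an illustrative assumption, not the authors' implementation, and the vocoder stage (e.g., HiFi-GAN) is omitted.

```python
# Minimal sketch of a Diff-ETS-style pipeline: an EMG encoder predicts
# coarse Mel features, and a conditional diffusion model refines them
# before vocoding. DDPM-style noise prediction is used as a stand-in for
# the paper's score-based formulation; all names/shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_STEPS = 100  # assumed number of diffusion steps

class EMGEncoder(nn.Module):
    """Maps multi-channel EMG frames to coarse log-Mel features."""
    def __init__(self, emg_channels=8, hidden=256, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(emg_channels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, emg):          # emg: (batch, channels, frames)
        return self.net(emg)         # (batch, n_mels, frames)

class DiffusionRefiner(nn.Module):
    """Predicts the noise in a corrupted Mel spectrogram, conditioned on
    the encoder output and the diffusion timestep."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.t_embed = nn.Embedding(N_STEPS, hidden)
        self.inp = nn.Conv1d(2 * n_mels, hidden, kernel_size=3, padding=1)
        self.out = nn.Conv1d(hidden, n_mels, kernel_size=3, padding=1)

    def forward(self, noisy_mel, cond_mel, t):
        # Condition by concatenating the encoder prediction channel-wise.
        h = self.inp(torch.cat([noisy_mel, cond_mel], dim=1))
        h = F.relu(h + self.t_embed(t)[:, :, None])  # broadcast over time
        return self.out(h)           # same shape as noisy_mel

def diffusion_loss(refiner, clean_mel, cond_mel):
    """Standard DDPM objective: corrupt the target Mels at a random
    timestep and regress the injected noise."""
    betas = torch.linspace(1e-4, 0.02, N_STEPS)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, N_STEPS, (clean_mel.size(0),))
    noise = torch.randn_like(clean_mel)
    a = alphas_bar[t].sqrt()[:, None, None]
    s = (1.0 - alphas_bar[t]).sqrt()[:, None, None]
    noisy_mel = a * clean_mel + s * noise
    return F.mse_loss(refiner(noisy_mel, cond_mel, t), noise)

# Toy usage with random tensors standing in for real EMG/Mel data.
encoder, refiner = EMGEncoder(), DiffusionRefiner()
emg = torch.randn(4, 8, 200)          # (batch, EMG channels, frames)
target_mel = torch.randn(4, 80, 200)  # ground-truth Mels (placeholder)
cond_mel = encoder(emg)
loss = diffusion_loss(refiner, target_mel, cond_mel)
loss.backward()                        # gradients reach both models
```

In this sketch, gradients flow into the encoder through cond_mel, mirroring the end-to-end setting the abstract mentions; calling cond_mel.detach() before the loss would instead mimic fine-tuning the diffusion model on a frozen, pre-trained encoder's predictions.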