
Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion

Published 11 May 2024 in cs.SD and eess.AS (arXiv:2405.08021v1)

Abstract: Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals recorded during silent articulation. ETS models usually consist of an EMG encoder, which converts EMG signals to acoustic speech features, and a vocoder, which then synthesises the speech signals. Due to the limited amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated both fine-tuning the diffusion model on the predictions of a pre-trained EMG encoder and training the two models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated that the proposed Diff-ETS significantly improved speech naturalness over the baseline.
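The pipeline described in the abstract (EMG encoder → diffusion-based feature refinement → vocoder) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `emg_encoder`, `score_model`, and `vocoder` are hypothetical stubs standing in for trained neural networks, and the refinement loop is a toy annealed Langevin-style update rather than the paper's actual sampler.

```python
import math
import random

def emg_encoder(emg, n_features=4):
    # Stub: map raw EMG samples to coarse acoustic features.
    # In Diff-ETS this would be a trained neural EMG encoder.
    return [sum(emg) / len(emg)] * n_features

def score_model(features, t):
    # Stub score network: estimates the gradient of the log-density
    # of clean acoustic features. Here a toy score pulling toward zero.
    return [-f for f in features]

def refine_features(features, steps=10, step_size=0.05):
    # Toy score-based refinement: nudge the noisy encoder output along
    # the estimated score, with Langevin-style injected noise.
    x = list(features)
    for t in range(steps, 0, -1):
        s = score_model(x, t)
        x = [xi + step_size * si + math.sqrt(2 * step_size) * random.gauss(0.0, 1.0)
             for xi, si in zip(x, s)]
    return x

def vocoder(features):
    # Stub: synthesise a waveform from acoustic features.
    return [math.tanh(f) for f in features]

# Usage: refine the encoder's predictions before vocoding.
speech = vocoder(refine_features(emg_encoder([0.1, 0.2, 0.3, 0.4])))
```

The key design point the abstract makes is that the diffusion model sits between the encoder and the vocoder, cleaning up the predicted acoustic features; it can either be fine-tuned on a frozen pre-trained encoder's outputs or trained jointly with the encoder end-to-end.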

