
Recovering implicit pitch contours from formants in whispered speech (2307.03168v1)

Published 6 Jul 2023 in eess.AS

Abstract: Whispered speech is characterised by a noise-like excitation that results in the absence of a fundamental frequency. Since prosodic phenomena such as intonation are perceived through f0 variation, whispered prosody is relatively difficult to perceive. At the same time, studies have shown that speakers do attempt to produce intonation when whispering and that prosodic variability is transmitted, suggesting that intonation "survives" in whispered formant structure. In this paper, we aim to estimate how formant contours correlate with an "implicit" pitch contour in whisper, using a machine learning model. We propose a two-step method: using a parallel corpus, we first transform the whispered formants into their phonated equivalents using a denoising autoencoder. We then analyse the formant contours to predict phonated pitch contour variation. We observe that our method is effective in establishing a relationship between whispered and phonated formants and in uncovering implicit pitch contours in whisper.
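The two-step pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch on synthetic data: simple least-squares affine maps stand in for the paper's denoising autoencoder and contour-prediction network, and all variable names, formant values, and shapes are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-step method: (1) map whispered formants
# to phonated equivalents, (2) predict the implicit pitch contour from
# the mapped formants. Linear least-squares stands in for the paper's
# neural models; the data below are synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic parallel corpus: per-frame formant vectors (F1-F3, Hz) for
# time-aligned phonated and whispered speech, plus phonated f0 (Hz).
n_frames = 500
phonated_formants = rng.normal([500, 1500, 2500], [80, 120, 150], (n_frames, 3))
whisper_shift = np.array([150, 50, 20])  # whispering tends to raise formants
whispered_formants = (phonated_formants + whisper_shift
                      + rng.normal(0, 30, (n_frames, 3)))
# Toy coupling between F1 and f0, so a contour is recoverable.
f0 = 120 + 0.05 * (phonated_formants[:, 0] - 500) + rng.normal(0, 2, n_frames)

# Step 1: transform whispered formants into phonated equivalents.
X = np.hstack([whispered_formants, np.ones((n_frames, 1))])
W1, *_ = np.linalg.lstsq(X, phonated_formants, rcond=None)
predicted_phonated = X @ W1

# Step 2: predict the pitch contour from the mapped formant contours.
Xp = np.hstack([predicted_phonated, np.ones((n_frames, 1))])
w2, *_ = np.linalg.lstsq(Xp, f0, rcond=None)
predicted_f0 = Xp @ w2

# How well does the predicted contour track the true phonated f0?
corr = np.corrcoef(predicted_f0, f0)[0, 1]
print(f"pitch-contour correlation: {corr:.2f}")
```

On this synthetic corpus the recovered contour correlates strongly with the true f0, mirroring the paper's claim that formant structure carries implicit pitch information; the real method replaces both linear maps with learned neural models trained on the parallel whispered/phonated corpus.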

