
Vocal Tract Area Estimation by Gradient Descent (2307.04702v1)

Published 10 Jul 2023 in cs.SD and eess.AS

Abstract: Articulatory features can provide interpretable and flexible controls for the synthesis of human vocalizations by allowing the user to directly modify parameters like vocal strain or lip position. To make this manipulation through resynthesis possible, we need to estimate the features that result in a desired vocalization directly from audio recordings. In this work, we propose a white-box optimization technique for estimating glottal source parameters and vocal tract shapes from audio recordings of human vowels. The approach is based on inverse filtering and optimizing the frequency response of a wave-guide model of the vocal tract with gradient descent, propagating error gradients through the mapping of articulatory features to the vocal tract area function. We apply this method to the task of matching the sound of the Pink Trombone, an interactive articulatory synthesizer, to a given vocalization. We find that our method accurately recovers control functions for audio generated by the Pink Trombone itself. We then compare our technique against evolutionary optimization algorithms and a neural network trained to predict control parameters from audio. A subjective evaluation finds that our approach outperforms these black-box optimization baselines on the task of reproducing human vocalizations.
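The core idea in the abstract — descending the gradient of a spectral loss through the mapping from a vocal tract area function to the frequency response of a piecewise-cylindrical (Kelly-Lochbaum) tube model — can be illustrated with a small numerical sketch. This is not the paper's implementation: it uses the standard equivalence between the lossless tube model and an all-pole lattice filter, and finite-difference gradients instead of automatic differentiation; all function names and constants here are illustrative assumptions.

```python
import numpy as np

def reflection_coeffs(areas):
    # Kelly-Lochbaum reflection coefficients at the junctions
    # between adjacent cylindrical tube sections.
    a0, a1 = areas[:-1], areas[1:]
    return (a0 - a1) / (a0 + a1)

def allpole_poly(ks):
    # Step-up recursion: reflection coefficients -> A(z), the
    # denominator polynomial of the equivalent all-pole lattice filter.
    a = np.array([1.0])
    for k in ks:
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
    return a

def log_mag_response(areas, n_freqs=128):
    # Log-magnitude frequency response of the lossless-tube model, H = 1/A(z),
    # sampled away from the band edges to avoid singularities.
    a = allpole_poly(reflection_coeffs(areas))
    w = np.linspace(0.01, np.pi - 0.01, n_freqs)
    basis = np.exp(-1j * np.outer(w, np.arange(len(a))))
    return -np.log(np.abs(basis @ a) + 1e-9)

def spectral_loss(areas, target):
    # Squared error between model and target log-magnitude spectra.
    return float(np.mean((log_mag_response(areas) - target) ** 2))

def fit_areas(target, n_sections=8, steps=200, lr=0.05, eps=1e-4):
    # Gradient descent on the area function; finite differences stand in
    # for the error backpropagation used in the paper.
    areas = np.ones(n_sections)
    best, best_loss = areas.copy(), spectral_loss(areas, target)
    for _ in range(steps):
        base = spectral_loss(areas, target)
        grad = np.array([
            (spectral_loss(areas + eps * np.eye(n_sections)[i], target) - base) / eps
            for i in range(n_sections)
        ])
        areas = np.clip(areas - lr * grad, 0.05, None)  # areas must stay positive
        loss = spectral_loss(areas, target)
        if loss < best_loss:
            best, best_loss = areas.copy(), loss
    return best
```

As in the paper's analysis-by-synthesis setup, a target spectrum generated from a known area function can be used as a sanity check: starting from a uniform tube, the fitted areas should drive the spectral loss well below its initial value.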
