Speaker dependent articulatory-to-acoustic mapping using real-time MRI of the vocal tract (2008.00889v1)

Published 3 Aug 2020 in eess.AS, cs.SD, and eess.IV

Abstract: Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques (e.g. ultrasound tongue imaging, lip video). Real-time MRI (rtMRI) of the vocal tract has not been used before for this purpose. The advantage of MRI is that it has a high 'relative' spatial resolution: it can capture not only lingual, labial and jaw motion, but also the velum and the pharyngeal region, which is typically not possible with other techniques. In the current paper, we train various DNNs (fully connected, convolutional and recurrent neural networks) for articulatory-to-speech conversion, using rtMRI as input, in a speaker-specific way. We use two male and two female speakers of the USC-TIMIT articulatory database, each of them uttering 460 sentences. We evaluate the results with objective (Normalized MSE and MCD) and subjective (perceptual test) measures and show that CNN-LSTM networks, which take multiple images as input, are preferred and achieve MCD scores between 2.8 and 4.5 dB. In the experiments, we find that the predictions for speaker 'm1' are significantly weaker than those for the other speakers. We show that this is caused by the fact that 74% of the recordings of speaker 'm1' are out of sync.
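
For readers unfamiliar with the two objective measures named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code. The function names, the `skip_c0` option (excluding the 0th, energy-related coefficient), and the choice to normalize the MSE by the variance of the reference are assumptions made for illustration; the abstract does not specify these details. The MCD formula itself is the standard one: (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames.

```python
import math


def mel_cepstral_distortion(ref, pred, skip_c0=True):
    """Mean mel-cepstral distortion (dB) over time-aligned frames.

    ref, pred: equal-length lists of frames, each frame a list of
    mel-cepstral coefficients. skip_c0 is an illustrative assumption:
    the 0th coefficient is often left out of MCD.
    """
    if len(ref) != len(pred) or not ref:
        raise ValueError("ref and pred must be non-empty and equally long")
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for r, p in zip(ref, pred):
        squared = sum((a - b) ** 2 for a, b in zip(r[start:], p[start:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref)


def normalized_mse(ref, pred):
    """MSE divided by the variance of the reference values.

    This is one common normalization, assumed here; a value of 1.0
    means the prediction is no better than the reference mean.
    """
    flat_ref = [v for frame in ref for v in frame]
    flat_pred = [v for frame in pred for v in frame]
    if len(flat_ref) != len(flat_pred) or not flat_ref:
        raise ValueError("ref and pred must be non-empty and equally sized")
    n = len(flat_ref)
    mean_ref = sum(flat_ref) / n
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_pred)) / n
    var = sum((a - mean_ref) ** 2 for a in flat_ref) / n
    return mse / var
```

Both functions assume the predicted and reference frame sequences are already time-aligned, which matters here given the abstract's finding that out-of-sync recordings degrade the results for one speaker.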

