A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition (2403.05583v1)

Published 2 Mar 2024 in cs.HC, cs.AI, cs.SD, and eess.AS

Abstract: Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our introduction of LLM Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset for silent speech on an open vocabulary. For vocal EMG recordings, our method improves the state-of-the-art from 23.3% to 3.7% WER. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities in human-computer interaction, demonstrating the potential of cross-modal approaches in noisy and data-limited regimes.

References (48)
  1. Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  2. Baker, J. K. Machine-aided labeling of connected speech. In Working Papers in Speech Recognition XI, Technical Reports, Pittsburgh, PA, 1973. Computer Science Department, Carnegie-Mellon University.
  3. The shattered gradients problem: If resnets are the answer, then what is the question? In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  342–350. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/balduzzi17b.html.
  4. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  5. Subvocalization in singers: Laryngoscopy and surface emg effects when imagining and listening to song and text. Psychology of Music, 49(3):567–580, November 2019. ISSN 1741-3087. doi: 10.1177/0305735619883681. URL http://dx.doi.org/10.1177/0305735619883681.
  6. W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training, 2021.
  7. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10):1097–1107, October 2023. ISSN 2522-5839. doi: 10.1038/s42256-023-00714-5. URL http://dx.doi.org/10.1038/s42256-023-00714-5.
  8. Subvocal speech recognition via close-talk microphone and surface electromyogram using deep learning. In 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), pp.  165–168, 2017. doi: 10.15439/2017F153.
  9. Gaddy, D. Voicing Silent Speech. PhD thesis, EECS Department, University of California, Berkeley, May 2022. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-68.html.
  10. Digital voicing of silent speech. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  5521–5530, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.445. URL https://aclanthology.org/2020.emnlp-main.445.
  11. An improved model for voicing silent speech. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp.  175–181, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.23. URL https://aclanthology.org/2021.acl-short.23.
  12. Dropout as a bayesian approximation: Representing model uncertainty in deep learning, 2016.
  13. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning - ICML ’06, ICML ’06. ACM Press, 2006. doi: 10.1145/1143844.1143891. URL http://dx.doi.org/10.1145/1143844.1143891.
  14. Gururani, S. Validation loss increasing while wer decreases. https://github.com/SeanNaren/deepspeech.pytorch/issues/78, 2017. deepspeech.pytorch GitHub issue #78.
  15. Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014. URL http://arxiv.org/abs/1412.5567.
  16. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. doi: 10.1109/MSP.2012.2205597.
  17. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. doi: 10.1162/neco.1991.3.1.79.
  18. Mixtral of experts, 2024.
  19. Ears: Electromyographical automatic recognition of speech. In International Conference on Bio-inspired Systems and Signal Processing, 2008. URL https://api.semanticscholar.org/CorpusID:5092817.
  20. Towards continuous speech recognition using surface electromyography. In Interspeech, 2006. URL https://api.semanticscholar.org/CorpusID:389078.
  21. Alterego: A personalized wearable silent speech interface. In 23rd International Conference on Intelligent User Interfaces, IUI’18. ACM, March 2018. doi: 10.1145/3172944.3172977. URL http://dx.doi.org/10.1145/3172944.3172977.
  22. Supervised contrastive learning. CoRR, abs/2004.11362, 2020. URL https://arxiv.org/abs/2004.11362.
  23. Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces. Nature Communications, 13(1), October 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-33457-9. URL http://dx.doi.org/10.1038/s41467-022-33457-9.
  24. Sottovoce: An ultrasound imaging-based silent speech interaction using deep neural networks. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19. ACM, May 2019. doi: 10.1145/3290605.3300376. URL http://dx.doi.org/10.1145/3290605.3300376.
  25. Speech recognition via fnirs based brain signals. Frontiers in Neuroscience, 12, October 2018. ISSN 1662-453X. doi: 10.3389/fnins.2018.00695. URL http://dx.doi.org/10.3389/fnins.2018.00695.
  26. A state-of-the-art review of eeg-based imagined speech decoding. Frontiers in Human Neuroscience, 16, 2022. ISSN 1662-5161. doi: 10.3389/fnhum.2022.867281. URL https://www.frontiersin.org/articles/10.3389/fnhum.2022.867281.
  27. Lowerre, B. T. The harpy speech recognition system, 1976.
  28. Session independent non-audible speech recognition using surface electromyography. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005., pp.  331–336, 2005. doi: 10.1109/ASRU.2005.1566521.
  29. A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620(7976):1037–1046, August 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06443-4. URL http://dx.doi.org/10.1038/s41586-023-06443-4.
  30. Research summary of a scheme to ascertain the availability of speech information in the myoelectric signals of neck and head muscles using surface electrodes. Computers in Biology and Medicine, 16(6):399–410, 1986. ISSN 0010-4825. doi: https://doi.org/10.1016/0010-4825(86)90064-8. URL https://www.sciencedirect.com/science/article/pii/0010482586900648.
  31. Non-audible murmur (nam) recognition. IEICE TRANSACTIONS on Information and Systems, 89(1):1–8, 2006.
  32. Can we decode phonetic features in inner speech using surface electromyography? PLOS ONE, 15(5):e0233282, May 2020. ISSN 1932-6203. doi: 10.1371/journal.pone.0233282. URL http://dx.doi.org/10.1371/journal.pone.0233282.
  33. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, April 2015. doi: 10.1109/icassp.2015.7178964. URL http://dx.doi.org/10.1109/ICASSP.2015.7178964.
  34. Acceptability of speech and silent speech input methods in private and public. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21. ACM, May 2021. doi: 10.1145/3411764.3445430. URL http://dx.doi.org/10.1145/3411764.3445430.
  35. psydok. How to interpret the training result: High loss, low wer? https://github.com/NVIDIA/NeMo/discussions/4423, 2022. NVIDIA NeMo GitHub issue #4423.
  36. Robust speech recognition via large-scale weak supervision, 2022.
  37. Self-learning and active-learning for electromyography-to-speech conversion. In 15th ITG Conference on Speech Communication, 10 2023.
  38. Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960):360–368, May 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06031-6. URL http://dx.doi.org/10.1038/s41586-023-06031-6.
  39. Modeling coarticulation in emg-based continuous speech recognition. Speech Communication, 52(4):341–353, 2010. ISSN 0167-6393. doi: https://doi.org/10.1016/j.specom.2009.12.002. URL https://www.sciencedirect.com/science/article/pii/S0167639309001770. Silent Speech Interfaces.
  40. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers, 2023.
  41. Learning audio-visual speech representation by masked multimodal cluster prediction, 2022.
  42. A speech prosthesis employing a speech synthesizer-vowel discrimination from perioral muscle activities and vowel production. IEEE Transactions on Biomedical Engineering, BME-32(7):485–490, 1985. doi: 10.1109/TBME.1985.325564.
  43. Earssr: Silent speech recognition via earphones. IEEE Transactions on Mobile Computing, pp.  1–17, 2024. doi: 10.1109/TMC.2024.3356719.
  44. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, 26(5):858–866, May 2023. ISSN 1546-1726. doi: 10.1038/s41593-023-01304-9. URL http://dx.doi.org/10.1038/s41593-023-01304-9.
  45. Llama 2: Open foundation and fine-tuned chat models, 2023.
  46. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.
  47. Silent speech command word recognition using stepped frequency continuous wave radar. Scientific Reports, 12(1), March 2022. ISSN 2045-2322. doi: 10.1038/s41598-022-07842-9. URL http://dx.doi.org/10.1038/s41598-022-07842-9.
  48. A high-performance speech neuroprosthesis. Nature, 620(7976):1031–1036, August 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06377-x. URL http://dx.doi.org/10.1038/s41586-023-06377-x.

Summary

  • The paper introduces MONA, a novel system aligning audio and EMG data using cross-contrast and supervised temporal contrast losses with dynamic time warping.
  • It implements LLM-integrated scoring (LISA) to refine candidate sentences, significantly boosting recognition accuracy.
  • Empirical results show a word error rate (WER) of 12.2% for silent speech and an improvement from 23.3% to 3.7% WER for vocalized EMG recordings.

Enhanced Silent Speech Recognition through Cross-Modal Learning and LLM-Enhanced Scoring

Introduction

Silent Speech Interfaces (SSIs) hold transformative potential for communication technologies, particularly for individuals with speech impairments or in situations where vocal communication is not possible. Despite this promise, development has been impeded by significant challenges, notably the absence of an audible signal to decode and the limited datasets available for training. The paper presents an approach that combines cross-modal learning with the integration of LLMs to address these challenges, demonstrating substantial improvements in silent speech recognition accuracy.

Background

The development of SSIs has seen various technological approaches, each with its own advantages and limitations. Among these, lip reading and surface electromyography (EMG) have emerged as promising techniques for silent speech decoding. Unlike acoustic methods, EMG captures the muscle activity associated with speech articulation, a signal that remains available even when no sound is produced.

Prior research in Automatic Speech Recognition (ASR) has achieved considerable success, largely thanks to advanced algorithms, neural network architectures, and expansive training datasets. Transferring these advances to SSIs, however, has been constrained by the scarcity of silent speech data and the absence of an acoustic target signal.

Proposed Approach

The paper introduces Multimodal Orofacial Neural Audio (MONA), a cross-modal training system, together with LLM Integrated Scoring Adjustment (LISA), an LLM-based rescoring method. The approach improves silent speech recognition accuracy through two components:

  1. Cross-Modal Learning: MONA employs two loss functions, cross-contrast (crossCon) and supervised temporal contrast (supTcon), to align latent representations of the audio and EMG modalities within a shared latent space. The alignment applies dynamic time warping in conjunction with these contrastive losses, enabling effective training on both synchronized EMG-audio recordings and independent audio-only datasets such as LibriSpeech (see the first sketch after this list).
  2. LLM-Integrated Scoring Adjustment (LISA): beyond the neural network's own predictions, LISA uses an LLM to refine and choose among candidate sentences. This post-processing step selects the most linguistically probable and coherent sentence from the top predictions, significantly improving recognition accuracy (a second sketch follows the list).
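
To make the cross-modal alignment concrete, below is a minimal PyTorch sketch of an InfoNCE-style cross-contrast loss over frame-aligned EMG and audio latents. It assumes the two latent sequences have already been brought to a common length (e.g. by dynamic time warping); the function name, temperature value, and symmetric cross-entropy formulation are illustrative assumptions rather than the authors' implementation. supTcon could be sketched analogously, with positives defined by shared supervision (such as matching phoneme labels) instead of temporal correspondence alone.

```python
import torch
import torch.nn.functional as F

def cross_contrast_loss(emg_latents, audio_latents, temperature=0.1):
    """InfoNCE-style loss pulling matching EMG/audio frames together.

    emg_latents, audio_latents: (T, D) tensors assumed to be frame-aligned
    (e.g. after dynamic time warping); frame t of each modality forms a
    positive pair, all other frames act as negatives.
    """
    emg = F.normalize(emg_latents, dim=-1)
    audio = F.normalize(audio_latents, dim=-1)
    logits = emg @ audio.T / temperature              # (T, T) cosine similarities
    targets = torch.arange(emg.shape[0], device=emg.device)
    # Symmetric retrieval objective: EMG -> audio and audio -> EMG.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Toy usage with random latents standing in for encoder outputs.
loss = cross_contrast_loss(torch.randn(50, 256), torch.randn(50, 256))
```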
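The LISA step can be sketched as a generic rescoring call: the recognizer's top hypotheses are handed to an LLM, which returns the most plausible one. The `llm_complete` callable and the prompt wording below are hypothetical stand-ins; the paper's actual prompting and scoring details may differ.

```python
from typing import Callable, List

def lisa_rescore(candidates: List[str],
                 llm_complete: Callable[[str], str]) -> str:
    """Ask an LLM to pick the most plausible transcript from top-k hypotheses."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    prompt = (
        "Below are candidate transcriptions of the same utterance from a "
        "silent-speech recognizer. Reply with only the single most "
        "linguistically plausible transcription.\n" + numbered
    )
    return llm_complete(prompt).strip()

# Toy usage with a dummy 'LLM' that always returns its preferred candidate.
best = lisa_rescore(
    ["i scream for ice cream", "eye scream four ice cream"],
    llm_complete=lambda prompt: "i scream for ice cream",
)
```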

Empirical Evaluation

The approach was evaluated on several benchmark datasets, notably the Gaddy (2020) silent speech dataset. The results highlight the effectiveness of cross-modal learning and LISA:

  • A reduction in word error rate (WER) for silent speech on an open vocabulary from the prior state of the art of 28.8% to 12.2%.
  • For vocalized EMG recordings, an improvement from 23.3% WER to 3.7% WER.
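
For reference, the word error rates quoted above follow the standard edit-distance definition,

$$\mathrm{WER} = \frac{S + D + I}{N},$$

where S, D, and I are the numbers of substituted, deleted, and inserted words relative to the reference transcript and N is the number of words in the reference.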

Implications and Future Directions

The findings from this paper represent a significant step forward in the realization of practical and accurate silent speech interfaces. By achieving a WER below the 15% threshold, this work signals a pivotal shift towards the broader applicability of SSIs in real-world scenarios. It not only underscores the potential of EMG as a viable modality for silent speech recognition but also illustrates the profound impact of integrating large-scale LLMs in refining speech recognition accuracy.

Looking ahead, the methodologies introduced in this paper have the potential to be extended to a wider range of speech modalities, paving the way for more robust and versatile silent speech interfaces. Furthermore, the use of cross-modal learning strategies in other data-limited domains suggests a promising avenue for future research in machine learning and human-computer interaction.

Conclusion

This paper showcases a significant leap in silent speech recognition technology, driven by innovative cross-modal learning techniques and the strategic integration of LLM-enhanced scoring. As research in this field progresses, the envisioned future where SSIs offer seamless and accurate communication for all individuals moves ever closer to reality.