A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition (2403.05583v1)
Abstract: Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our LLM Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% on the Gaddy (2020) open-vocabulary silent speech benchmark. For vocal EMG recordings, our method improves the state-of-the-art WER from 23.3% to 3.7%. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work is the first instance of noninvasive, open-vocabulary silent speech recognition clearing the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities for human-computer interaction, demonstrating the potential of cross-modal approaches in noisy, data-limited regimes.
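The abstract names two contrastive objectives (crossCon and supTcon) that pull EMG and audio latents into a shared representation; the exact formulation is not given here, but the general shape of such cross-modal alignment is an InfoNCE-style loss in which each EMG latent treats the temporally aligned audio latent as its positive and the remaining audio latents as negatives. A minimal pure-Python sketch under that assumption (the function names, cosine similarity, and `temperature` value are illustrative, not the authors' code):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    # Fall back to 1.0 for an all-zero vector to avoid division by zero.
    return math.sqrt(dot(u, u)) or 1.0

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def cross_contrastive_loss(emg_latents, audio_latents, temperature=0.1):
    """InfoNCE-style cross-modal loss: the EMG latent at step i should
    match the audio latent at step i (positive) against all other
    audio latents in the sequence (negatives)."""
    loss = 0.0
    n = len(emg_latents)
    for i, e in enumerate(emg_latents):
        logits = [cosine(e, a) / temperature for a in audio_latents]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        loss += -(logits[i] - log_denom)  # negative log-softmax of the positive
    return loss / n
```

A real implementation would run over batched, high-dimensional latents on a GPU; this sketch only illustrates the mechanism, i.e. why temporally aligned EMG/audio pairs drive the loss toward zero while misaligned pairs do not.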
- Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
- Baker, J. K. Machine-aided labeling of connected speech. In Working Papers in Speech Recognition XI, Technical Reports, Pittsburgh, PA, 1973. Computer Science Department, Carnegie-Mellon University.
- The shattered gradients problem: If resnets are the answer, then what is the question? In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 342–350. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/balduzzi17b.html.
- Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Subvocalization in singers: Laryngoscopy and surface EMG effects when imagining and listening to song and text. Psychology of Music, 49(3):567–580, November 2019. ISSN 1741-3087. doi: 10.1177/0305735619883681. URL http://dx.doi.org/10.1177/0305735619883681.
- W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training, 2021.
- Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10):1097–1107, October 2023. ISSN 2522-5839. doi: 10.1038/s42256-023-00714-5. URL http://dx.doi.org/10.1038/s42256-023-00714-5.
- Subvocal speech recognition via close-talk microphone and surface electromyogram using deep learning. In 2017 Federated Conference on Computer Science and Information Systems (FedCSIS), pp. 165–168, 2017. doi: 10.15439/2017F153.
- Gaddy, D. Voicing Silent Speech. PhD thesis, EECS Department, University of California, Berkeley, May 2022. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-68.html.
- Digital voicing of silent speech. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5521–5530, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.445. URL https://aclanthology.org/2020.emnlp-main.445.
- An improved model for voicing silent speech. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 175–181, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.23. URL https://aclanthology.org/2021.acl-short.23.
- Dropout as a bayesian approximation: Representing model uncertainty in deep learning, 2016.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning - ICML ’06, ICML ’06. ACM Press, 2006. doi: 10.1145/1143844.1143891. URL http://dx.doi.org/10.1145/1143844.1143891.
- Gururani, S. Validation loss increasing while wer decreases. https://github.com/SeanNaren/deepspeech.pytorch/issues/78, 2017. deepspeech.pytorch GitHub issue #78.
- Deep speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014. URL http://arxiv.org/abs/1412.5567.
- Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. doi: 10.1109/MSP.2012.2205597.
- Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991. doi: 10.1162/neco.1991.3.1.79.
- Mixtral of experts, 2024.
- EARS: Electromyographical automatic recognition of speech. In International Conference on Bio-inspired Systems and Signal Processing, 2008. URL https://api.semanticscholar.org/CorpusID:5092817.
- Towards continuous speech recognition using surface electromyography. In Interspeech, 2006. URL https://api.semanticscholar.org/CorpusID:389078.
- AlterEgo: A personalized wearable silent speech interface. In 23rd International Conference on Intelligent User Interfaces, IUI’18. ACM, March 2018. doi: 10.1145/3172944.3172977. URL http://dx.doi.org/10.1145/3172944.3172977.
- Supervised contrastive learning. CoRR, abs/2004.11362, 2020. URL https://arxiv.org/abs/2004.11362.
- Ultrathin crystalline-silicon-based strain gauges with deep learning algorithms for silent speech interfaces. Nature Communications, 13(1), October 2022. ISSN 2041-1723. doi: 10.1038/s41467-022-33457-9. URL http://dx.doi.org/10.1038/s41467-022-33457-9.
- SottoVoce: An ultrasound imaging-based silent speech interaction using deep neural networks. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19. ACM, May 2019. doi: 10.1145/3290605.3300376. URL http://dx.doi.org/10.1145/3290605.3300376.
- Speech recognition via fNIRS based brain signals. Frontiers in Neuroscience, 12, October 2018. ISSN 1662-453X. doi: 10.3389/fnins.2018.00695. URL http://dx.doi.org/10.3389/fnins.2018.00695.
- A state-of-the-art review of EEG-based imagined speech decoding. Frontiers in Human Neuroscience, 16, 2022. ISSN 1662-5161. doi: 10.3389/fnhum.2022.867281. URL https://www.frontiersin.org/articles/10.3389/fnhum.2022.867281.
- Lowerre, B. T. The Harpy speech recognition system, 1976.
- Session independent non-audible speech recognition using surface electromyography. In IEEE Workshop on Automatic Speech Recognition and Understanding, 2005., pp. 331–336, 2005. doi: 10.1109/ASRU.2005.1566521.
- A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620(7976):1037–1046, August 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06443-4. URL http://dx.doi.org/10.1038/s41586-023-06443-4.
- Research summary of a scheme to ascertain the availability of speech information in the myoelectric signals of neck and head muscles using surface electrodes. Computers in Biology and Medicine, 16(6):399–410, 1986. ISSN 0010-4825. doi: https://doi.org/10.1016/0010-4825(86)90064-8. URL https://www.sciencedirect.com/science/article/pii/0010482586900648.
- Non-audible murmur (NAM) recognition. IEICE Transactions on Information and Systems, 89(1):1–8, 2006.
- Can we decode phonetic features in inner speech using surface electromyography? PLOS ONE, 15(5):e0233282, May 2020. ISSN 1932-6203. doi: 10.1371/journal.pone.0233282. URL http://dx.doi.org/10.1371/journal.pone.0233282.
- LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, April 2015. doi: 10.1109/icassp.2015.7178964. URL http://dx.doi.org/10.1109/ICASSP.2015.7178964.
- Acceptability of speech and silent speech input methods in private and public. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, CHI ’21. ACM, May 2021. doi: 10.1145/3411764.3445430. URL http://dx.doi.org/10.1145/3411764.3445430.
- psydok. How to interpret the training result: High loss, low wer? https://github.com/NVIDIA/NeMo/discussions/4423, 2022. NVIDIA NeMo GitHub discussion #4423.
- Robust speech recognition via large-scale weak supervision, 2022.
- Self-learning and active-learning for electromyography-to-speech conversion. In 15th ITG Conference on Speech Communication, October 2023.
- Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960):360–368, May 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06031-6. URL http://dx.doi.org/10.1038/s41586-023-06031-6.
- Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication, 52(4):341–353, 2010. ISSN 0167-6393. doi: https://doi.org/10.1016/j.specom.2009.12.002. URL https://www.sciencedirect.com/science/article/pii/S0167639309001770. Silent Speech Interfaces.
- NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers, 2023.
- Learning audio-visual speech representation by masked multimodal cluster prediction, 2022.
- A speech prosthesis employing a speech synthesizer-vowel discrimination from perioral muscle activities and vowel production. IEEE Transactions on Biomedical Engineering, BME-32(7):485–490, 1985. doi: 10.1109/TBME.1985.325564.
- EarSSR: Silent speech recognition via earphones. IEEE Transactions on Mobile Computing, pp. 1–17, 2024. doi: 10.1109/TMC.2024.3356719.
- Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, 26(5):858–866, May 2023. ISSN 1546-1726. doi: 10.1038/s41593-023-01304-9. URL http://dx.doi.org/10.1038/s41593-023-01304-9.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.
- Silent speech command word recognition using stepped frequency continuous wave radar. Scientific Reports, 12(1), March 2022. ISSN 2045-2322. doi: 10.1038/s41598-022-07842-9. URL http://dx.doi.org/10.1038/s41598-022-07842-9.
- A high-performance speech neuroprosthesis. Nature, 620(7976):1031–1036, August 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06377-x. URL http://dx.doi.org/10.1038/s41586-023-06377-x.