Silent Speech Interfaces: Biosignal Decoding
- Silent Speech Interfaces are systems that decode speech content from non-acoustic biosignals such as EMG and ultrasound, enabling communication for those unable to vocalize or in privacy-sensitive settings.
- They employ diverse sensor modalities—including EMG, ultrasound imaging, lip video, accelerometers, and wireless sensing—with specific preprocessing and feature extraction techniques tailored to each signal type.
- Advanced machine learning models such as CNNs, RNNs, and Transformers enable open-vocabulary recognition and speaker adaptation in SSIs, with the strongest EMG-based systems reaching word error rates near 12%.
Silent Speech Interfaces (SSI) are systems that reconstruct speech content, as text or as an acoustic signal, directly from biosignals produced during articulation, bypassing the need for an acoustic speech signal. SSIs decode the intended message from non-acoustic data such as ultrasound imaging of tongue movement, electromyographic activity, or other articulatory, muscular, or neural signals, thereby enabling communication for people who cannot vocalize, or for use in noise-sensitive or privacy-sensitive environments. Research in this field addresses both direct synthesis of speech waveforms and recognition or transcription of silent articulation, encompassing a diverse range of signal modalities, algorithms, and application domains (Gonzalez-Lopez et al., 2020).
1. SSI Modalities and Biosignal Acquisition
Silent Speech Interfaces span several biosignal acquisition modalities, each with distinct technical characteristics and suitability for different user populations.
- Electromyography (EMG): Surface EMG employs non-invasive electrodes to record myoelectric potentials from speech-related muscles (e.g., orbicularis oris, masseter, suprahyoid, laryngeal muscles). Multi-channel systems support up to 14 differential channels in wearable neckbands or textile arrays embedded in headphones or chokers (Tang et al., 11 Apr 2025, Meier et al., 26 Sep 2025, Tang et al., 2023). High-density, dry textile electrodes now support robust, day-long operation (Tang et al., 11 Apr 2025). Signal quality is challenged by inter-session impedance variability and motion artifacts.
- Articulator Motion Capture:
- Ultrasound Tongue Imaging (UTI): Portable, high-frame-rate B-mode ultrasound probes capture dynamic midsagittal tongue profiles for articulatory-to-acoustic mapping (Shandiz et al., 2021, Tóth et al., 2021, Tóth et al., 2023).
- Lip Video: Synchronized frontal or profile camera streams can be used for visual speech recognition or integrated with UTI for multi-modal speech synthesis (Zheng et al., 2023, Petridis et al., 2018).
- Accelerometers: Arrays of six-axis MEMS sensors placed on the jaw, lips, and cheeks record low-frequency kinematic facial motion, supporting sentence-level recognition (Xie et al., 25 Feb 2025).
- Strain Sensors: Graphene-coated textile chokers with microcracked films provide ultralow-noise, high-sensitivity detection of thyroid and submandibular movement, supporting single-channel silent speech decoding at high energy efficiency (Tang et al., 2023).
- Acoustic and Wireless Sensing:
- Inaudible acoustic signals (17–23 kHz) reflected off facial articulators can be phase-tracked by commodity smartphone microphones, with phase-delta and double-delta features encoding detailed lip motion (Luo et al., 2020).
- Wi-Fi backscatter exploits frequency-shifted tag modulation to enable contactless, camera-free lip-motion capture for open-vocabulary silent speech recognition (Tian et al., 26 Jan 2026).
- Other Modalities: High-density neural recordings (ECoG, EEG), electromagnetic articulography (EMA/PMA), and in-ear echo-based sensing (consumer ANC earbuds) provide specialized routes for silent speech decoding and secure authentication (Dong et al., 18 Dec 2025, Gonzalez-Lopez et al., 2020).
2. Signal Processing and Feature Extraction
Each sensor modality imposes unique requirements on preprocessing and feature engineering:
- EMG: Standard processing includes band-pass filtering (cutoffs of roughly 10–20 Hz and 400–450 Hz across studies), wavelet denoising, rectification, and envelope extraction (RMS over 100 ms windows). Higher-dimensional time- and frequency-domain features (e.g., mean, variance, zero-crossing rate, spectral moments) are extracted per window and channel for classification (Lai et al., 2023, Meier et al., 26 Sep 2025); the first sketch after this list illustrates such a pipeline.
- Articulatory Imaging (UTI/lip video): Raw image frames (resized and normalized) are used as direct input to convolutional architectures, or tongue contours and Eigentongue coefficients are computed as input features (Tóth et al., 2021, Zheng et al., 2023).
- Acoustic/Wireless Sensing:
- For acoustic lip radar or Wi-Fi echo systems, coherent demodulation yields complex baseband signals. Instantaneous phase and its derivatives (phase-delta, double-delta) are essential features for lip-motion capture (Luo et al., 2020, Tian et al., 26 Jan 2026); see the second sketch after this list.
- Accelerometers: Raw acceleration and angular-velocity channels are z-normalized, segmented, and optionally augmented with Gaussian noise or synthetic concatenations (Xie et al., 25 Feb 2025).
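As a concrete illustration of the EMG pipeline above, the following is a minimal sketch, assuming a (channels × samples) array sampled at 1 kHz; the 20–450 Hz band, full-wave rectification, 100 ms windows, and the RMS/variance/ZCR feature set mirror the values cited above, while the sampling rate and filter order are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000            # sampling rate in Hz (assumed)
WIN = int(0.1 * FS)  # 100 ms analysis window

def preprocess(emg):
    """Band-pass filter (20-450 Hz) and full-wave rectify each channel."""
    b, a = butter(4, [20, 450], btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, emg, axis=-1)
    return np.abs(filtered)  # rectification

def window_features(emg):
    """Per-window, per-channel features: RMS envelope, variance, ZCR."""
    rect = preprocess(emg)
    n_win = rect.shape[-1] // WIN
    feats = []
    for w in range(n_win):
        seg = rect[..., w * WIN:(w + 1) * WIN]
        rms = np.sqrt(np.mean(seg ** 2, axis=-1))   # envelope estimate
        var = np.var(seg, axis=-1)
        # Zero-crossing rate is computed on the *unrectified* signal.
        raw = emg[..., w * WIN:(w + 1) * WIN]
        zcr = np.mean(np.abs(np.diff(np.sign(raw), axis=-1)) > 0, axis=-1)
        feats.append(np.concatenate([rms, var, zcr]))
    return np.stack(feats)  # shape: (n_windows, 3 * n_channels)

# Example: 8-channel recording, 2 s long (random stand-in data).
features = window_features(np.random.randn(8, 2 * FS))
```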
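And a sketch of the phase-delta and double-delta features used by the acoustic-sensing systems: an inaudible carrier is coherently demodulated to complex baseband, and the unwrapped instantaneous phase is differenced once and twice. The 20 kHz carrier, 48 kHz microphone sampling rate, and 200 Hz low-pass cutoff are assumptions for illustration, not the published configuration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS, FC = 48000, 20000  # mic sampling rate and ultrasonic carrier (assumed)

def phase_delta_features(mic, fs=FS, fc=FC):
    t = np.arange(len(mic)) / fs
    # Coherent demodulation: mix with the carrier, low-pass to baseband.
    baseband = mic * np.exp(-2j * np.pi * fc * t)
    b, a = butter(4, 200, btype="low", fs=fs)  # keep slow lip motion only
    baseband = filtfilt(b, a, baseband)
    phase = np.unwrap(np.angle(baseband))
    delta = np.diff(phase)    # phase-delta: articulator velocity
    ddelta = np.diff(delta)   # double-delta: articulator acceleration
    return delta, ddelta

# Example on 1 s of stand-in microphone data.
delta, ddelta = phase_delta_features(np.random.randn(FS))
```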
3. Core Modeling Architectures
Machine learning models for SSI are generally designed for either speech synthesis (articulatory-to-acoustic regression) or recognition (silent speech-to-text). Key architectural trends include:
- Convolutional Neural Networks (CNN/ResNet/3D-CNN):
- 2D and 3D CNNs encode spatial and spatiotemporal structure in UTI and lip video (Tóth et al., 2021, Tóth et al., 2023).
- 1D ResNets process EMG, strain sensor, or time-series features (Tang et al., 11 Apr 2025, Lai et al., 2023, Tang et al., 2023).
- Knowledge distillation, ensemble voting, and adaptive channel weighting (squeeze-and-excitation) further optimize robustness and model size (Lai et al., 2023, Tang et al., 11 Apr 2025).
- Recurrent and Sequential Models: BiLSTM or BiGRU layers capture temporal dynamics, particularly for modeling context in spectrogram generation from UTI/lip video or in visual-keyword spotting (Tóth et al., 2021, Su et al., 2023).
- Attention and Transformer Models:
- Transformers with self-attention provide full-utterance context for both EMG-to-mel-spectrogram mapping and ASR, mitigating local ambiguities induced by noisy biosignal transduction (Sivasubramaniam, 2 Sep 2025, Benster et al., 2024).
- Conformer architectures (CNN-Transformer hybrids) excel in temporal modeling for kinematic signals from accelerometer arrays (Xie et al., 25 Feb 2025).
- Lexicon-guided Transformer encoders with beam search support open-vocabulary decoding from Wi-Fi Doppler features (Tian et al., 26 Jan 2026).
- Multi-task and Domain Adaptation Mechanisms:
- Domain adversarial training and pseudo-target generation with dynamic time warping align silent and vocalized articulatory domains, improving cross-mode generalization (Zheng et al., 2023); a gradient-reversal sketch follows this list.
- Speaker adaptation modules using x-vectors facilitate transfer across speakers and sessions in multi-speaker ultrasound-to-speech systems (Shandiz et al., 2021, Tóth et al., 2023).
- Cross-Modal Contrastive Learning: Dual-encoder architectures (e.g., MONA) leverage contrastive losses to jointly align silent EMG and audio domains, enabling transfer learning from large audio corpora and markedly better open-vocabulary transcription (Benster et al., 2024); a minimal version of the contrastive objective is also sketched below.
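To make the domain adversarial idea concrete, here is a minimal gradient-reversal-layer sketch in PyTorch: features pass through unchanged in the forward direction, while gradients from a domain classifier are negated, pushing the shared encoder toward representations that do not distinguish silent from vocalized articulation. The scaling factor and layer sizes are illustrative, not the configuration of Zheng et al.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient flowing back to the encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)

# The domain classifier sees reversed gradients w.r.t. the shared encoder.
features = torch.randn(8, 128, requires_grad=True)  # stand-in encoder output
domain_logits = torch.nn.Linear(128, 2)(grad_reverse(features))
```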
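Similarly, a minimal sketch of the symmetric InfoNCE objective underlying such dual-encoder alignment, assuming paired EMG and audio utterances already embedded to a shared dimension; the temperature and embedding size are placeholder choices rather than the published MONA configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emg_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (EMG, audio) embeddings."""
    emg = F.normalize(emg_emb, dim=-1)
    audio = F.normalize(audio_emb, dim=-1)
    logits = emg @ audio.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(logits))     # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Example with random stand-in embeddings (batch of 16, 256-dim).
loss = contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```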
4. Evaluation Methodologies and Key Performance Metrics
SSI research emphasizes robust quantitative and subjective evaluation using modality- and task-specific metrics:
- Speech Recognition: Main metrics are word error rate (WER), sentence/command accuracy, confusion-matrix analysis, and macro-averaged F1 scores; a reference WER implementation follows this list. State-of-the-art EMG-based open-vocabulary models achieve WERs of 12.2% with LLM rescoring, a substantial reduction from the prior 28.8% benchmark (Benster et al., 2024).
- Speech Synthesis: Metrics include Mel-Cepstral Distortion (MCD) for spectral accuracy (≈3–5 dB on UTI/lip systems), mean squared error (MSE) on predicted spectral frames, and subjective naturalness (MOS) (Shandiz et al., 2021, Zheng et al., 2023).
- Robustness: Session-to-session, speaker-independent, and real-noise evaluations quantify system generality. Session-independent accuracy often drops by 10–20%, but domain adaptation and speaker embeddings can partially recover this gap (Meier et al., 26 Sep 2025, Tóth et al., 2023).
- Latency and Efficiency: Model inference time and power consumption are significant factors for wearable adoption; e.g., textile strain-sensing chokers operate at <0.1 GFLOPS per inference and headphone-based wireless EMG systems at <200 mW (Tang et al., 2023, Tang et al., 11 Apr 2025).
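For reference, WER is the word-level Levenshtein (edit) distance between hypothesis and reference, normalized by the reference length; a minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

assert wer("the cat sat", "the cat sat") == 0.0
assert abs(wer("the cat sat", "a cat") - 2 / 3) < 1e-9  # 1 sub + 1 del
```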
5. Adaptive Methods and User Personalization
SSI performance is fundamentally limited by intra- and inter-speaker/session variability. Notable adaptive strategies include:
- Speaker/Session Adaptation: Spatial transformer networks (STN) enable rapid adaptation to new speakers or headset repositioning, closing up to 88–92% of the cross-domain MSE gap with only ~10% of network parameters requiring tuning (Tóth et al., 2023).
- Dynamic Channel Attention: Adaptive neural mechanisms (e.g., SE-ResNet) recalibrate channel weights in response to electrode impedance variability, enhancing decoding robustness under real-world conditions (Tang et al., 11 Apr 2025); see the SE sketch after this list.
- On-Device Customization: Contrastive learning frameworks (LipLearner) support few-shot command personalization, enabling users to enroll novel speech or non-speech lip gestures with high F1 scores (>0.89 with one-shot registration) directly on a mobile device (Su et al., 2023).
- Multi-Task Learning and Security: Integrated authentication and silent spelling decoding (HEar-ID) exploit shared encoder embeddings from commodity earbuds, achieving 67% Top-1 accuracy and low equal error rates (<6.1%) for speaker verification (Dong et al., 18 Dec 2025).
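A minimal sketch of the squeeze-and-excitation recalibration referenced above, assuming multi-channel EMG input shaped (batch, channels, time): each channel is globally average-pooled, passed through a small bottleneck MLP, and rescaled by the resulting sigmoid gate. The reduction ratio of 4 is an assumed default, not the published setting.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, channels, time)
        squeeze = x.mean(dim=-1)          # global average pool over time
        weights = self.gate(squeeze)      # per-channel gates in (0, 1)
        return x * weights.unsqueeze(-1)  # recalibrated channels

# Example: adaptive weighting of 14 EMG channels, 1 s at 1 kHz.
out = SEBlock1d(14)(torch.randn(2, 14, 1000))
```

A noisy or poorly contacting electrode tends to receive a gate near zero, which is why this mechanism helps under impedance drift.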
6. Multimodal, Open-Vocabulary, and Contactless SSI
Recent research advances the breadth and flexibility of SSI systems:
- Open-Vocabulary Recognition: Cross-modal contrastive models and LLM-based scoring adjustment now support single-speaker, open-vocabulary EMG-to-text transcription with WERs below 15% (Benster et al., 2024). Wi-Fi backscatter approaches achieve open-sentence recognition (WER ≈36.9%), nearing the state-of-the-art for vision-based lipreading (Tian et al., 26 Jan 2026).
- Contactless and Camera-Free Approaches:
- Acoustic sensing with cosine phase-delta features and Wi-Fi Doppler/TDD tags enable silent speech decoding without on-body devices or cameras. Such systems achieve speaker/environment-independent WERs of 8.4–36.9%, with real-time inference (Luo et al., 2020, Tian et al., 26 Jan 2026).
- Sentence-Level, Continuous Recognition:
- Six-axis accelerometer arrays with Conformer-CTC models achieve ≈97% accuracy in silent sentence recognition across English and Chinese phrases, handling segmentation, elision, and linking with minimal speaker dependence (Xie et al., 25 Feb 2025); the CTC decoding step is sketched below.
- Integration and Personalization: Multimodal pipelines combining visual, muscular, and acoustic features, with user-initiated few-shot adaptation and hands-free activation, are feasible on modern mobile hardware (Su et al., 2023).
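To illustrate the decoding step behind such CTC-based sentence recognizers, here is a greedy CTC decoding sketch: per-frame argmax labels are collapsed across repeats and blanks are removed. The vocabulary size and blank index are illustrative assumptions.

```python
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """log_probs: (time, vocab) frame-level CTC output for one utterance."""
    path = log_probs.argmax(dim=-1).tolist()  # best label per frame
    decoded, prev = [], blank
    for label in path:
        if label != prev and label != blank:  # collapse repeats, drop blanks
            decoded.append(label)
        prev = label
    return decoded

# Example: 6 frames over a 4-symbol vocabulary (0 = blank).
frames = torch.log_softmax(torch.randn(6, 4), dim=-1)
tokens = ctc_greedy_decode(frames)
```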
7. Limitations, Challenges, and Future Directions
Despite rapid progress, several critical limitations persist:
- Variability and Generalization: Session-to-session and speaker-to-speaker variability remain major obstacles. Adaptive architectures, domain adversarial learning, and cross-modal pretraining partially address these, but universality across large cohorts has not been demonstrated (Shandiz et al., 2021, Tóth et al., 2023).
- Vocabulary and Prosodic Coverage: Most SSI systems operate on limited command sets; open-vocabulary and continuous speech remain challenging, especially in speaker-independent settings. Prosody and paralinguistic cues are underexplored due to reduced signal quality in non-acoustic modalities (Ren et al., 25 Aug 2025, Benster et al., 2024).
- User Comfort and Practicality: Sensor miniaturization, dry-electrode reliability, low-power operation, wearability, and privacy constraints dictate real-world deployment. Graphene-textile sensors, wireless neckbands, and headphone-based EMG arrays demonstrate progress but require further validation over long-term, daily use (Tang et al., 2023, Meier et al., 26 Sep 2025).
- Ethical, Security, and Clinical Aspects: Secure authentication (e.g., in-ear echo, multimodal embedding) and resistance to spoofing are emerging priorities (Dong et al., 18 Dec 2025). Clinical studies in target populations (e.g., laryngectomy, ALS) and large-scale, open, multi-modal datasets will be essential for field maturity (Gonzalez-Lopez et al., 2020).
- Integration with LLMs and Paralinguistics: LLMs integrated with SSI pipelines (as in LISA and GER post-processing) dramatically improve open-vocabulary recognition and error correction (Benster et al., 2024, Sivasubramaniam, 2 Sep 2025). The extraction and synthesis of paralinguistic information (affective states, emotion) directly from silent biosignals remains an open research area (Ren et al., 25 Aug 2025).
References
- (Petridis et al., 2018) Visual-Only Recognition of Normal, Whispered and Silent Speech
- (Gonzalez-Lopez et al., 2020) Silent Speech Interfaces for Speech Restoration: A Review
- (Luo et al., 2020) End-to-end Silent Speech Recognition with Acoustic Sensing
- (Shandiz et al., 2021) Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks
- (Shandiz et al., 2021) Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
- (Tóth et al., 2021) 3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces
- (Lai et al., 2023) Knowledge Distilled Ensemble Model for sEMG-based Silent Speech Interface
- (Su et al., 2023) LipLearner: Customizable Silent Speech Interactions on Mobile Devices
- (Tang et al., 2023) Ultrasensitive Textile Strain Sensors Redefine Wearable Silent Speech Interfaces with High Machine Learning Efficiency
- (Tóth et al., 2023) Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using Spatial Transformer Networks
- (Zheng et al., 2023) Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training
- (Benster et al., 2024) A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition
- (Xie et al., 25 Feb 2025) Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm
- (Tang et al., 11 Apr 2025) Wireless Silent Speech Interface Using Multi-Channel Textile EMG Sensors Integrated into Headphones
- (Ren et al., 25 Aug 2025) An Introduction to Silent Paralinguistics
- (Sivasubramaniam, 2 Sep 2025) From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach
- (Meier et al., 26 Sep 2025) A Parallel Ultra-Low Power Silent Speech Interface based on a Wearable, Fully-dry EMG Neckband
- (Dong et al., 18 Dec 2025) Poster: Recognizing Hidden-in-the-Ear Private Key for Reliable Silent Speech Interface Using Multi-Task Learning
- (Tian et al., 26 Jan 2026) Lip-Siri: Contactless Open-Sentence Silent Speech with Wi-Fi Backscatter