
Silent Speech Interfaces: Biosignal Decoding

Updated 10 February 2026
  • Silent Speech Interfaces are systems that decode speech content from non-acoustic biosignals such as EMG and ultrasound, enabling communication for those unable to vocalize or in privacy-sensitive settings.
  • They employ diverse sensor modalities—including EMG, ultrasound imaging, lip video, accelerometers, and wireless sensing—with specific preprocessing and feature extraction techniques tailored to each signal type.
  • Advanced machine learning models like CNNs, RNNs, and Transformers enable open-vocabulary recognition and speaker adaptation in SSI, achieving robust performance and competitive word error rates.

Silent Speech Interfaces (SSI) are systems that reconstruct speech content, as text or as an acoustic signal, directly from biosignals produced during articulation, bypassing the need for an acoustic speech signal. SSIs decode the intended message from non-acoustic data such as ultrasound imaging of tongue movement, electromyographic activity, or other articulatory, muscular, or neural signals, thereby enabling communication for people who cannot vocalize, or in noise-sensitive or privacy-sensitive environments. Research in this field addresses both direct synthesis of speech waveforms and recognition or transcription of silent articulation, encompassing a diverse range of signal modalities, algorithms, and application domains (Gonzalez-Lopez et al., 2020).

1. SSI Modalities and Biosignal Acquisition

Silent Speech Interfaces span several biosignal acquisition modalities, each with distinct technical characteristics and suitability for different user populations.

  • Electromyography (EMG): Surface EMG employs non-invasive electrodes to record myoelectric potentials from speech-related muscles (e.g., orbicularis oris, masseter, suprahyoid, laryngeal muscles). Multi-channel systems support up to 14 differential channels in wearable neckbands or textile arrays embedded in headphones or chokers (Tang et al., 11 Apr 2025, Meier et al., 26 Sep 2025, Tang et al., 2023). High-density, dry textile electrodes now support robust, day-long operation (Tang et al., 11 Apr 2025). Signal quality is challenged by inter-session impedance variability and motion artifacts.
  • Articulator Motion Capture:
    • Imaging: Ultrasound tongue imaging (UTI) and lip video capture articulator kinematics directly and serve as image-based inputs to recognition and synthesis models (Tóth et al., 2021, Zheng et al., 2023).
    • Strain Sensors: Graphene-coated textile chokers with microcracked films provide ultralow-noise, high-sensitivity detection of thyroid and submandibular movement, supporting single-channel silent speech decoding at high energy efficiency (Tang et al., 2023).
  • Acoustic and Wireless Sensing:
    • Inaudible acoustic signals (17–23 kHz) reflected off facial articulators can be phase-tracked by commodity smartphone microphones, with phase-delta and double-delta features encoding detailed lip motion (Luo et al., 2020); a minimal sketch of this phase-based feature chain appears after this list.
    • Wi-Fi backscatter exploits frequency-shifted tag modulation to enable contactless, camera-free lip-motion capture for open-vocabulary silent speech recognition (Tian et al., 26 Jan 2026).
  • Other Modalities: High-density neural recordings (ECoG, EEG), electromagnetic articulography (EMA/PMA), and in-ear echo-based sensing (consumer ANC earbuds) provide specialized routes for silent speech decoding and secure authentication (Dong et al., 18 Dec 2025, Gonzalez-Lopez et al., 2020).
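
The acoustic-sensing item above describes a phase-based feature chain. Below is a minimal sketch of that chain, assuming a single near-ultrasonic carrier: coherent demodulation to baseband, unwrapped instantaneous phase, and its first and second differences (phase-delta and double-delta). The carrier frequency, filter order, and cutoff are illustrative placeholders, not parameters of the cited systems.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def phase_delta_features(mic, fs, f_carrier=20000.0, bw_hz=500.0):
    """Demodulate a reflected near-ultrasonic tone; return phase, delta, double-delta."""
    t = np.arange(len(mic)) / fs
    # Coherent demodulation: mix with the known carrier to shift the echo to baseband.
    mixed = mic * np.exp(-2j * np.pi * f_carrier * t)
    sos = butter(4, bw_hz, btype="lowpass", fs=fs, output="sos")
    baseband = sosfiltfilt(sos, mixed.real) + 1j * sosfiltfilt(sos, mixed.imag)
    # Instantaneous phase tracks path-length changes caused by articulator motion.
    phase = np.unwrap(np.angle(baseband))
    delta = np.diff(phase, prepend=phase[0])          # phase-delta
    ddelta = np.diff(delta, prepend=delta[0])         # double-delta
    return np.stack([phase, delta, ddelta], axis=-1)  # (time, 3) feature matrix
```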

2. Signal Processing and Feature Extraction

Each sensor modality imposes unique requirements on preprocessing and feature engineering:

  • EMG: Standard processing includes bandpass filtering (lower cutoff of roughly 10–20 Hz, upper cutoff of 400–450 Hz), wavelet denoising, rectification, and envelope extraction (RMS over 100 ms windows). Time- and frequency-domain features (e.g., mean, variance, zero-crossing rate, spectral moments) are then extracted per window and channel for classification (Lai et al., 2023, Meier et al., 26 Sep 2025); a minimal sketch of this pipeline appears after this list.
  • Articulatory Imaging (UTI/lip video): Raw image frames (resized and normalized) are used as direct input to convolutional architectures, or tongue contours and Eigentongue coefficients are computed as input features (Tóth et al., 2021, Zheng et al., 2023).
  • Acoustic/Wireless Sensing:
    • For acoustic lip radar or Wi-Fi echo systems, coherent demodulation yields complex baseband signals. Instantaneous phase and its derivatives (phase-delta, double-delta) are essential features for lip-motion capture (Luo et al., 2020, Tian et al., 26 Jan 2026).
  • Accelerometers: Raw acceleration and angular-velocity signals are z-normalized per channel, segmented, and optionally augmented with Gaussian noise or synthetic concatenations (Xie et al., 25 Feb 2025).
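
As a concrete illustration of the EMG pipeline above, the sketch below applies a bandpass filter, rectification, and a 100 ms RMS envelope, then computes simple per-window time-domain features (mean, variance, zero-crossing rate). The cutoffs and window length follow the ranges quoted above; the wavelet-denoising step is omitted and the function names are illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def emg_envelope(emg, fs, band=(20.0, 450.0), win_s=0.1):
    """Bandpass-filter, rectify, and RMS-smooth one EMG channel."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, emg)
    rectified = np.abs(filtered)
    win = max(1, int(win_s * fs))                     # ~100 ms RMS window
    padded = np.pad(rectified ** 2, (win // 2, win - win // 2 - 1), mode="edge")
    return np.sqrt(np.convolve(padded, np.ones(win) / win, mode="valid"))

def window_features(filtered_emg, fs, win_s=0.1):
    """Per-window time-domain features: mean, variance, zero-crossing rate."""
    win = int(win_s * fs)
    feats = []
    for start in range(0, len(filtered_emg) - win + 1, win):
        seg = filtered_emg[start:start + win]
        zcr = np.mean(np.abs(np.diff(np.signbit(seg).astype(int))))
        feats.append([seg.mean(), seg.var(), zcr])
    return np.asarray(feats)                          # (n_windows, 3) per channel
```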

3. Core Modeling Architectures

Machine learning models for SSI are generally designed either for speech synthesis (articulatory-to-acoustic regression) or for recognition (silent speech-to-text). Architectural trends reported across the cited work include convolutional networks operating directly on ultrasound or lip-video frames (Tóth et al., 2021, Zheng et al., 2023), recurrent and conformer-style sequence models trained with CTC for continuous recognition (Xie et al., 25 Feb 2025), channel-attention variants such as SE-ResNet for multi-channel EMG (Tang et al., 11 Apr 2025), spatial transformer networks for speaker and session adaptation (Tóth et al., 2023), contrastive cross-modal encoders for few-shot and open-vocabulary settings (Su et al., 2023, Benster et al., 2024), and LLM-based rescoring of recognizer hypotheses (Benster et al., 2024). A minimal recognition-oriented sketch follows.
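
The model below is a minimal sketch of the recognition branch: a temporal convolutional frontend over multi-channel biosignal frames, a bidirectional GRU, and a CTC output head. Channel counts, layer sizes, and the character vocabulary are illustrative assumptions, not the architecture of any cited system.

```python
import torch
import torch.nn as nn

class BiosignalCTCRecognizer(nn.Module):
    """Conv frontend + BiGRU + CTC head over (batch, channels, time) biosignal frames."""
    def __init__(self, in_channels=8, vocab_size=30, hidden=128):
        super().__init__()
        # Temporal convolutions mix sensor channels and downsample 4x in time.
        self.frontend = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.rnn = nn.GRU(128, hidden, num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, vocab_size)  # vocabulary includes the CTC blank

    def forward(self, x):
        h = self.frontend(x)                  # (B, 128, T/4)
        h, _ = self.rnn(h.transpose(1, 2))    # (B, T/4, 2*hidden)
        logits = self.head(h)                 # (B, T/4, vocab)
        return logits.log_softmax(-1).transpose(0, 1)  # (T/4, B, vocab) for CTCLoss

# Toy training step: 4 utterances, 8 channels, 1000 feature frames, dummy targets.
model = BiosignalCTCRecognizer()
log_probs = model(torch.randn(4, 8, 1000))
targets = torch.randint(1, 30, (4, 20))
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    torch.full((4,), log_probs.size(0), dtype=torch.long),
    torch.full((4,), 20, dtype=torch.long),
)
```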

4. Evaluation Methodologies and Key Performance Metrics

SSI research emphasizes robust quantitative and subjective evaluation using modality- and task-specific metrics:

  • Speech Recognition: Main metrics are word error rate (WER), sentence/command accuracy, confusion matrix analysis, and macro-averaged F1 scores. State-of-the-art EMG-based open-vocabulary models achieve WERs of 12.2% with LLM rescoring, a substantial reduction from prior 28.8% benchmarks (Benster et al., 2024).
  • Speech Synthesis: Metrics include Mel-Cepstral Distortion (MCD) for spectral accuracy (≈3–5 dB on UTI/lip systems), mean squared error (MSE) on predicted spectral frames, and subjective naturalness (MOS) (Shandiz et al., 2021, Zheng et al., 2023); minimal reference implementations of WER and MCD appear after this list.
  • Robustness: Session-to-session, speaker-independent, and real-noise evaluations quantify system generality. Session-independent accuracy often drops 10–20% but domain adaptation and speaker embeddings can partially recover this gap (Meier et al., 26 Sep 2025, Tóth et al., 2023).
  • Latency and Efficiency: Model inference time and power consumption are significant factors for wearable adoption; e.g., textile strain-sensing chokers operate at <0.1 GFLOPS per inference and headphone-based wireless EMG systems at <200 mW (Tang et al., 2023, Tang et al., 11 Apr 2025).
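
For concreteness, the snippet below gives minimal reference implementations of the two headline metrics: WER as word-level edit distance normalized by reference length, and mel-cepstral distortion in dB over time-aligned cepstral frames (excluding the energy coefficient c0). Frame alignment (e.g., via DTW) is assumed to have been done beforehand.

```python
import numpy as np

def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i, j] = min(d[i - 1, j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1, j] + 1,                           # deletion
                          d[i, j - 1] + 1)                           # insertion
    return d[-1, -1] / max(1, len(r))

def mcd(ref_mcep: np.ndarray, hyp_mcep: np.ndarray) -> float:
    """Mel-cepstral distortion (dB) over aligned frames, excluding c0."""
    diff = ref_mcep[:, 1:] - hyp_mcep[:, 1:]
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))
```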

5. Adaptive Methods and User Personalization

SSI performance is fundamentally limited by intra- and inter-speaker/session variability. Notable adaptive strategies include:

  • Speaker/Session Adaptation: Spatial transformer networks (STN) enable rapid adaptation to new speakers or headset repositioning, closing up to 88–92% of the cross-domain MSE gap with only ~10% of network parameters requiring tuning (Tóth et al., 2023).
  • Dynamic Channel Attention: Adaptive neural mechanisms (e.g., SE-ResNet) recalibrate channel weights in response to electrode impedance variability, enhancing decoding robustness under real-world conditions (Tang et al., 11 Apr 2025); a generic sketch of this reweighting mechanism appears after this list.
  • On-Device Customization: Contrastive learning frameworks (LipLearner) support few-shot command personalization, enabling users to enroll novel speech or non-speech lip gestures with high F1 scores (>0.89 with one-shot registration) directly on mobile devices (Su et al., 2023).
  • Multi-Task Learning and Security: Integrated authentication and silent spelling decoding (HEar-ID) exploit shared encoder embeddings from commodity earbuds, achieving 67% Top-1 accuracy and low equal error rates (<6.1%) for speaker verification (Dong et al., 18 Dec 2025).
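
The channel-attention idea referenced above can be sketched as a generic squeeze-and-excitation block that learns per-channel gains and down-weights unreliable electrodes. This is a minimal illustration of the mechanism, not the SE-ResNet used in the cited system; the reduction ratio is an arbitrary choice.

```python
import torch
import torch.nn as nn

class ChannelSE(nn.Module):
    """Squeeze-and-excitation over the sensor-channel axis of a (B, C, T) signal."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        squeeze = x.mean(dim=-1)           # global average over time -> (B, C)
        weights = self.fc(squeeze)         # learned per-channel gains in (0, 1)
        return x * weights.unsqueeze(-1)   # recalibrate, e.g. suppress noisy electrodes

# Usage: reweight a batch of 14-channel EMG windows before the main encoder.
reweighted = ChannelSE(channels=14)(torch.randn(2, 14, 500))
```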

6. Multimodal, Open-Vocabulary, and Contactless SSI

Recent research advances the breadth and flexibility of SSI systems:

  • Open-Vocabulary Recognition: Cross-modal contrastive models and LLM-based scoring adjustment now support single-speaker, open-vocabulary EMG-to-text transcription with WERs below 15% (Benster et al., 2024); a minimal rescoring sketch appears after this list. Wi-Fi backscatter approaches achieve open-sentence recognition (WER ≈36.9%), nearing the state-of-the-art for vision-based lipreading (Tian et al., 26 Jan 2026).
  • Contactless and Camera-Free Approaches:
    • Acoustic sensing with cosine phase-delta features and Wi-Fi Doppler/TDD tags enable silent speech decoding without on-body devices or cameras. Such systems achieve speaker/environment-independent WERs of 8.4–36.9%, with real-time inference (Luo et al., 2020, Tian et al., 26 Jan 2026).
  • Sentence-Level, Continuous Recognition:
    • Six-axis accelerometer arrays with conformer-CTC models achieve ≈97% accuracy in silent sentence recognition across English and Chinese phrases, handling segmentation, elision, and linking with minimal speaker dependence (Xie et al., 25 Feb 2025).
  • Integration and Personalization: Multimodal pipelines combining visual, muscular, and acoustic features, with user-initiated few-shot adaptation and hands-free activation, are feasible on modern mobile hardware (Su et al., 2023).
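
The LLM-based scoring adjustment mentioned in the open-vocabulary item can be summarized as n-best rescoring: the recognizer proposes several hypotheses with decoder scores, and a language model re-ranks them. The sketch below is a generic score-fusion illustration; the fusion weight and the stand-in `lm_logprob` callable are assumptions, not the cited method.

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    hypotheses: List[Tuple[str, float]],   # (text, decoder log-probability)
    lm_logprob: Callable[[str], float],    # stand-in for an LLM scoring call
    lm_weight: float = 0.5,
) -> str:
    """Return the hypothesis maximizing decoder score + weighted LM score."""
    return max(hypotheses, key=lambda th: th[1] + lm_weight * lm_logprob(th[0]))[0]

# Toy usage with a length penalty standing in for a real language model.
best = rescore_nbest(
    [("set a timer", -4.2), ("said a time her", -3.9)],
    lm_logprob=lambda s: -0.1 * len(s.split()),
)
```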

7. Limitations, Challenges, and Future Directions

Despite rapid progress, several critical limitations persist:

  • Variability and Generalization: Session-to-session and speaker-to-speaker variability remain major obstacles. Adaptive architectures, domain adversarial learning, and cross-modal pretraining partially address these, but universality across large cohorts has not been demonstrated (Shandiz et al., 2021, Tóth et al., 2023).
  • Vocabulary and Prosodic Coverage: Most SSI systems operate on limited command sets; open-vocabulary and continuous speech remain challenging, especially in speaker-independent settings. Prosody and paralinguistic cues are underexplored due to reduced signal quality in non-acoustic modalities (Ren et al., 25 Aug 2025, Benster et al., 2024).
  • User Comfort and Practicality: Sensor miniaturization, dry-electrode reliability, low-power operation, wearability, and privacy constraints dictate real-world deployment. Graphene-textile sensors, wireless neckbands, and headphone-based EMG arrays demonstrate progress but require further validation over long-term, daily use (Tang et al., 2023, Meier et al., 26 Sep 2025).
  • Ethical, Security, and Clinical Aspects: Secure authentication (e.g., in-ear echo, multimodal embedding) and resistance to spoofing are emerging priorities (Dong et al., 18 Dec 2025). Clinical studies in target populations (e.g., laryngectomy, ALS) and large-scale, open, multi-modal datasets will be essential for field maturity (Gonzalez-Lopez et al., 2020).
  • Integration with LLMs and Paralinguistics: LLMs integrated with SSI pipelines (as in LISA and GER post-processing) dramatically improve open-vocabulary recognition and error correction (Benster et al., 2024, Sivasubramaniam, 2 Sep 2025). The extraction and synthesis of paralinguistic information (affective states, emotion) directly from silent biosignals remains an open research area (Ren et al., 25 Aug 2025).
