SenseVoice: Multimodal Speech & Silent Speech Tech
- SenseVoice is a multimodal technology combining open-source ASR models with silent-speech interfaces that utilize ultrasound imaging.
- It employs advanced deep neural networks, including transformer and CNN architectures, to process both audio and articulatory signals with ultra-low latency.
- The system supports multilingual, emotion-aware applications and noise-immune human–AI interaction, setting a new standard in voice-centric interfaces.
SenseVoice encompasses two distinct but complementary technological lines: (1) an open-source, end-to-end multimodal automatic speech understanding model family for large-scale multilingual applications, serving as the speech-understanding backbone of FunAudioLLM (An et al., 2024); and (2) a generalized framework for silent-speech interfaces using ultrasound imaging and deep neural networks, illustrated by SottoVoce (Kimura et al., 2023) and extrapolated for future wearable and noise-immune human–AI voice control. These approaches collectively advance the ability of machines to robustly perceive, transcribe, and interpret human vocal or articulatory signals, both in conventional voiced and unvoiced (“silent speech”) scenarios, enabling granular language, emotion, and audio event understanding suitable for downstream LLM-based reasoning.
1. Model Architectures and Modalities
SenseVoice within the FunAudioLLM ecosystem includes two principal instantiations:
- SenseVoice-Small: A non-autoregressive, encoder-only architecture built on memory-equipped self-attention (SAN-M) blocks. The front end extracts 80-dimensional log-Mel filter-bank features and down-samples them by a factor of 6; four learned task-token embeddings are then prepended to the feature sequence. The SAN-M encoder stack outputs logits for automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event classification (AEC).
- SenseVoice-Large: An autoregressive encoder–decoder following a Whisper-style transformer architecture. Input audio is processed into log-Mel features and fed to a multi-layer transformer encoder, with a decoder predicting sequences of tokens, beginning with explicit control tokens specifying the tasks. This model extends coverage to over 50 languages and incorporates control over task-specific output.
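The SenseVoice-Small input preparation described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the released implementation: the feature dimension, down-sampling factor, and token count come from the text, while the frame-stacking strategy, the 10 ms hop, the random placeholder values, and the function names are assumptions.

```python
import numpy as np

N_MELS = 80        # feature dimension stated in the text
FACTOR = 6         # down-sampling factor stated in the text
N_TASK_TOKENS = 4  # number of prepended task tokens stated in the text


def downsample_by_stacking(feats: np.ndarray, factor: int = FACTOR) -> np.ndarray:
    """Concatenate each group of `factor` consecutive frames into one frame.

    Assumption: down-sampling is done by frame stacking; trailing frames
    that do not fill a whole group are dropped.
    """
    n_frames, dim = feats.shape
    usable = (n_frames // factor) * factor
    return feats[:usable].reshape(usable // factor, factor * dim)


def prepend_task_tokens(frames: np.ndarray, task_embeddings: np.ndarray) -> np.ndarray:
    """Place the task-token embeddings in front of the feature sequence."""
    return np.concatenate([task_embeddings, frames], axis=0)


rng = np.random.default_rng(0)
# Placeholder features: 1000 frames, i.e. about 10 s at a hypothetical 10 ms hop.
fbank = rng.standard_normal((1000, N_MELS))
stacked = downsample_by_stacking(fbank)
# Random stand-ins for the learned task-token embeddings.
task_tokens = rng.standard_normal((N_TASK_TOKENS, stacked.shape[1]))
encoder_input = prepend_task_tokens(stacked, task_tokens)
print(encoder_input.shape)  # (170, 480)
```

With these assumptions a 10 s utterance shrinks from 1000 frames to 166 stacked frames plus 4 task tokens, which is part of why an encoder-only, single-pass model can be fast.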
In the silent-speech domain, the SottoVoce prototype illustrates a “SenseVoice” pipeline for reconstructing Mel-spectrograms directly from ultrasound images of the vocal tract (Kimura et al., 2023). This employs:
- Ultrasound Articulator Sensing: A convex 3.5 MHz transducer mounted beneath the chin captures dynamic tongue and oral cavity movement. Image preprocessing involves cropping to the articulator region, intensity normalization, and optional speckle denoising.
- Deep Neural Network Staging:
  - A convolutional network maps ultrasound stacks (≈400 ms context) to Mel-spectral vectors.
  - Subsequent 1D temporal convolution and U-Net processing further refine the Mel sequences, capturing multi-scale articulation dynamics.
- Audio Synthesis: Inverse spectrogram techniques (e.g., Griffin–Lim) render the Mel predictions as audible waveforms for control or feedback.
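To make the "ultrasound stacks" in the first network stage concrete, the sketch below normalizes a sequence of ultrasound frames and groups them into sliding context windows of roughly 400 ms. The frame rate, image size, and function name are hypothetical and chosen only for illustration; the source states only the approximate context length.

```python
import numpy as np

FRAME_RATE_HZ = 30.0  # hypothetical ultrasound frame rate (not from the source)
CONTEXT_S = 0.4       # ~400 ms context, as stated in the text


def context_stacks(frames: np.ndarray, n_context: int) -> np.ndarray:
    """Return sliding windows of shape (T - n_context + 1, n_context, H, W)."""
    t = frames.shape[0]
    if t < n_context:
        raise ValueError("sequence is shorter than one context window")
    starts = np.arange(t - n_context + 1)[:, None]
    offsets = np.arange(n_context)[None, :]
    return frames[starts + offsets]


n_context = int(round(FRAME_RATE_HZ * CONTEXT_S))  # 12 frames at 30 Hz

# Placeholder data standing in for cropped ultrasound images.
frames = np.random.default_rng(0).random((60, 64, 64)).astype(np.float32)

# Per-frame intensity normalization (zero mean, unit variance).
mean = frames.mean(axis=(1, 2), keepdims=True)
std = frames.std(axis=(1, 2), keepdims=True)
frames = (frames - mean) / (std + 1e-8)

stacks = context_stacks(frames, n_context)
print(stacks.shape)  # (49, 12, 64, 64)
```

Each window would be the input from which the convolutional stage predicts one Mel-spectral vector.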
2. Training Data, Preprocessing, and Loss Functions
SenseVoice-Small is trained on a 300,000-hour corpus spanning Chinese (ZH), Cantonese (Yue), English (EN), Japanese (JP), and Korean (KO), whereas SenseVoice-Large adds a further 100,000 hours covering 50+ languages. Both rely on substantial data for the auxiliary tasks: 150 million audio clips for audio event detection (AED) and 30 million for SER, pseudo-labeled via open-source models (An et al., 2024). Data cleaning includes SNR-based filtering, voice activity detection, and forced alignment.
Training is multi-task: a cross-entropy loss ($\mathcal{L}_{\mathrm{CE}}$) is applied to control-token classification and a connectionist temporal classification (CTC) loss ($\mathcal{L}_{\mathrm{CTC}}$) to ASR, and the two terms are optimized jointly.
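One common way to write such a joint objective is given below, with $\mathcal{L}_{\mathrm{CTC}}$ denoting the CTC loss and $\mathcal{L}_{\mathrm{CE}}$ the cross-entropy loss. This is a generic formulation, not one confirmed by the source. The weight $\lambda$, the choice of LID, SER, and AEC as the classification heads $k$, and the notation $X$ (input features), $Y$ (target transcript), $y_k$ (label for head $k$), and $\mathcal{B}$ (the CTC collapsing map over frame-level paths $\pi$) are all assumptions made for illustration.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CTC}} + \lambda \, \mathcal{L}_{\mathrm{CE}}

\mathcal{L}_{\mathrm{CE}} = - \sum_{k \in \{\mathrm{LID},\, \mathrm{SER},\, \mathrm{AEC}\}} \log p_k\!\left(y_k \mid X\right)

\mathcal{L}_{\mathrm{CTC}} = - \log \sum_{\pi \in \mathcal{B}^{-1}(Y)} \; \prod_{t=1}^{T} p\!\left(\pi_t \mid X\right)
```

Setting $\lambda = 1$ recovers an unweighted sum of the two terms.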
For ultrasound-based silent-speech, dataset creation requires speaker-specific modeling (≈500 labeled utterances per subject) and careful alignment of ultrasound to audio features. Mel feature targets are extracted from voiced examples for supervised training using mean-squared error (MSE) as the objective function (Kimura et al., 2023).
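The alignment and regression objective described for the silent-speech data can be sketched as follows. The nearest-in-time pairing rule, the two frame rates, and the function names are assumptions for illustration; the source specifies only that ultrasound is aligned to audio features and that MSE is the training objective.

```python
import numpy as np


def align_to_targets(n_src: int, src_rate_hz: float,
                     n_tgt: int, tgt_rate_hz: float) -> np.ndarray:
    """For each target (Mel) frame, return the index of the source
    (ultrasound) frame that is nearest in time."""
    tgt_times = np.arange(n_tgt) / tgt_rate_hz
    idx = np.rint(tgt_times * src_rate_hz).astype(int)
    return np.clip(idx, 0, n_src - 1)


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean-squared error between predicted and target Mel features."""
    return float(np.mean((pred - target) ** 2))


# One second of data at hypothetical rates: 30 ultrasound frames, 100 Mel frames.
idx = align_to_targets(n_src=30, src_rate_hz=30.0, n_tgt=100, tgt_rate_hz=100.0)

rng = np.random.default_rng(0)
mel_target = rng.standard_normal((100, 80))  # placeholder Mel targets
mel_pred = mel_target + 0.1                   # placeholder model output
print(idx[:5], round(mse(mel_pred, mel_target), 4))
```

Because the Mel targets come from voiced recordings, a pairing step of this kind is what lets a model trained on voiced examples later be driven by silent articulation alone.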
3. Evaluation Metrics and Comparative Performance
SenseVoice’s evaluation encompasses:
- ASR: Character Error Rate (CER) for CJK languages and Word Error Rate (WER) for others.
  - SenseVoice-Small achieves 2.96 % CER on AISHELL-1 and SenseVoice-Large 2.09 %, outperforming Whisper-Large-V3 (5.14 %) (An et al., 2024).
  - Real-Time Factor (RTF) for SenseVoice-Small is 0.007 (70 ms latency per 10 s utterance), 5–15× faster than Whisper baselines.
- Speech Emotion Recognition: Metrics include Unweighted Accuracy (UA), Weighted Accuracy (WA), macro-F1, and weighted F1.
  - SenseVoice-Large achieves >93 % WA on CASIA and sets new state-of-the-art across seven emotion datasets without task-specific fine-tuning.
- Audio Event Detection: F1 score across multiple benchmarks (e.g., ESC-50, Coswara cough detection). SenseVoice matches or closely trails audio-only models (BEATs, PANNs).
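For readers unfamiliar with these metrics, the sketch below shows how CER, WER, WA, and UA are conventionally computed. It is a minimal reference implementation of the standard definitions, not the evaluation code used by the cited papers; the example strings and labels are made up.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


def weighted_accuracy(y_true, y_pred) -> float:
    """Overall accuracy: each class counts in proportion to its frequency."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred) -> float:
    """Mean of per-class recalls: every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


print(wer("turn on the light", "turn the lights"))  # 0.5
labels = ["happy", "happy", "happy", "sad"]
preds = ["happy", "happy", "happy", "happy"]
print(weighted_accuracy(labels, preds), unweighted_accuracy(labels, preds))  # 0.75 0.5
```

The last line illustrates why both WA and UA are reported: a classifier that always predicts the majority emotion scores well on WA but poorly on UA.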
For SottoVoce (silent-speech), audio reconstruction accuracy is assessed by smart speaker command recognition (65.0 % for telco commands), spectrogram similarity, and end-to-end WER. Silent-speech utterances showed meaningful intelligibility, which improved further through human–AI co-adaptation driven by user feedback (Kimura et al., 2023).
4. System Integration and Downstream Application
SenseVoice models function as perceptual front-ends to LLM-driven pipelines (An et al., 2024), emitting text tokens and task-specific meta-information (LID, emotion, audio events). These outputs serve as input to LLMs for compositional reasoning and generation tasks, enabling:
- Speech-to-speech translation: Transcription and language tagging by SenseVoice drive LLM-based translation, followed by CosyVoice for expressive, cross-lingual TTS.
- Emotion-aware dialog systems: Emission of transcription and SER tags allows the LLM to generate style-conditioned responses, synthesized by CosyVoice.
- Interactive audio grounding: Event and timestamp tokens from SenseVoice support multimodal LLMs in structured scene understanding or targeted dialog referencing audio segments.
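The hand-off from a speech front-end to an LLM can be sketched as parsing a tagged transcript and turning it into a prompt. Everything here is illustrative: the `<|...|>` tag layout, the tag order, and the function and field names are assumptions for the example, not a documented SenseVoice or FunAudioLLM interface, and the LLM and TTS calls themselves are left out.

```python
import re
from typing import NamedTuple

# Hypothetical tag syntax: a tagged prefix followed by the transcript text.
TAG_RE = re.compile(r"<\|([^|]+)\|>")


class RichTranscript(NamedTuple):
    language: str
    emotion: str
    event: str
    text: str


def parse_rich_transcript(raw: str) -> RichTranscript:
    """Split a tagged string into language, emotion, event, and plain text.

    Assumes the first three tags are language, emotion, and audio event,
    in that order.
    """
    tags = TAG_RE.findall(raw)
    if len(tags) < 3:
        raise ValueError("expected language, emotion, and event tags")
    text = TAG_RE.sub("", raw).strip()
    return RichTranscript(tags[0], tags[1], tags[2], text)


def build_translation_prompt(rt: RichTranscript, target_language: str) -> str:
    """Compose an LLM prompt that carries the paralinguistic metadata along."""
    return (
        f"Source language: {rt.language}\n"
        f"Speaker emotion: {rt.emotion}\n"
        f"Audio event: {rt.event}\n"
        f"Translate into {target_language}, preserving the emotional tone:\n"
        f"{rt.text}"
    )


example = parse_rich_transcript("<|en|><|HAPPY|><|Speech|>The meeting went really well.")
prompt = build_translation_prompt(example, "ja")
print(prompt)
```

In a speech-to-speech setup, the LLM's answer to this prompt would then be passed to a TTS model such as CosyVoice, with the emotion tag available as a style hint.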
Silent-speech "SenseVoice" systems generalize this vision to noise-immune, privacy-preserving interfaces, robust to environmental sound and suited to medical applications, using ultrasound articulation as the input modality (Kimura et al., 2023). Audio output can be generated at low latency for direct device control or feedback.
5. Comparative Model Properties: Small vs. Large, Voiced vs. Silent
| Variant/Modality | Main Features | Coverage; Latency; Size |
|---|---|---|
| SenseVoice-Small | Encoder-only, SAN-M, non-autoregressive; ultra-low latency | 5 (ZH, Yue, EN, JP, KO); 70 ms latency (10 s utterance); 234M params |
| SenseVoice-Large | Encoder–decoder, transformer, autoregressive, high-accuracy | 50+ languages; 1.6 s latency (10 s utterance); 1.587B params |
| SottoVoce (Silent) | Two-stage CNN/Conv1D U-Net; ultrasound to Mel-spectra | Personalized; 2.61 s for 3.68 s input; speaker-dependent |
SenseVoice-Small is optimized for sub-100 ms inference suitable for real-time dialog, whereas SenseVoice-Large prioritizes maximal language coverage and ASR accuracy. Of the two, SenseVoice-Large is the better suited to low-resource and code-switched languages and to joint language–emotion–event extraction. SottoVoce’s architecture is currently restricted to single-speaker, personalized modeling but demonstrates the feasibility of silent-speech control.
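The latency figures in the table can be put on a common scale by converting them to real-time factors (processing time divided by audio duration). The short calculation below does this, under the assumption that each quoted latency is total processing time for the stated input length; the function name is illustrative.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF below 1.0 means the system runs faster than real time."""
    return processing_s / audio_s


# (processing time in seconds, input duration in seconds), taken from the table.
systems = {
    "SenseVoice-Small": (0.070, 10.0),
    "SenseVoice-Large": (1.6, 10.0),
    "SottoVoce": (2.61, 3.68),
}

for name, (processing_s, audio_s) in systems.items():
    print(f"{name}: RTF = {real_time_factor(processing_s, audio_s):.3f}")
# SenseVoice-Small: RTF = 0.007
# SenseVoice-Large: RTF = 0.160
# SottoVoce: RTF = 0.709
```

On this reading all three systems run faster than real time, but they sit on very different hardware and tasks, so the comparison indicates order of magnitude only.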
6. Limitations and Prospects for Future Development
Current limitations include:
- Streamability: SenseVoice is not fully streamable; modular additions such as early-exit mechanisms and partial-frame processing could support streaming in future iterations (An et al., 2024).
- User adaptation: SottoVoce-based silent-speech requires per-user retraining; broader SenseVoice platforms could include rapid speaker adaptation or multi-modal fusion (EMG, IMU, lip camera) for user-independent modeling (Kimura et al., 2023).
- Form factor and safety: Current ultrasound hardware is bulky and not integrated. A full SenseVoice realization would utilize miniaturized, wearable ultrasound ASICs with safe, event-triggered activation.
- Clinical translation: SenseVoice frameworks illustrate the possibility of silent-speech restoration for individuals with vocal-fold impairments, contingent on improvements in robustness and user-initiated retraining.
A plausible implication is that, as LLMs and multi-modal foundation models become more tightly coupled with sensor-fusion and edge inference hardware, SenseVoice-type architectures will enable context-rich, privacy-preserving, and broadly accessible speech interfaces in both voiced and silent settings.
7. Significance in Foundation Model Ecosystem
SenseVoice defines the interface between raw sensor input (audio for voiced speech, articulation for silent speech) and semantic reasoning over transcripts, paralinguistics, and environmental events, all within a scalable foundation model framework. Its architecture and multi-task training paradigm set a precedent for future work in robust, ultra-low-latency, multilingual, and multi-modal human–machine verbal interaction (An et al., 2024; Kimura et al., 2023). The open-source release on ModelScope, Hugging Face, and GitHub positions SenseVoice as a cornerstone technology for real-time, emotion-aware, cross-lingual, and privacy-preserving voice-centric AI applications.