MONA: Multimodal Orofacial Neural Audio
- Multimodal Orofacial Neural Audio (MONA) is a framework that fuses diverse sensor data (EMA, EMG, audio, video, NAM) with deep learning to enhance speech-related tasks.
- MONA employs modality-specific encoders and adaptive fusion strategies with joint optimization, ensuring robust performance in speech synthesis, recognition, and enhancement.
- The system demonstrates improved noise robustness, silent speech conversion, and clinical interpretability, paving the way for silent speech interfaces and expressive animation.
Multimodal Orofacial Neural Audio (MONA) refers to frameworks that synthesize and interpret speech and orofacial behavior by fusing diverse physiological, audio, and neural signals with modern deep learning. As demonstrated in recent literature, these frameworks address the limitations of unimodal systems by integrating heterogeneous sensor data streams, such as electromagnetic articulography (EMA), surface electromyography (EMG), body-conducted vibrations, lip video, and conventional audio, within end-to-end architectures for speech reconstruction, recognition, animation, enhancement, and clinical assessment. Core innovations include modality-specific encoders, advanced cross-modal fusion mechanisms, joint optimization using multimodal loss functions, and robustness to missing or degraded modalities. This approach yields substantial gains in noise robustness, silent speech conversion, expressive talking-head synthesis, sensor fault tolerance, and interpretable clinical inference.
1. Multimodal Sensing: Modalities and Data Synchronization
MONA systems operate on a diverse repertoire of input streams, tightly synchronized at the sensor level:
- Electromagnetic Articulography (EMA): Magnetic coil displacement vectors sampled at 200–250 Hz from articulators (tongue tip/blade/dorsum/rear, upper/lower lip, jaw, velum), providing low-dimensional but time-resolved geometric information (Chen et al., 2021).
- Surface EMG: Multi-channel (commonly 6–35 electrodes) facial and neck EMG at 1–2 kHz, processed into band-split, context-stacked amplitude features to capture muscle activation underlying articulatory gestures (Wang et al., 2022, Benster et al., 2024, Zhou et al., 28 Jan 2025).
- Audio/Microphones: Airborne and body-conduction microphones capturing the speech waveform, processed into log-mel or other time-frequency representations (Chen et al., 2021, Kim et al., 24 Aug 2025).
- Video: Lip-ROI or full-face grayscale/RGB video at ∼25–30 fps, typically cropped, grayscale-converted, and normalized for articulator position and motion (Zhou et al., 28 Jan 2025, Shah et al., 2024).
- Non-Audible Murmur (NAM): High-bandwidth stethoscope placed under the chin, providing sub- and supra-glottal vibration signals (Shah et al., 2024).
- Text/Phonetic Transcripts: Used for alignment, synthesis targets, and evaluation in articulation-to-speech pipelines (Steiner et al., 2016, Shah et al., 2024).
Synchronized acquisition typically involves hardware or software triggers, with either manual alignment or end-to-end learned fusion. Factors such as window size, normalization, per-modality augmentation, and artifact rejection are critical for effective downstream integration (Zhou et al., 28 Jan 2025).
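To make the synchronization step concrete, the following is a minimal sketch, assuming each stream arrives as timestamped samples at its native rate and is linearly interpolated onto a shared 100 Hz frame grid. The function names, rates, and feature dimensions are illustrative, not taken from any cited system.

```python
# Minimal synchronization sketch: resample each stream onto a common frame
# grid by linear interpolation, then concatenate per-frame features.
import numpy as np

FRAME_RATE = 100.0  # target frames per second (an assumption for this sketch)

def resample_to_frames(t_src: np.ndarray, x_src: np.ndarray,
                       t_frames: np.ndarray) -> np.ndarray:
    """Linearly interpolate a (T, D) signal onto the shared frame times."""
    return np.stack(
        [np.interp(t_frames, t_src, x_src[:, d]) for d in range(x_src.shape[1])],
        axis=1,
    )

# Toy streams at their native rates (EMA ~200 Hz, EMG amplitude ~2 kHz).
dur = 2.0
t_ema = np.arange(0, dur, 1 / 200.0)
ema = np.random.randn(len(t_ema), 12)         # e.g., 6 coils x (x, y)
t_emg = np.arange(0, dur, 1 / 2000.0)
emg = np.abs(np.random.randn(len(t_emg), 8))  # rectified amplitude features

t_frames = np.arange(0, dur, 1 / FRAME_RATE)
fused = np.concatenate(
    [resample_to_frames(t_ema, ema, t_frames),
     resample_to_frames(t_emg, emg, t_frames)],
    axis=1,
)  # (200 frames, 20 features), ready for per-modality normalization
print(fused.shape)
```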
2. Core Architecture: Encoders, Fusion, and Joint Training
MONA pipelines adopt modular, modality-specific neural encoders and sophisticated fusion strategies:
- Modality-Specific Encoders: Bidirectional LSTMs for EMA (Chen et al., 2021); 1D/2D/3D convolutions and residual structures for EMG/audio/video (Wang et al., 2022, Zhou et al., 28 Jan 2025); foundation models (e.g., DeepSeek-VL2, TRILLsson) in clinical tasks (Akhtar et al., 11 Jan 2026).
- Fusion Mechanisms:
- Feature-level concatenation with downstream temporal modeling (Bi-GRU, Transformer) (Zhou et al., 28 Jan 2025).
- Late fusion: Per-modality encoders followed by concatenation or gating (Wang et al., 2022).
- Adaptive gating: Sparse gates dynamically weight modalities based on context or reliability, enabling robust handling of missing or degraded inputs (Akhtar et al., 11 Jan 2026); a minimal sketch follows this list.
- Noise-adaptive fusion: Explicit fusion weighting informed by local SNR or network mask outputs, as in BAF-Net (Kim et al., 24 Aug 2025).
- Multi-modal adaptive normalization (MAN): Parameter sharing of normalization statistics among multiple audio and video features, controlled by learned gating (Kumar et al., 2020).
- Cycle alignment and hierarchical VAEs: Latent alignment and disentanglement to isolate shared vs. private representations (Akhtar et al., 11 Jan 2026).
- Loss Functions:
- Joint spectrogram, mel-spectrogram, and deep feature losses for articulatory-to-speech training (Chen et al., 2021).
- Cross-modal contrastive objectives align latent spaces across modalities and datasets (Benster et al., 2024).
- Task-specific L1 or L2 reconstruction, adversarial, CTC, and classification losses for speech quality, intelligibility, and downstream predictive tasks (Chen et al., 2021, Wang et al., 2022, Shah et al., 2024).
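To make the fusion mechanisms above concrete, here is a minimal PyTorch sketch of adaptive gated fusion: per-modality encoders project each stream to a shared width, a learned gate softmax-weights modalities per frame, and a shared recurrent model consumes the fused sequence. All module choices and dimensions are illustrative assumptions, not the published configuration of any cited system.

```python
# Illustrative gated fusion (not the exact architecture of any cited paper).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dims: dict[str, int], hidden: int = 128):
        super().__init__()
        # Per-modality encoders project each stream to a shared width.
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(d, hidden) for name, d in dims.items()}
        )
        # Gate scores one scalar per modality per frame.
        self.gate = nn.Linear(hidden, 1)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True,
                               bidirectional=True)

    def forward(self, streams: dict[str, torch.Tensor]) -> torch.Tensor:
        # streams[name]: (batch, frames, dims[name]); a missing modality may
        # simply be absent from the dict, and the softmax renormalizes.
        encoded = [torch.tanh(self.encoders[n](x)) for n, x in streams.items()]
        stacked = torch.stack(encoded, dim=2)               # (B, T, M, H)
        weights = torch.softmax(self.gate(stacked), dim=2)  # (B, T, M, 1)
        fused = (weights * stacked).sum(dim=2)              # (B, T, H)
        out, _ = self.temporal(fused)
        return out

model = GatedFusion({"ema": 12, "emg": 8, "audio": 80})
batch = {"ema": torch.randn(4, 200, 12), "emg": torch.randn(4, 200, 8),
         "audio": torch.randn(4, 200, 80)}
print(model(batch).shape)  # torch.Size([4, 200, 256])
```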
Stage-wise or end-to-end joint optimization protocols are used, with curriculum strategies for complex architectures such as noise-adaptive fusion or hierarchical disentanglement (Kim et al., 24 Aug 2025, Akhtar et al., 11 Jan 2026). A sketch of a combined objective follows.
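As a concrete but purely illustrative instance, the snippet below mixes an L1 mel-spectrogram reconstruction term with a CTC recognition term; the loss weights and tensor shapes are assumptions for this sketch.

```python
# Sketch of a joint multimodal objective: reconstruction plus recognition.
import torch
import torch.nn.functional as F

def joint_loss(pred_mel, target_mel, log_probs, targets,
               input_lens, target_lens, w_rec=1.0, w_ctc=0.5):
    rec = F.l1_loss(pred_mel, target_mel)            # spectrogram fidelity
    ctc = F.ctc_loss(log_probs, targets, input_lens, # phoneme/word targets
                     target_lens, blank=0)
    return w_rec * rec + w_ctc * ctc

# Toy shapes: torch's CTC loss expects (T, B, C) log-probabilities.
T, B, C = 200, 4, 40
loss = joint_loss(
    torch.randn(B, T, 80), torch.randn(B, T, 80),
    torch.randn(T, B, C).log_softmax(-1),
    torch.randint(1, C, (B, 30)),
    torch.full((B,), T, dtype=torch.long),
    torch.full((B,), 30, dtype=torch.long),
)
print(loss.item())
```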
3. Advances in Speech Synthesis, Recognition, and Enhancement
MONA frameworks substantially advance several task domains:
- Articulatory-to-Speech Synthesis: EMA2S demonstrates direct conversion of mid-sagittal EMA trajectories to audio via joint LSTM encoders, mel-spectrogram mapping, and neural vocoders (Parallel WaveGAN). Multimodal joint training regularizes the low-dimensional articulatory stream against high-resource audio, improving synthesized speech fidelity and naturalness (Chen et al., 2021); a simplified mapping sketch follows this list.
- Silent Speech & NAM Speech: Integration of NAM, EMG, lip video, and phonetic transcripts with diffusion-based and TTS ground-truth simulation (MultiNAM, Diff-NAM) enables fully or nearly silent speech-to-audio conversion that is robust to missing vocal tract sound (Shah et al., 2024, Benster et al., 2024). MONA LISA achieves 12.2% WER on the Gaddy silent-EMG benchmark, improving on the prior state of the art (28.8%) by more than 16 points (Benster et al., 2024).
- Speech Enhancement: Systems fusing EMG and audio enhance noisy speech more effectively than audio-only SE approaches, with the largest gains under extreme SNR and unseen noise. Cheek-only EMG channels are sufficient for high performance, simplifying the sensor setup (Wang et al., 2022). Similarly, BMS/AMS adaptive fusion achieves substantial PESQ and STOI gains over the best unimodal baselines, especially at low SNR (Kim et al., 24 Aug 2025).
- Multimodal Speech Recognition: Large-scale fusion of audio, video (lip-ROI), and EMG on corpus-level benchmarks yields >95% accuracy, with marked robustness to noise and inter-subject variability. Transformer-based fusion outperforms naive concatenation, and attention-based weighting yields further gains under unbalanced conditions (Zhou et al., 28 Jan 2025).
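The following is a simplified sketch in the spirit of the EMA-to-mel mapping described above: a bidirectional LSTM maps articulator trajectories to mel frames. The real EMA2S additionally trains jointly with audio and renders waveforms with a neural vocoder; the layer sizes here are assumptions, not the published configuration.

```python
# Simplified EMA-to-mel mapping sketch (illustrative sizes).
import torch
import torch.nn as nn

class EmaToMel(nn.Module):
    def __init__(self, ema_dim=12, hidden=256, n_mels=80):
        super().__init__()
        self.lstm = nn.LSTM(ema_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, ema):            # ema: (batch, frames, ema_dim)
        h, _ = self.lstm(ema)
        return self.proj(h)            # (batch, frames, n_mels)

mel = EmaToMel()(torch.randn(2, 250, 12))
print(mel.shape)  # torch.Size([2, 250, 80]); a vocoder would render waveform
```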
4. Expressive Talking-Head Generation and 3D Orofacial Modeling
MONA also encompasses neural audio-driven facial motion and video synthesis:
- Audio-to-3D and 2D Animation: Adaptive normalization and dual-attention networks fuse speech prosody, pitch, energy, optical flow, and facial keypoints to generate temporally coherent, high-fidelity talking heads. Ablation studies confirm the necessity of fusing multiple modalities for synchronous, expressive, and anatomically plausible animations (Kumar et al., 2020, Liu et al., 2023).
- 3D Orofacial Models: Multilinear statistical shape models derived from MRI synchronize tongue/jaw/lip motion with synthesized audio, facilitating computer-assisted pronunciation and real-time speech therapy (Steiner et al., 2016).
- Dense Landmark Recovery: Compositional facial landmark generators and temporally guided U-Nets translate audio or latent orofacial features into detailed facial mesh sequences (Liu et al., 2023).
Performance metrics include SSIM, PSNR, lip-reading WER (a proxy for lip-sync accuracy), landmark distance (LMD), blink rate, and human user studies, with multimodal fusion models outperforming single-modality baselines and prior art across all evaluated axes. A sketch of the LMD metric follows.
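This is a minimal, unnormalized version of LMD: the mean Euclidean distance between predicted and ground-truth landmarks across frames. Per-paper normalization conventions vary and none is assumed here.

```python
# Landmark distance (LMD) sketch for talking-head evaluation.
import numpy as np

def landmark_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (frames, n_landmarks, 2) arrays of 2D landmark coordinates."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

frames, n_lm = 100, 68
print(landmark_distance(np.random.rand(frames, n_lm, 2),
                        np.random.rand(frames, n_lm, 2)))
```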
5. Robustness, Adaptivity, and Clinical Interpretability
MONA frameworks exhibit resilience to degraded or missing input streams and support interpretability:
- Sensor Fault Tolerance: Systems are robust to missing sensors (e.g., using four instead of nine EMA coils) or partially occluded EMG grids, due to adaptive fusion, context stacking, and auxiliary losses (Chen et al., 2021, Wang et al., 2022).
- Noise-Adaptive Processing: Modality reliability is assessed on-the-fly, allowing the model to weight robust channels (e.g., BMS under high noise or EMG during speech occlusion) and downregulate unreliable ones (Kim et al., 24 Aug 2025).
- Graceful Modality Dropout: Hierarchical VAEs and gating structures are designed such that, at inference, the absence of one modality causes the model to fall back on the available evidence with only a measured degradation in accuracy (Akhtar et al., 11 Jan 2026); see the sketch below.
- Clinical Interpretability: Disentangling shared vs. private latents, coupled with learnable “symptom tokens,” enables direct mapping of model predictions to clinical attributes (e.g., symmetry, speed, variability), facilitating transparent diagnostic or severity scoring (Akhtar et al., 11 Jan 2026).
Ablation analyses underscore that removal of cycle alignment, gating, or token specialization degrades accuracy, highlighting their importance for both robustness and explainability.
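The snippet below is a standalone sketch of graceful modality dropout: because a softmax gate renormalizes over whichever modalities are actually present, dropping a stream at inference reduces the model to the remaining evidence without retraining. This is one illustrative way to realize the property; the cited systems implement it differently (hierarchical VAEs, learned gates).

```python
# Graceful modality dropout sketch: the gate renormalizes over present streams.
import torch
import torch.nn as nn

enc = nn.ModuleDict({"audio": nn.Linear(80, 64), "video": nn.Linear(512, 64)})
gate = nn.Linear(64, 1)

def fuse(streams: dict[str, torch.Tensor]) -> torch.Tensor:
    feats = torch.stack([torch.tanh(enc[n](x)) for n, x in streams.items()], 1)
    w = torch.softmax(gate(feats), dim=1)  # renormalizes over present streams
    return (w * feats).sum(dim=1)

both = fuse({"audio": torch.randn(1, 80), "video": torch.randn(1, 512)})
video_only = fuse({"video": torch.randn(1, 512)})  # no retraining required
print(both.shape, video_only.shape)  # same output shape either way
```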
6. Benchmarks, Evaluation, and Quantitative Results
Extensive evaluation over public and custom datasets quantifies MONA’s advantages:
| Study / Task | Modalities | Noise Robustness | Key Metrics & Results |
|---|---|---|---|
| EMA2S Artic-to-Speech (Chen et al., 2021) | EMA, audio | Clean (ablation: 4 vs. 9 EMA coils) | MCD 7.176, PESQ 1.35, STOI 0.716, CCR 0.868 (EMA2S outperforms baseline), 83% listener preference |
| AVE Speech Recog. (Zhou et al., 28 Jan 2025) | Audio, video, EMG | Down to –10 dB SNR (audio); cross-subject | Transformer-fusion: 99.87% accuracy (clean), 95.51% (–10 dB), EMG 75.53%, video 98.55% (unimodal) |
| EMGSE Speech Enhancement (Wang et al., 2022) | Audio, EMG | SNR to –11 dB, speech noise/unseen noise | PESQ: 1.991 (EMG+audio), 0.20–0.25 above audio-only; STOI: 0.691 (EMG+audio); cheek-only channels ≈ full grid |
| BAF-Net (Kim et al., 24 Aug 2025) | AMS, BMS | SNR –20 to +15 dB w/ DNS noise, reverb | At –20 dB: PESQ 2.082, STOI 0.904; at +15 dB: PESQ 2.875. Outperforms best unimodal AMS/BMS at all SNRs |
| MONA LISA Silent Speech (Benster et al., 2024) | EMG, audio, text | Silent (no audio), open-vocab; competition tasks | Silent EMG WER: 12.2% (prev. best: 28.8%), vocal EMG: 3.7%, Brain2Text: 8.9% (prev.: 9.8%) |
| Diff-NAM Lip+NAM Speech (Shah et al., 2024) | NAM, lips, text | Silent speech, no voiced sound | WER: 17.2% (S₁), 21.7% (S₂) for lip+NAM+text diffusion, compared to >30% for lip alone, >140% for Mspec-Net |
| DIVINE Clinical (Akhtar et al., 11 Jan 2026) | Audio, video | Synchronized full/noisy/missing modalities | Full: 98.26% acc. (DeepSeek-VL2+TRILLsson), video-only: 89.3%, audio-only: 84.3%, interpretable symptom tokens |
Such systems consistently demonstrate that joint multimodal learning, compared to unimodal or naive multimodal baselines, yields higher signal fidelity, lower error rates, and greater robustness under noise, occlusion, and population variability. A minimal implementation of WER, the recognition metric reported above, is sketched below.
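WER is the Levenshtein distance between hypothesis and reference word sequences divided by the reference length, which is why it can exceed 100% (as for Mspec-Net above):

```python
# Word error rate via edit-distance dynamic programming.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```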
7. Extensions, Open Issues, and Future Directions
Research proposes several directions to extend MONA:
- Additional Modalities: Incorporation of ultrasound tongue imaging, EEG, or fNIRS for cases where EMG/NAM are not available or patients cannot murmur (Chen et al., 2021, Shah et al., 2024).
- Advances in Diffusion/Generative Models: Fast or real-time distillations of diffusion-based generators for deployment scenarios (Shah et al., 2024).
- Text/LLMs: Integration of large pre-trained language or articulatory models (Causal Transformers) for richer prosody and semantic control, as well as LLM-based rescoring for open-vocabulary recognition (Benster et al., 2024).
- Self-supervised/Low-Resource Adaptation: Improved phoneme alignment and transfer learning for generalizing to unseen speakers and low-resource datasets (Shah et al., 2024).
- Deployment: Sensor selection and miniaturization for fieldable silent speech devices; lightweight fusion architectures for real-time edge processing (Wang et al., 2022, Kim et al., 24 Aug 2025).
A plausible implication is that the continued unification of diverse orofacial and neural streams under robust, interpretable multimodal architectures will enable both next-generation silent speech prostheses and highly expressive virtual agents, while simultaneously advancing clinical monitoring and assessment of orofacial function.