Ultrasound Tongue Imaging: Methods & Applications
- Ultrasound Tongue Imaging is a non-invasive technique that visualizes tongue movement in real time using high-frequency B-mode ultrasound.
- It employs deep learning models like U-Net and Dense U-Net to automatically extract tongue contours with high accuracy and rapid processing speeds.
- UTI enables practical applications in articulatory phonetics, clinical speech therapy, and silent speech interfaces by integrating robust imaging with multimodal data.
Ultrasound Tongue Imaging (UTI) is a non-invasive imaging technique widely employed in articulatory phonetics, clinical linguistics, and speech technology to visualize and quantify tongue position and dynamics during speech. By emitting high-frequency sound waves submentally and capturing their tissue-specific echoes, UTI produces B-mode images of the midsagittal tongue surface at video frame rates. The resulting data, typically arranged as 2D scanline frames, has proven essential for the study of speech motor control, the development of silent speech interfaces, and actionable feedback in both clinical and second language acquisition contexts.
1. Foundations, Data Acquisition, and Core Preprocessing
UTI systems use curved or linear array probes operating at 2–10 MHz frequencies, placed under the chin to generate midsagittal tongue views. These probes are often stabilized by custom headsets or 3D-printed mounts to minimize motion artefacts. UTI frames typically contain 60–128 scan lines (width) by 400–900 echo samples (depth) per image, at frame rates of 60–120 Hz for articulatory phonetics research (Ribeiro et al., 2021, Zhu et al., 2019, Mozaffari et al., 2019).
Raw frames are subjected to:
- Cropping to isolate the tongue region from irrelevant structures (e.g., hyoid, jaw).
- Grayscale normalization to [0,1] or [0,255], optionally followed by mean–variance standardization.
- Spatial resizing (e.g., to 128×128 or 63×412 pixels) for computational uniformity and model compatibility.
- Augmentation (shifts, rotations, flips, zoom) during deep learning to address scarcity and promote generalization, with the most pronounced benefits under low-data regimes (<5,000 frames) (Zhu et al., 2019).
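The preprocessing steps above can be sketched as follows. This is a minimal illustration, not code from the cited papers: the crop coordinates, output size, and the nearest-neighbour resize are placeholders (a real pipeline would use an interpolating resize from e.g. OpenCV or scikit-image).

```python
import numpy as np

def preprocess_frame(frame, crop=None, out_shape=(128, 128)):
    """Crop, min-max normalize to [0, 1], and resize a raw UTI frame.

    `frame` is a 2D array of echo intensities (scanlines x depth samples);
    `crop` is an optional (row_slice, col_slice) isolating the tongue region.
    """
    if crop is not None:
        frame = frame[crop]
    frame = frame.astype(np.float32)
    frame = (frame - frame.min()) / (frame.max() - frame.min() + 1e-8)
    # Nearest-neighbour resize via index mapping (illustrative stand-in
    # for a proper interpolating resize).
    rows = np.linspace(0, frame.shape[0] - 1, out_shape[0]).round().astype(int)
    cols = np.linspace(0, frame.shape[1] - 1, out_shape[1]).round().astype(int)
    return frame[np.ix_(rows, cols)]

# Example: a synthetic 64-scanline x 512-sample frame
raw = np.random.default_rng(0).integers(0, 256, size=(64, 512))
out = preprocess_frame(raw)
print(out.shape)
```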
Careful probe stabilization and continuous monitoring of transducer alignment are critical, as session drift or gel loss introduces severe variability. Quantitative monitoring of mean squared error (MSE), structural similarity (SSIM), and complex wavelet SSIM (CW-SSIM) between mean images enables automatic detection of misalignment (e.g., MSE rising above 300 or SSIM below 0.17 marks >95% of misalignments in research datasets) (Csapó et al., 2020).
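A quality-assurance check against the reported thresholds might look like the sketch below. The MSE threshold of 300 and SSIM threshold of 0.17 are the values quoted above; the single-window SSIM here is a simplified stand-in for the windowed SSIM and CW-SSIM actually used by Csapó et al. (2020).

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM computed from whole-image statistics;
    a rough stand-in for a proper windowed SSIM implementation."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

def probe_misaligned(mean_ref, mean_cur, mse_thresh=300.0, ssim_thresh=0.17):
    """Flag a session whose mean image drifts past the reported thresholds."""
    return (mse(mean_ref, mean_cur) > mse_thresh
            or global_ssim(mean_ref, mean_cur) < ssim_thresh)
```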
2. Automated Tongue Contour Extraction: Deep Learning Architectures and Evaluation
Extracting the tongue surface contour from each ultrasound frame is a bottleneck for downstream shape analysis, quantification, and visualization. This process has shifted from labor-intensive manual annotation to fully automated deep learning-based pipelines (Zhu et al., 2019, Mozaffari et al., 2019).
Model Architectures
- U-Net: Encoder–decoder architecture with symmetric skip connections. Encoder path applies repeated (Conv3×3→ReLU→Conv3×3→ReLU→MaxPool2×2) blocks with feature map widths [32,64,128,256,512]. The decoder mirrors this in reverse, concatenating corresponding encoder features, culminating in a 1×1 convolution and sigmoid for a per-pixel probability map.
- Dense U-Net: Based on DenseNet-121 without the classifier top. Four dense blocks with growth rates (k=32) promote feature reuse; transition layers use 1×1 convolution and average pooling. The decoder upsamples via deconv and includes dense sub-blocks with skip connections.
- BowNet and wBowNet: Dual-path models with a UNet-like encoder–decoder for localization and a parallel path of dilated convolutions for global context; wBowNet interleaves the two streams at each block, improving gradient flow and feature sharing (Mozaffari et al., 2019).
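To make the U-Net layout above concrete, the following sketch traces feature-map sizes through the stated configuration (padded 3×3 convolutions, 2×2 max-pooling, widths [32,64,128,256,512]). It is a shape walk-through only, not a trainable implementation.

```python
def unet_shapes(h=128, w=128, widths=(32, 64, 128, 256, 512)):
    """Trace (channels, height, width) through the encoder-decoder:
    each encoder level applies two padded 3x3 convs then 2x2 max-pooling;
    the decoder mirrors it with 2x upsampling and skip concatenation."""
    enc = []
    for c in widths[:-1]:           # four pooled encoder levels
        enc.append((c, h, w))       # features saved for the skip connection
        h, w = h // 2, w // 2       # 2x2 max-pool halves each spatial dim
    bottleneck = (widths[-1], h, w)
    dec = []
    for c, _, _ in reversed(enc):
        h, w = h * 2, w * 2         # upsampling doubles each spatial dim
        dec.append((c, h, w))       # after concatenating the skip features
    return enc, bottleneck, dec

enc, mid, dec = unet_shapes()
print(mid)   # bottleneck shape for a 128x128 input
```

A 128×128 input thus reaches an 8×8 bottleneck after four poolings, and the final 1×1 convolution maps the restored 32×128×128 features to the per-pixel probability map.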
Loss Functions
Segmentation objectives combine:
- Binary Cross-Entropy (BCE): Standard pixelwise criterion.
- Dice Loss: Sensitive to class imbalance, measuring overlap between predicted and ground-truth tongue masks.
- Compound Loss: Weighted sum of Dice and BCE (e.g., L = α·L_Dice + (1−α)·L_BCE), enhancing both pixel accuracy and class-imbalance compensation (Zhu et al., 2019).
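The three objectives can be written in a few lines of numpy. The weighting `alpha` below is illustrative, not the value used by Zhu et al. (2019):

```python
import numpy as np

def bce_loss(p, t, eps=1e-7):
    """Pixelwise binary cross-entropy between prediction p and target t."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(t * np.log(p) + (1 - t) * np.log(1 - p)))

def dice_loss(p, t, eps=1e-7):
    """1 - Dice coefficient; measures mask overlap, robust to imbalance."""
    inter = np.sum(p * t)
    return float(1 - (2 * inter + eps) / (np.sum(p) + np.sum(t) + eps))

def compound_loss(p, t, alpha=0.5):
    """Weighted Dice + BCE; alpha is an illustrative weight."""
    return alpha * dice_loss(p, t) + (1 - alpha) * bce_loss(p, t)
```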
Evaluation and Performance
Performance is predominantly quantified by mean sum of distances (MSD, in pixels or mm) between extracted and true contours. On ≥17,000 same-speaker test frames, U-Net and Dense U-Net achieve MSD of 3.42 and 3.25 px (≈0.85, 0.81 mm), respectively; on cross-speaker test sets, Dense U-Net achieves 5.0 px MSD vs. a 2.79 px benchmark for inter-annotator human error (Zhu et al., 2019).
- U-Net: ~63 fps extraction on consumer GPUs.
- Dense U-Net: ~29 fps; greater generalization across datasets (e.g., French speakers, diverse devices).
- wBowNet: 0.032 mm MSD, 66 fps; BowNet: 0.042 mm, 82 fps (Mozaffari et al., 2019).
Dense connectivity and multi-path fusion both enhance generalization when probe, noise, or subject variability is high.
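One common reading of the MSD metric quoted above is a symmetric mean nearest-neighbour distance between the predicted and reference contours; the exact definition varies by paper, so the sketch below is an assumption:

```python
import numpy as np

def mean_sum_of_distances(pred, truth):
    """Symmetric mean nearest-neighbour distance (in pixels) between two
    contours, each given as an (N, 2) array of (x, y) points."""
    d = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=-1)
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))
```

For instance, a contour shifted 3 px vertically from its reference scores an MSD of 3 px; at the ~0.25 mm/px resolutions implied above, pixel MSDs convert to millimetres by a scale factor.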
3. UTI-Based Speech Analysis: Classification and Error Detection
Phonetic Segment Classification
Speaker-adaptive and speaker-independent DNNs (feedforward or CNNs) operating on raw or compressed (eigentongue/DCT) UTI inputs can classify broad phonetic categories (e.g., bilabial, velar, alveolar) in child speech therapy data. CNNs with speaker mean conditioning achieve up to 67% accuracy in the most stringent leave-one-speaker-out setups; 50–100 adaptation frames from a new speaker can recover most generalization loss (Ribeiro et al., 2019). More recently, FusionNet, a dual-branch CNN+texture network, achieved 82.3% precision in speaker-independent classification (substantially outpacing ResNet and Inception baselines) in early-childhood phonetic segment classification (Ani et al., 2024).
Articulation Error Detection
Lightweight CNN+FC architectures trained on both raw ultrasound and audio features, pre-trained on adult data and fine-tuned on child speech, detect segmental errors (e.g., velar fronting) with ≈86.9% accuracy on typical speech and 83.7% on disordered speech, matching "substantial" expert agreement (Cohen’s κ) (Ribeiro et al., 2021). Adding UTI to audio features delivers up to +22 percentage points in binary classification accuracy for specific speech sound errors.
4. UTI for Silent Speech Interfaces and Articulatory-to-Acoustic Mapping
UTI serves as a biosignal for both silent speech interface (SSI) and direct articulatory–acoustic mapping:
- Articulatory-to-acoustic mapping: 2-layer CNNs can predict continuous vocoder parameters (e.g., Mel-generalized cepstrum, continuous F₀, maximum voiced frequency) with an F₀ RMSE of ≈30 Hz—halving the error relative to discontinuous pitch models (Csapó et al., 2019).
- Text-to-articulation prediction: Speaker-dependent FC-DNNs (6-layer, 1,024 units, tanh) trained to predict PCA-projected UTI features (128-dim, capturing 70% variance) from linguistic inputs yield RMSE of 3.4–3.5 on PCA coefficients and enable synthesis of synthetic UTI "videos" for visualization and multimodal feedback in TTS (Csapó, 2021).
- Tacotron2-style adaptation: A pipeline of 3D CNN → symbol embedding → Tacotron2 decoder → WaveGlow vocoder enables ultrasound-only-to-speech synthesis with listening scores (MUSHRA-style) of ≈43, surpassing prior UTI-based baselines (Zainkó et al., 2021).
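The PCA projection used as the FC-DNN target above can be sketched via SVD on flattened frames. The 128-dimension / 70%-variance figures come from the cited work; the function names and data here are illustrative:

```python
import numpy as np

def fit_pca(frames_flat, n_components=128):
    """PCA via SVD on flattened UTI frames (one row per frame). Returns
    the mean, the top components, and the fraction of variance retained."""
    mean = frames_flat.mean(axis=0)
    centered = frames_flat - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s ** 2
    explained = float(var[:n_components].sum() / var.sum())
    return mean, vt[:n_components], explained

def project(frames_flat, mean, components):
    """Map frames to their low-dimensional PCA coefficients."""
    return (frames_flat - mean) @ components.T

# Illustrative: 200 synthetic frames of 500 pixels each
frames = np.random.default_rng(1).standard_normal((200, 500))
mean, comps, explained = fit_pca(frames)
coeffs = project(frames, mean, comps)
print(coeffs.shape, round(explained, 2))
```

A text-to-articulation model then regresses these coefficients, and frames are reconstructed as `mean + coeffs @ comps` for visualization.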
Denoising convolutional autoencoders (DCAE) also extract robust, compact features useful for silent-speech recognition and outperform DCT and standard AEs on the Silent Speech Challenge recognition task (WER = 6.17% for DCAE, 6.45% for DCT) (Li et al., 2019).
5. Multimodal and Self-Supervised UTI Applications
Recent work leverages UTI in hybrid, multimodal, or weakly-supervised frameworks:
- LLM Integration: UTI-LLM fuses ultrasound-based kinematic trajectories and speech features via spatiotemporal fusion modules into a multimodal LLM, yielding +12–19 percentage point improvements in dysarthria assessment (0.91 accuracy/F1) over strong audio-only LLMs, and delivering actionable, frame-level feedback to patients (Yang et al., 2025).
- Acoustic-to-articulatory inversion: Audio–textual diffusion models embed speaker-specific acoustic cues (wav2vec2.0) and universal phonological content (BERT on ASR transcripts) via cross-attention. This produces high-fidelity synthetic UTI (FID = 22.02 vs. 256.80 for standard DNNs) with sharp contours, enabling downstream analysis and clinical use (Yang et al., 2024).
- Speech–ultrasound cross-modal modeling: Two-stream self-supervised CNN–LSTM architectures reconstruct UTI sequences from synchronized lip video, with a final SSIM of 0.728 and MSD of 4.95 mm, improving over ConvLSTM baselines and requiring no manual contour labels (Liu et al., 2021).
6. Clinical, Pedagogical, and Research Applications
UTI is fundamental for:
- Articulatory phonetics: quantifying dynamic tongue deformation patterns, validating phonological models.
- Speech therapy: enabling visual biofeedback for patients and clinicians, especially in pediatric and post-stroke populations. Automated scoring systems integrated into therapy tools objectively quantify articulatory accuracy and progress (Ribeiro et al., 2021, Yang et al., 2025).
- L2 acquisition: real-time, multimodal visualization systems (e.g., UltraTongue, stabilized probes with deep segmentation) help learners match native tongue postures and trajectories (Mozaffari et al., 2019).
- Speech recognition: UTI-based articulatory features robustly improve ASR performance in elderly, dysarthric, and cross-lingual scenarios—delivering up to 4.75% absolute and 14–22% relative WER reductions when fused with standard acoustic models (Hu et al., 2022, Ribeiro et al., 2019).
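To keep the absolute vs. relative WER figures above straight, the conversion is simple arithmetic; the baseline number below is purely illustrative, not taken from the cited papers:

```python
def wer_reductions(baseline_wer, fused_wer):
    """Absolute (percentage-point) and relative WER reduction."""
    absolute = baseline_wer - fused_wer
    relative = absolute / baseline_wer
    return absolute, relative

# Illustrative: a baseline WER of 25.0% dropping to 20.25% after fusing
# articulatory features is a 4.75-point absolute, 19% relative reduction.
abs_red, rel_red = wer_reductions(25.0, 20.25)
print(abs_red, round(rel_red * 100, 1))  # 4.75 19.0
```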
Automatic tongue contour extraction—specifically via Dense U-Net, BowNet, and related high-throughput CNNs—has become a necessary base layer for almost every advanced UTI application, enabling efficient downstream modeling and large-scale deployment (Zhu et al., 2019, Mozaffari et al., 2019).
7. Challenges, Model Generalization, and Future Directions
Ongoing challenges include:
- Cross-speaker and cross-domain generalization under varying probe placement, anatomy, and device characteristics. Dense architectures and explicit speaker adaptation/mean conditioning improve robustness, but further advances are needed for true speaker independence (Zhu et al., 2019, Ribeiro et al., 2019).
- Stability under transducer drift or gel loss; integration of automated QA metrics (MSE, SSIM, CW-SSIM) into real-time pipelines is now recommended (Csapó et al., 2020).
- Transition to temporal and 3D modeling: Most deep models process independent frames or short clips; ConvLSTM, 3D CNNs, and multi-slice/fusion systems represent growth areas, promising more accurate coarticulation and dynamic gesture capture (Zainkó et al., 2021, Mozaffari et al., 2019).
Future research directions include real-time, on-device UTI processing and biofeedback, richer self-supervised representation learning across large unlabelled datasets, volumetric/3D UTI, and robust speaker adaptation for diverse populations and clinical pathologies. High-quality, standardized corpora and open-source toolchains remain critical for further progress and reproducibility.
References:
(Ani et al., 2024; Csapó et al., 2019; Csapó et al., 2020; Csapó, 2021; Csapó et al., 2023; Hu et al., 2022; Li et al., 2019; Liu et al., 2021; Mozaffari et al., 2019; Porras et al., 2019; Ribeiro et al., 2019; Ribeiro et al., 2021; Xu et al., 2021; Yang et al., 2024; Yang et al., 2025; Zainkó et al., 2021; Zhu et al., 2019)