Conformer with bi-LSTM for Ultrasound-to-Speech
- The paper demonstrates that integrating bi-LSTM layers after Conformer blocks improves perceptual speech naturalness in ultrasound-to-speech conversion.
- It outlines an architecture with stacked Conformer blocks followed by bi-LSTM layers, enabling effective local/global feature extraction and temporal context enforcement.
- The study shows that despite similar objective metrics, the bi-LSTM integration yields significantly higher subjective ratings, emphasizing its practical benefits.
Conformer with bi-LSTM refers to a hybrid neural architecture integrating Conformer blocks with stacked bidirectional Long Short-Term Memory (bi-LSTM) layers. In the context of ultrasound-to-speech conversion—the mapping from sequences of tongue ultrasound scan-lines to speech representations—this architecture offers competitive objective performance with improvements in perceived naturalness measured by subjective metrics. The approach leverages the representational power of Conformer modules for local/global feature modeling and the temporal context integration of bi-LSTMs to enforce sequence consistency, particularly in segments where acoustic cues are subtle or temporally dispersed (Ibrahimov et al., 4 Jun 2025).
1. Architectural Composition
The Conformer with bi-LSTM model processes input sequences of ultrasound frames, where $T$ denotes the temporal dimension (each frame corresponding to a 12 ms window). The architecture consists of:
- An input linear projection
- Four stacked Conformer blocks, each sandwiching a Multi-Head Self-Attention (MHSA) layer ($32$ heads) and a depthwise convolution module between two Feed-Forward modules, with layer normalization and residual connections
- A subsequent LayerNorm layer
- Two bi-directional LSTM layers (parameter count implies $256$–$320$ units per direction)
- A final linear layer projecting each frame to an $80$-dimensional mel-spectrogram
The forward pass is: input → linear projection → 4 × ConformerBlock → LayerNorm → 2 × bi-LSTM → linear → $80$-dimensional mel frame, where each ConformerBlock applies an expansion factor of $3$ in its feed-forward module.
The key architectural distinction is the introduction of the stacked bi-LSTM layers after the final Conformer block, as compared to a Conformer-only "Base" variant.
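The layer ordering described in this section can be traced as a sequence of per-frame tensor shapes. The following is a minimal sketch, not the authors' implementation: `HybridConfig`, `layer_plan`, and the values passed for `input_dim`, `d_model`, and `lstm_hidden` are hypothetical, and flattening each ultrasound frame into a single vector is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HybridConfig:
    input_dim: int                 # flattened ultrasound frame size (hypothetical)
    d_model: int                   # Conformer width (hypothetical, not given here)
    lstm_hidden: int               # units per LSTM direction (hypothetical)
    num_conformer_blocks: int = 4  # four stacked Conformer blocks
    num_lstm_layers: int = 2       # two bi-LSTM layers
    n_mels: int = 80               # 80-dimensional mel-spectrogram output


def layer_plan(cfg, num_frames):
    """Return (layer_name, output_shape) pairs for a sequence of num_frames frames."""
    plan = [("input", (num_frames, cfg.input_dim))]
    plan.append(("linear_projection", (num_frames, cfg.d_model)))
    for i in range(cfg.num_conformer_blocks):
        # Conformer blocks are residual, so the width stays at d_model.
        plan.append((f"conformer_block_{i + 1}", (num_frames, cfg.d_model)))
    plan.append(("layer_norm", (num_frames, cfg.d_model)))
    for i in range(cfg.num_lstm_layers):
        # Forward and backward states are concatenated, doubling the width.
        plan.append((f"bi_lstm_{i + 1}", (num_frames, 2 * cfg.lstm_hidden)))
    plan.append(("output_linear", (num_frames, cfg.n_mels)))
    return plan


# Illustrative sizes only; none of these three values comes from the paper.
example_plan = layer_plan(HybridConfig(input_dim=8192, d_model=256, lstm_hidden=256), 100)
for name, shape in example_plan:
    print(f"{name:20s} {shape}")
```

Dropping the two `bi_lstm_*` entries from the plan gives the Conformer-only "Base" variant, which is the comparison the section draws.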
2. Training Setup and Data Processing
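Training employs a mean squared error (MSE) objective between the predicted and reference $80$-bin mel-spectrogram frames:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{80\,T}\sum_{t=1}^{T}\left\lVert \hat{\mathbf{y}}_t - \mathbf{y}_t \right\rVert_2^2$$
where $\hat{\mathbf{y}}_t, \mathbf{y}_t \in \mathbb{R}^{80}$ are the predicted and reference mel frames at time step $t$ and $T$ is the number of frames.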
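As a concrete illustration of the MSE objective, the sketch below averages squared differences over all frames and mel bins. The function name `mel_mse` and the toy values are made up for illustration; real frames would have 80 bins, and the paper's exact normalization is assumed to be a plain mean.

```python
def mel_mse(predicted, reference):
    """Mean squared error over all frames and mel bins.

    Both arguments are lists of frames; each frame is a list of mel-bin values.
    """
    if len(predicted) != len(reference):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_frame, r_frame in zip(predicted, reference):
        if len(p_frame) != len(r_frame):
            raise ValueError("mel-bin counts differ")
        for p, r in zip(p_frame, r_frame):
            total += (p - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("no values to compare")
    return total / count


# Toy example: 2 frames of 3 bins each (illustrative values only).
pred = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
ref = [[0.0, 0.0, 2.0], [1.0, 3.0, 1.0]]
toy_loss = mel_mse(pred, ref)  # squared errors sum to 1 + 4 = 5 over 6 values
print(toy_loss)
```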
Optimization uses AdamW with a cosine-decay-with-restarts learning-rate schedule: each cycle is longer than the previous one, the peak learning rate decays from cycle to cycle, and the rate never falls below a fixed minimum. Early stopping monitors dev-set MSE with a fixed patience, subject to a maximum number of epochs.
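A cosine schedule with warm restarts can be sketched as follows. This is a generic illustration, not the authors' code: the function name `cosine_restart_lr` and every numeric value in the usage lines (`1e-3`, `1e-5`, `100`, `2.0`, `0.5`) are hypothetical placeholders rather than the paper's settings.

```python
import math


def cosine_restart_lr(step, base_lr, min_lr, first_cycle, cycle_mult=2.0, peak_decay=0.5):
    """Learning rate at a given step for cosine decay with restarts.

    first_cycle: length of the first cycle in steps.
    cycle_mult:  factor by which each subsequent cycle grows.
    peak_decay:  factor applied to the peak learning rate at each restart.
    """
    cycle_len = first_cycle
    peak = base_lr
    start = 0
    # Walk forward through completed cycles until the one containing `step`.
    while step >= start + cycle_len:
        start += cycle_len
        cycle_len = int(cycle_len * cycle_mult)
        peak *= peak_decay
    peak = max(peak, min_lr)  # the peak never drops below the floor
    progress = (step - start) / cycle_len
    # Cosine interpolation from `peak` (progress 0) down towards `min_lr` (progress 1).
    return min_lr + 0.5 * (peak - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to show the shape of the schedule.
lr_start = cosine_restart_lr(0, base_lr=1e-3, min_lr=1e-5, first_cycle=100)
lr_restart = cosine_restart_lr(100, base_lr=1e-3, min_lr=1e-5, first_cycle=100)
print(lr_start, lr_restart)
```

With these placeholder settings the rate starts at the base value, decays towards the floor over the first 100 steps, then restarts at half the original peak for a cycle twice as long.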
Data from the UltraSuite-TaL80 corpus are pre-processed as follows: ultrasound scan-line images are downsampled with bicubic interpolation, pixel values are normalized to a fixed range, and target mel-spectrograms are extracted with a Hann window.
3. Model Complexity and Training Efficiency
Parameter and compute cost comparisons are summarized below:
| Model | Parameters | Relative Training Time |
|---|---|---|
| 2D-CNN baseline | 4.09 M | 100% |
| Conformer Base | 2.66 M | 30% |
| Conformer + bi-LSTM | 5.35 M | 80% |
The Conformer Base trains in 30% of the CNN baseline's time and has about 35% fewer parameters (2.66 M vs. 4.09 M). The bi-LSTM variant, while larger at 5.35 M parameters, still trains in 80% of the baseline's time.
A plausible implication is that although adding bi-LSTM increases model size, efficiency remains acceptable for practical development cycles.
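The parameter gap between the two Conformer variants in the table (5.35 M − 2.66 M ≈ 2.69 M) can be compared against the standard LSTM parameter formula. The sketch below is an estimate under stated assumptions: the Conformer output width of 256 is hypothetical (it is not given in this text), and two bias vectors per gate are assumed, as in PyTorch-style LSTM layers.

```python
def lstm_layer_params(input_size, hidden_size, bidirectional=True, bias_vectors=2):
    """Parameter count of one LSTM layer with four gates per direction."""
    per_direction = 4 * (
        hidden_size * (input_size + hidden_size)  # input and recurrent weights
        + bias_vectors * hidden_size              # bias terms
    )
    return per_direction * (2 if bidirectional else 1)


def stacked_bilstm_params(input_size, hidden_size, num_layers=2):
    """Total parameters of stacked bi-LSTM layers; layers after the first see 2 * hidden_size inputs."""
    total = 0
    in_size = input_size
    for _ in range(num_layers):
        total += lstm_layer_params(in_size, hidden_size)
        in_size = 2 * hidden_size  # concatenated forward and backward states
    return total


ASSUMED_D_MODEL = 256  # hypothetical Conformer width, not taken from the paper
for hidden in (256, 320):
    print(hidden, stacked_bilstm_params(ASSUMED_D_MODEL, hidden))
```

Under these assumed sizes, 256 units per direction gives 2,629,632 parameters, close to the 2.69 M gap in the table. The gap also includes the wider input to the final linear layer, so this is a rough consistency check, not a derivation of the model's actual dimensions.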
4. Objective Performance Metrics
Objective evaluations used MSE and Mel-Cepstral Distortion (MCD, in dB), computed across four speakers on the test set:
Mean Squared Error (MSE)
| Model | 01fi | 02fe | 03mn | 04me |
|---|---|---|---|---|
| 2D-CNN baseline | 0.464 | 0.623 | 0.395 | 0.484 |
| Conformer Base | 0.511 | 0.618 | 0.462* | 0.524 |
| Conformer + bi-LSTM | 0.482 | 0.581 | 0.378 | 0.449 |
*Statistically significant difference for 03mn (Conformer Base vs. baseline).
Mel-Cepstral Distortion (MCD)
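In its commonly used per-frame form,
$$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{K}\left(c_k - \hat{c}_k\right)^2}$$
where $c_k$ and $\hat{c}_k$ are the reference and predicted mel-cepstral coefficients.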
| Model | 01fi | 02fe | 03mn | 04me |
|---|---|---|---|---|
| 2D-CNN baseline | 3.221 | 3.009 | 3.641 | 3.172 |
| Conformer Base | 3.517* | 3.121 | 4.133* | 3.465* |
| Conformer + bi-LSTM | 3.253 | 3.037 | 3.704 | 3.258* |
*Entries marked with an asterisk denote a statistically significant difference vs. baseline.
Across both metrics, neither Conformer-based model demonstrates consistent significant improvements over the baseline for all speakers. Isolated differences are observed (notably, the Base variant performs worse than the baseline for certain metrics/speakers).
5. Subjective Speech Naturalness Evaluation
Subjective perceptual evaluation used a MUSHRA listening test. Twenty utterances (five per speaker) were rated by listeners on a 0–100 scale, with the following average naturalness scores:
| System | Avg. Score |
|---|---|
| Lower anchor (noisy) | ~18 |
| 2D-CNN baseline | ~49 |
| Conformer Base | ~49 |
| Conformer + bi-LSTM | ~54 |
| Natural speech | ~92 |
Paired Wilcoxon tests revealed:
- bi-LSTM vs. baseline: significant improvement
- Base vs. baseline: no significant difference
- bi-LSTM vs. Base: bi-LSTM significantly better
The Conformer + bi-LSTM variant thus delivers a perceptually significant gain, despite lacking clear superiority in objective metrics.
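For readers unfamiliar with the paired Wilcoxon test, the sketch below computes the signed-rank sums and a two-sided p-value from the normal approximation. It is a simplified illustration, not the authors' analysis: it applies no tie or continuity correction, the approximation is poor for small samples, and the ratings in the usage lines are invented, not taken from the listening test.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation, no corrections).

    Returns (w_plus, w_minus, z, p_two_sided). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, assigning the average rank to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (min(w_plus, w_minus) - mean) / sd
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, w_minus, z, p_two_sided


# Invented ratings for two systems on the same five items (illustration only).
system_a = [51, 48, 53, 54, 55]
system_b = [50, 50, 50, 50, 50]
w_plus, w_minus, z, p = wilcoxon_signed_rank(system_a, system_b)
print(w_plus, w_minus, z, p)
```

The test is paired because each listener rates every system on the same utterance, so only the within-pair differences carry information about which system is preferred.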
6. Interpretive Comparison and Significance
The bi-LSTM integration provides additional sequence modeling capacity by incorporating both past and future context at the frame level, which smooths prediction errors—especially in silent/unvoiced frames and during formant transitions. While Conformer blocks are effective for local and global context extraction, the inclusion of bi-LSTM layers enforces temporal consistency, leading to fewer artifacts and improved perceptual quality.
This suggests that, in the ultrasound-to-speech setting, hybrid architectures combining MHSA-based local/global feature modeling with recurrent temporal aggregation may be especially valuable when the evaluation focus is perceptual rather than solely objective.
7. Discussion and Implications
The Conformer with bi-LSTM outperforms both the Conformer Base and the 2D-CNN baseline in subjective measures without incurring excessive training cost. A plausible implication is that further perceptual gains may be achievable by combining attention-based and recurrent modules in sequence-to-sequence acoustic mapping domains characterized by sparse or ambiguous local structure, such as silent speech interfaces.
Objective metrics may be insufficient for detecting perceptual quality improvements introduced by temporally global modeling, highlighting the necessity for human-in-the-loop evaluation in such systems. Future work could investigate architectural variants, alternative regularizations, or domain-specific pretraining within this paradigm (Ibrahimov et al., 4 Jun 2025).