Conformer with bi-LSTM for Ultrasound-to-Speech

Updated 13 May 2026

The paper demonstrates that integrating bi-LSTM layers after Conformer blocks improves perceptual speech naturalness in ultrasound-to-speech conversion.
It outlines an architecture with stacked Conformer blocks followed by bi-LSTM layers, enabling effective local/global feature extraction and temporal context enforcement.
The study shows that despite similar objective metrics, the bi-LSTM integration yields significantly higher subjective ratings, emphasizing its practical benefits.

Conformer with bi-LSTM refers to a hybrid neural architecture integrating Conformer blocks with stacked bidirectional Long Short-Term Memory (bi-LSTM) layers. In the context of ultrasound-to-speech conversion—the mapping from sequences of tongue ultrasound scan-lines to speech representations—this architecture offers competitive objective performance with improvements in perceived naturalness measured by subjective metrics. The approach leverages the representational power of Conformer modules for local/global feature modeling and the temporal context integration of bi-LSTMs to enforce sequence consistency, particularly in segments where acoustic cues are subtle or temporally dispersed (Ibrahimov et al., 4 Jun 2025).

1. Architectural Composition

The Conformer with bi-LSTM model processes input ultrasound sequences of shape $(T, 64)$, where $T$ is the temporal dimension (each frame corresponding to a 12 ms window). The architecture consists of:

An input linear projection
Four stacked Conformer blocks, each combining feed-forward modules, Multi-Head Self-Attention (MHSA; $d_\mathrm{model}=256$, $32$ heads), and a depthwise convolution ($\mathrm{kernel\_size}=31$), with layer normalization and residual connections
A subsequent LayerNorm layer
Two bi-directional LSTM layers (parameter count implies $256$–$320$ units per direction)
A final linear layer projecting each frame to an $80$-dimensional mel-spectrogram

The encoder-decoder pseudocode is: $\begin{aligned} \mathbf{X}_0 &= \mathrm{Linear}_\mathrm{in}(\mathrm{UTIF}) \\ \mathbf{X}_i &= \mathrm{ConformerBlock}(\mathbf{X}_{i-1}),\quad i=1\dots4 \\ \mathbf{H} &= \mathrm{LayerNorm}(\mathbf{X}_4) \\ \mathbf{H}' &= \mathrm{BiLSTM}_2(\mathbf{H}) \\ \widehat{\mathbf{Y}}_t &= \mathrm{Linear}_\mathrm{out}(\mathbf{H}'_t), \quad t=1\dots T \end{aligned}$ where each ConformerBlock applies an expansion factor of $3$ in its feed-forward module.

The key architectural distinction is the introduction of the stacked bi-LSTM layers after the final Conformer block, as compared to a Conformer-only "Base" variant.
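The data flow above can be checked with a small shape trace. This is only an illustrative sketch, not the authors' code: the function name `trace_shapes` is hypothetical, and `lstm_hidden=256` is an assumed value taken from the lower end of the 256 to 320 range quoted above.

```python
from typing import List, Tuple

Shape = Tuple[int, int]


def trace_shapes(T: int, in_dim: int = 64, d_model: int = 256,
                 n_blocks: int = 4, lstm_hidden: int = 256,
                 n_mels: int = 80) -> List[Tuple[str, Shape]]:
    """Return (stage name, output shape) pairs for one utterance of T frames."""
    stages: List[Tuple[str, Shape]] = [("input UTIF", (T, in_dim))]
    stages.append(("Linear_in", (T, d_model)))
    for i in range(1, n_blocks + 1):
        # Conformer blocks are residual, so they preserve (T, d_model).
        stages.append((f"ConformerBlock {i}", (T, d_model)))
    stages.append(("LayerNorm", (T, d_model)))
    # A bidirectional LSTM concatenates forward and backward hidden states.
    # lstm_hidden=256 is an assumption, not a value confirmed by the source.
    stages.append(("BiLSTM x2", (T, 2 * lstm_hidden)))
    stages.append(("Linear_out", (T, n_mels)))
    return stages


if __name__ == "__main__":
    for name, shape in trace_shapes(T=100):
        print(f"{name:>16}: {shape}")
```

The time dimension $T$ is unchanged at every stage, so the model emits exactly one mel-spectrogram frame per ultrasound frame.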

2. Training Setup and Data Processing

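Training employs a mean squared error (MSE) objective between the predicted and reference $80$-bin mel-spectrogram frames, averaged over frames and mel bins: $\mathcal{L}_\mathrm{MSE} = \frac{1}{80\,T}\sum_{t=1}^{T} \lVert \widehat{\mathbf{Y}}_t - \mathbf{Y}_t \rVert_2^2$ where $\widehat{\mathbf{Y}}_t, \mathbf{Y}_t \in \mathbb{R}^{80}$.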
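As a minimal sketch of this objective, the function below computes the MSE between two mel-spectrograms stored as nested lists. The name `mel_mse` is hypothetical, and averaging over both frames and mel bins is an assumption about the normalization rather than something the source confirms.

```python
from typing import Sequence


def mel_mse(pred: Sequence[Sequence[float]],
            ref: Sequence[Sequence[float]]) -> float:
    """Mean squared error averaged over all frames and all mel bins."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of frames")
    total = 0.0
    count = 0
    for p_frame, r_frame in zip(pred, ref):
        if len(p_frame) != len(r_frame):
            raise ValueError("frames must have the same number of mel bins")
        for p, r in zip(p_frame, r_frame):
            total += (p - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("empty input")
    return total / count


if __name__ == "__main__":
    # Toy data (not from the paper): 3 frames of 80 bins, off by 0.5 everywhere.
    ref = [[0.0] * 80 for _ in range(3)]
    pred = [[0.5] * 80 for _ in range(3)]
    print(mel_mse(pred, ref))
```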

Optimization uses AdamW with a cosine-decay learning-rate schedule with restarts: each cycle is longer than the previous one, the peak learning rate is reduced from cycle to cycle, and the rate never falls below a fixed minimum. Early stopping monitors dev-set MSE with a fixed patience, subject to a cap on the total number of epochs.
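A cosine schedule with restarts of this kind can be sketched as follows. The function name `cosine_restart_lr` and every numeric value in the usage example are hypothetical placeholders chosen for illustration; they are not the hyperparameters used in the paper.

```python
import math


def cosine_restart_lr(step: int, base_lr: float, first_cycle_steps: int,
                      cycle_mult: float, peak_decay: float,
                      min_lr: float) -> float:
    """Learning rate at `step` for cosine decay with restarts.

    cycle_mult: factor by which each cycle is longer than the previous one.
    peak_decay: factor applied to the peak learning rate at each restart.
    """
    if first_cycle_steps <= 0 or cycle_mult < 1.0:
        raise ValueError("need first_cycle_steps > 0 and cycle_mult >= 1")
    start = 0.0
    cycle_len = float(first_cycle_steps)
    peak = base_lr
    # Walk forward to the cycle that contains `step`.
    while step >= start + cycle_len:
        start += cycle_len
        cycle_len *= cycle_mult
        peak *= peak_decay
    progress = (step - start) / cycle_len  # in [0, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Decay from the current peak toward min_lr within the cycle.
    return min_lr + (max(peak, min_lr) - min_lr) * cosine


if __name__ == "__main__":
    # Placeholder hyperparameters, for illustration only.
    for s in (0, 50, 99, 100, 200, 300):
        lr = cosine_restart_lr(s, base_lr=1e-3, first_cycle_steps=100,
                               cycle_mult=2.0, peak_decay=0.5, min_lr=1e-5)
        print(s, f"{lr:.6f}")
```

Within a cycle the rate falls smoothly from the current peak to the minimum, and at each restart it jumps back up to a reduced peak.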

Data from the Ultrasuite-Tal80 corpus are pre-processed as follows: ultrasound scan-line images are downsampled with bicubic interpolation, pixel values are normalized to a fixed range, and target mel-spectrograms are extracted using a Hann window.

3. Model Complexity and Training Efficiency

Parameter and compute cost comparisons are summarized below:

Model	Parameters	Relative Training Time
2D-CNN baseline	4.09 M	100%
Conformer Base	2.66 M	30%
Conformer + bi-LSTM	5.35 M	80%

The Conformer Base trains in 30% of the CNN baseline's time and has about 35% fewer parameters (2.66 M vs. 4.09 M). The bi-LSTM variant is larger, at 5.35 M parameters, yet still trains in 80% of the baseline's time.

A plausible implication is that although adding bi-LSTM increases model size, efficiency remains acceptable for practical development cycles.
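The parameter gap between the two Conformer variants (5.35 M minus 2.66 M, about 2.69 M) can be compared against a back-of-the-envelope LSTM count. The sketch below assumes a PyTorch-style LSTM parameterization (input and recurrent weights plus two bias vectors per direction) and assumes that the Base variant's output projection takes a 256-dimensional input; the helper name `lstm_param_count` is hypothetical.

```python
def lstm_param_count(input_size: int, hidden_size: int, num_layers: int = 1,
                     bidirectional: bool = True,
                     double_bias: bool = True) -> int:
    """Count trainable parameters of a stacked (bi-)LSTM."""
    directions = 2 if bidirectional else 1
    bias_vectors = 2 if double_bias else 1
    total = 0
    layer_in = input_size
    for _ in range(num_layers):
        # Four gates, each with input weights, recurrent weights and biases.
        per_direction = (4 * hidden_size * (layer_in + hidden_size)
                         + bias_vectors * 4 * hidden_size)
        total += directions * per_direction
        # Later layers consume the concatenated outputs of both directions.
        layer_in = directions * hidden_size
    return total


if __name__ == "__main__":
    observed_gap = 5.35e6 - 2.66e6  # from the table above
    for h in (256, 320):
        lstm = lstm_param_count(256, h, num_layers=2)
        # Assumed: the output projection widens from 256 -> 80 to 2h -> 80.
        extra_out = (2 * h - 256) * 80
        print(h, lstm + extra_out, f"observed gap ~ {observed_gap:.0f}")
```

Under these assumptions, 256 units per direction adds about 2.65 M parameters, which is close to the observed gap, whereas 320 units adds roughly 3.97 M.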

4. Objective Performance Metrics

Objective evaluations used MSE and Mel-Cepstral Distortion (MCD, in dB), computed across four speakers on the test set:

Mean Squared Error (MSE)

Model	01fi	02fe	03mn	04me
2D-CNN baseline	0.464	0.623	0.395	0.484
Conformer Base	0.511	0.618	0.462*	0.524
Conformer + bi-LSTM	0.482	0.581	0.378	0.449

*Statistically significant difference for 03mn (Conformer Base vs. baseline).

Mel-Cepstral Distortion (MCD)

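$\mathrm{MCD} = \frac{10}{\ln 10}\sqrt{2\sum_{d=1}^{D}\left(c_d - \hat{c}_d\right)^2}$, where $c_d$ and $\hat{c}_d$ are the reference and predicted mel-cepstral coefficients of a frame (standard definition).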

Model	01fi	02fe	03mn	04me
2D-CNN baseline	3.221	3.009	3.641	3.172
Conformer Base	3.517*	3.121	4.133*	3.465*
Conformer + bi-LSTM	3.253	3.037	3.704	3.258*

*Entries marked with an asterisk denote a statistically significant difference vs. baseline.

Across both metrics, neither Conformer-based model shows a consistent, statistically significant improvement over the baseline across speakers. The isolated significant differences all favor the baseline: the Base variant is significantly worse for several metric/speaker pairs, and the bi-LSTM variant has a significantly higher MCD for 04me.
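For readers unfamiliar with MCD, the sketch below implements the standard per-frame definition, $\frac{10}{\ln 10}\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$, and averages it over frames. It assumes that mel-cepstral coefficients have already been extracted and time-aligned; how the paper derived cepstra from its mel-spectrograms is not stated here, and the function names are hypothetical.

```python
import math
from typing import Sequence

# Constant factor of the standard MCD definition, in dB.
MCD_SCALE = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_mcd(ref: Sequence[float], pred: Sequence[float]) -> float:
    """MCD in dB for one frame of mel-cepstral coefficients."""
    if len(ref) != len(pred):
        raise ValueError("coefficient vectors must have equal length")
    squared_dist = sum((r - p) ** 2 for r, p in zip(ref, pred))
    return MCD_SCALE * math.sqrt(squared_dist)


def mean_mcd(ref_frames: Sequence[Sequence[float]],
             pred_frames: Sequence[Sequence[float]]) -> float:
    """Average per-frame MCD over an utterance (frames assumed aligned)."""
    if len(ref_frames) != len(pred_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of frames")
    per_frame = [frame_mcd(r, p) for r, p in zip(ref_frames, pred_frames)]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    # Toy coefficients, not data from the paper.
    ref = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    pred = [[0.8, 0.6, -0.1], [0.9, 0.4, -0.1]]
    print(f"{mean_mcd(ref, pred):.3f} dB")
```

Lower values indicate that the predicted spectral envelope is closer to the reference.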

5. Subjective Speech Naturalness Evaluation

Subjective perceptual evaluation used a MUSHRA listening test. Twenty utterances (five per speaker) were rated by listeners on a 0-100 scale, with the following average naturalness scores:

System	Avg. Score
Lower anchor (noisy)	~18
2D-CNN baseline	~49
Conformer Base	~49
Conformer + bi-LSTM	~54
Natural speech	~92

Paired Wilcoxon tests revealed:

bi-LSTM vs. baseline: significant improvement
Base vs. baseline: no significant difference
bi-LSTM vs. Base: bi-LSTM significantly better

The Conformer + bi-LSTM variant thus delivers a perceptually significant gain, despite lacking clear superiority in objective metrics.
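A paired Wilcoxon signed-rank test of the kind used above can be sketched in pure Python. This version uses the normal approximation for the two-sided p-value, without continuity or tie correction, so it is only indicative for small samples; the function name and the toy scores are hypothetical and are not the study's ratings.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float]:
    """Return (W, two-sided p) for paired samples via normal approximation."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


if __name__ == "__main__":
    # Toy MUSHRA-style scores for two systems on the same items.
    system_a = [55, 60, 48, 52, 58, 61, 50, 57]
    system_b = [49, 52, 47, 50, 51, 55, 49, 50]
    w, p = wilcoxon_signed_rank(system_a, system_b)
    print(f"W = {w}, p = {p:.4f}")
```

Because the test is paired, it compares the two systems' scores on the same items, which removes much of the variation between items from the comparison.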

6. Interpretive Comparison and Significance

The bi-LSTM integration provides additional sequence modeling capacity by incorporating both past and future context at the frame level, which smooths prediction errors—especially in silent/unvoiced frames and during formant transitions. While Conformer blocks are effective for local and global context extraction, the inclusion of bi-LSTM layers enforces temporal consistency, leading to fewer artifacts and improved perceptual quality.

This suggests that, in the ultrasound-to-speech setting, hybrid architectures combining MHSA-based local/global feature modeling with recurrent temporal aggregation may be especially valuable when the evaluation focus is perceptual rather than solely objective.

7. Discussion and Implications

The Conformer with bi-LSTM outperforms both the Conformer Base and 2D-CNN baselines in subjective measures without incurring excessive training cost. A plausible implication is that further perceptual gains may be achievable by combining attention-based and recurrent modules in sequence-to-sequence acoustic mapping domains characterized by sparse or ambiguous local structure, such as silent speech interfaces.

Objective metrics may be insufficient for detecting perceptual quality improvements introduced by temporally global modeling, highlighting the necessity for human-in-the-loop evaluation in such systems. Future work could investigate architectural variants, alternative regularizations, or domain-specific pretraining within this paradigm (Ibrahimov et al., 4 Jun 2025).

References (1)

Conformer-based Ultrasound-to-Speech Conversion (2025)
