
Conformer with bi-LSTM for Ultrasound-to-Speech

Updated 13 May 2026
  • The paper demonstrates that integrating bi-LSTM layers after Conformer blocks improves perceptual speech naturalness in ultrasound-to-speech conversion.
  • It outlines an architecture with stacked Conformer blocks followed by bi-LSTM layers, enabling effective local/global feature extraction and temporal context enforcement.
  • The study shows that despite similar objective metrics, the bi-LSTM integration yields significantly higher subjective ratings, emphasizing its practical benefits.

Conformer with bi-LSTM refers to a hybrid neural architecture integrating Conformer blocks with stacked bidirectional Long Short-Term Memory (bi-LSTM) layers. In the context of ultrasound-to-speech conversion—the mapping from sequences of tongue ultrasound scan-lines to speech representations—this architecture offers competitive objective performance with improvements in perceived naturalness measured by subjective metrics. The approach leverages the representational power of Conformer modules for local/global feature modeling and the temporal context integration of bi-LSTMs to enforce sequence consistency, particularly in segments where acoustic cues are subtle or temporally dispersed (Ibrahimov et al., 4 Jun 2025).

1. Architectural Composition

The Conformer with bi-LSTM model processes input ultrasound sequences of shape $(T, 64)$, where $T$ is the temporal dimension (each frame corresponding to a 12 ms window). The architecture consists of:

  • An input linear projection
  • Four stacked Conformer blocks, each placing a Multi-Head Self-Attention (MHSA) layer ($d_\mathrm{model}=256$, $32$ heads) and a depthwise convolution ($\mathrm{kernel\_size}=31$) between Feed-Forward modules, with layer normalization and residual connections
  • A subsequent LayerNorm layer
  • Two bi-directional LSTM layers (parameter count implies $256$–$320$ units per direction)
  • A final linear layer projecting each frame to an $80$-dimensional mel-spectrogram

The encoder-decoder pseudocode is:

$$\begin{aligned}
\mathbf{X}_0 &= \mathrm{Linear}_\mathrm{in}(\mathrm{UTIF}) \\
\mathbf{X}_i &= \mathrm{ConformerBlock}(\mathbf{X}_{i-1}), \quad i = 1 \dots 4 \\
\mathbf{H} &= \mathrm{LayerNorm}(\mathbf{X}_4) \\
\mathbf{H}' &= \mathrm{BiLSTM}_2(\mathbf{H}) \\
\widehat{\mathbf{Y}}_t &= \mathrm{Linear}_\mathrm{out}(\mathbf{H}'_t), \quad t = 1 \dots T
\end{aligned}$$

where each ConformerBlock applies an expansion factor of $3$ in its feed-forward module.

The key architectural distinction is the introduction of the stacked bi-LSTM layers after the final Conformer block, as compared to a Conformer-only "Base" variant.
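The composition described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the class names `ConformerBlock` and `ConformerBiLSTM` are hypothetical, the block follows the commonly used Conformer layout (half-step feed-forward, self-attention, convolution module, half-step feed-forward), and choices the text does not specify (SiLU activations, batch normalization inside the convolution module, no dropout, `lstm_units=256`) are assumptions.

```python
import torch
from torch import nn


class ConformerBlock(nn.Module):
    """Compact Conformer-style block: FFN, MHSA, conv module, FFN (assumed layout)."""

    def __init__(self, d_model=256, heads=32, kernel_size=31, expansion=3):
        super().__init__()

        def ffn():
            # Feed-forward module with the expansion factor of 3 from the text.
            return nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, expansion * d_model),
                nn.SiLU(),  # assumption: activation is not stated in the text
                nn.Linear(expansion * d_model, d_model),
            )

        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),  # pointwise
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model),  # assumption
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),  # pointwise
        )
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, T, d_model)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, T)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)


class ConformerBiLSTM(nn.Module):
    """Linear_in -> 4 Conformer blocks -> LayerNorm -> 2-layer bi-LSTM -> Linear_out."""

    def __init__(self, in_dim=64, d_model=256, n_blocks=4, lstm_units=256, n_mels=80):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, d_model)
        self.blocks = nn.Sequential(*[ConformerBlock(d_model) for _ in range(n_blocks)])
        self.norm = nn.LayerNorm(d_model)
        self.bilstm = nn.LSTM(d_model, lstm_units, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.proj_out = nn.Linear(2 * lstm_units, n_mels)

    def forward(self, utif):  # utif: (batch, T, 64)
        h = self.norm(self.blocks(self.proj_in(utif)))
        h, _ = self.bilstm(h)  # (batch, T, 2 * lstm_units)
        return self.proj_out(h)  # (batch, T, 80)


if __name__ == "__main__":
    model = ConformerBiLSTM().eval()
    with torch.no_grad():
        mel = model(torch.randn(2, 50, 64))  # two random 50-frame sequences
    print(mel.shape)
```

Dropping the `bilstm` layer and feeding `self.norm(...)` directly into a `Linear(d_model, n_mels)` output layer would give a Conformer-only counterpart in the spirit of the "Base" variant.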

2. Training Setup and Data Processing

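Training employs a mean squared error (MSE) objective between the predicted and reference $80$-bin mel-spectrogram frames:

$$\mathcal{L}_\mathrm{MSE} = \frac{1}{80\,T} \sum_{t=1}^{T} \lVert \widehat{\mathbf{Y}}_t - \mathbf{Y}_t \rVert_2^2$$

where $\mathbf{Y}_t \in \mathbb{R}^{80}$ is the reference mel-spectrogram frame at time $t$ and $\widehat{\mathbf{Y}}_t$ is the corresponding prediction.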

Optimization uses AdamW with cosine decay restarts: the learning rate starts from an initial value, each subsequent cycle is longer than the previous one, the maximum learning rate decays by a fixed factor per cycle, and a minimum learning rate acts as a floor. Early stopping is based on dev-set MSE with a fixed patience, up to a maximum number of epochs.
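A cosine schedule with restarts of the kind described above can be sketched in plain Python. The function name `cosine_restart_lr`, its parameter names, and every numeric value in the usage lines are hypothetical placeholders chosen for illustration; they are not the paper's hyperparameters.

```python
import math


def cosine_restart_lr(step, base_lr, first_cycle_steps, t_mul, m_mul, min_lr):
    """Learning rate at `step` under cosine decay with warm restarts.

    t_mul: factor by which each cycle is longer than the previous one.
    m_mul: factor by which the peak learning rate shrinks at each restart.
    """
    cycle_len, cycle_start, peak = first_cycle_steps, 0, base_lr
    # Walk forward until we reach the cycle that contains `step`.
    while step >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len = max(1, int(cycle_len * t_mul))
        peak *= m_mul
    progress = (step - cycle_start) / cycle_len  # in [0, 1)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Never go below the floor, even once the decayed peak falls under it.
    return min_lr + max(peak - min_lr, 0.0) * cosine


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    for s in (0, 50, 99, 100, 299, 300):
        lr = cosine_restart_lr(s, base_lr=1e-3, first_cycle_steps=100,
                               t_mul=2.0, m_mul=0.5, min_lr=1e-5)
        print(s, f"{lr:.6f}")
```

With these placeholder settings the rate restarts at steps 100 and 300, each time from half the previous peak, which is the behavior the schedule description implies.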

Data from the Ultrasuite-Tal80 corpus are pre-processed as follows: ultrasound scanline images are downsampled with bicubic interpolation, pixel values are normalized to a fixed range, and target mel-spectrograms are extracted using a Hann window.

3. Model Complexity and Training Efficiency

Parameter and compute cost comparisons are summarized below:

| Model | Parameters | Relative Training Time |
|---|---|---|
| 2D-CNN baseline | 4.09 M | 100% |
| Conformer Base | 2.66 M | 30% |
| Conformer + bi-LSTM | 5.35 M | 80% |

The Conformer Base trains in $30\%$ of the CNN baseline's time (roughly $3.3\times$ faster) with about $35\%$ fewer parameters ($2.66$ M vs. $4.09$ M). The bi-LSTM variant, while larger at $5.35$ M parameters, still trains in $80\%$ of the baseline's time.

A plausible implication is that although adding bi-LSTM increases model size, efficiency remains acceptable for practical development cycles.
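The size gap between the two Conformer variants can be sanity-checked against the usual LSTM parameter count. The sketch below assumes a PyTorch-style parameterization (four gates, two bias vectors per gate set), an LSTM input width of $256$, and a hypothetical helper name `bilstm_params`; it ignores the change in the output layer, so it is only a rough check.

```python
def bilstm_params(input_size, hidden, layers=2):
    """Parameter count of a stacked bidirectional LSTM (PyTorch-style, assumed)."""
    total = 0
    layer_in = input_size
    for _ in range(layers):
        # Per direction: 4 gates, each with input weights, recurrent weights, 2 biases.
        per_direction = 4 * (layer_in * hidden + hidden * hidden + 2 * hidden)
        total += 2 * per_direction
        layer_in = 2 * hidden  # next layer sees both directions concatenated
    return total


if __name__ == "__main__":
    gap = 5.35e6 - 2.66e6  # reported difference between the two Conformer variants
    for units in (256, 320):
        n = bilstm_params(256, units)
        print(f"{units} units/direction: {n / 1e6:.2f} M (reported gap {gap / 1e6:.2f} M)")
```

Under these assumptions, $256$ units per direction gives about $2.63$ M added parameters and $320$ units gives about $3.94$ M, so the lower end of the stated range sits closest to the reported $2.69$ M difference.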

4. Objective Performance Metrics

Objective evaluations used MSE and Mel-Cepstral Distortion (MCD, in dB), computed across four speakers on the test set:

Mean Squared Error (MSE)

| Model | 01fi | 02fe | 03mn | 04me |
|---|---|---|---|---|
| 2D-CNN baseline | 0.464 | 0.623 | 0.395 | 0.484 |
| Conformer Base | 0.511 | 0.618 | 0.462* | 0.524 |
| Conformer + bi-LSTM | 0.482 | 0.581 | 0.378 | 0.449 |

*Statistically significant difference for 03mn (Conformer Base vs. baseline).

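Mel-Cepstral Distortion (MCD, dB)

In its standard form, MCD for a single frame is

$$\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left(c_d - \hat{c}_d\right)^2}$$

where $c_d$ and $\hat{c}_d$ are the $d$-th mel-cepstral coefficients of the reference and predicted frames; the per-frame values are averaged over all frames.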

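| Model | 01fi | 02fe | 03mn | 04me |
|---|---|---|---|---|
| 2D-CNN baseline | 3.221 | 3.009 | 3.641 | 3.172 |
| Conformer Base | 3.517* | 3.121 | 4.133* | 3.465* |
| Conformer + bi-LSTM | 3.253 | 3.037 | 3.704 | 3.258* |

*Entries marked with an asterisk denote a statistically significant difference vs. baseline.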

Across both metrics, neither Conformer-based model shows consistent, significant improvements over the baseline for all speakers. Isolated differences are observed; notably, the Base variant performs worse than the baseline for certain metric and speaker combinations.
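Both objective metrics are simple to compute once frames are aligned. The sketch below is a plain-Python illustration with hypothetical function names `frame_mse` and `mel_cepstral_distortion`. It uses the standard MCD definition and assumes that mel-cepstral coefficients have already been extracted (conventionally without the energy coefficient); it is not the evaluation code behind the tables.

```python
import math


def frame_mse(pred, ref):
    """Mean squared error over all frames and bins of two equal-shape sequences."""
    count = sum(len(frame) for frame in ref)
    sq = sum((p - r) ** 2
             for pf, rf in zip(pred, ref)
             for p, r in zip(pf, rf))
    return sq / count


def mel_cepstral_distortion(pred_mcep, ref_mcep):
    """Average MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((p - r) ** 2 for p, r in zip(pf, rf)))
        for pf, rf in zip(pred_mcep, ref_mcep)
    ]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    # Toy two-frame, two-coefficient example (made-up numbers).
    ref = [[1.0, 0.0], [0.5, 0.5]]
    pred = [[0.0, 0.0], [0.5, 0.5]]
    print("MSE:", frame_mse(pred, ref))
    print("MCD (dB):", mel_cepstral_distortion(pred, ref))
```

Because both metrics average per-frame errors, a model that makes small, temporally inconsistent errors can score similarly to one that makes smoother errors, which is one way objective scores and listener ratings can diverge.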

5. Subjective Speech Naturalness Evaluation

Subjective perceptual evaluation used a MUSHRA listening test. Twenty utterances (five per speaker) were rated by listeners on a 0-100 scale, with the following average naturalness scores:

| System | Avg. Score |
|---|---|
| Lower anchor (noisy) | ~18 |
| 2D-CNN baseline | ~49 |
| Conformer Base | ~49 |
| Conformer + bi-LSTM | ~54 |
| Natural speech | ~92 |

Paired Wilcoxon tests revealed:

  • bi-LSTM vs. baseline: significant improvement
  • Base vs. baseline: no significant difference
  • bi-LSTM vs. Base: bi-LSTM better

The Conformer + bi-LSTM variant thus delivers a perceptually significant gain, despite lacking clear superiority in objective metrics.
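A paired Wilcoxon signed-rank comparison of the kind reported above can be sketched in plain Python. The scores below are made-up placeholders, not listening-test data, and the p-value uses the large-sample normal approximation without tie or continuity correction; in practice a statistics library routine such as `scipy.stats.wilcoxon` would normally be used. The helper names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`, with tied entries sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(a, b):
    """Return (W_plus, W_minus, two-sided p) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return w_plus, w_minus, p


if __name__ == "__main__":
    # Hypothetical per-utterance ratings for two systems (illustration only).
    system_a = [60, 55, 70, 65, 58, 62]
    system_b = [50, 52, 60, 54, 49, 57]
    print(wilcoxon_signed_rank(system_a, system_b))
```

The test is paired because each listener rates every system on the same utterance, so only the within-pair differences carry information about which system is preferred.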

6. Interpretive Comparison and Significance

The bi-LSTM integration provides additional sequence modeling capacity by incorporating both past and future context at the frame level, which smooths prediction errors—especially in silent/unvoiced frames and during formant transitions. While Conformer blocks are effective for local and global context extraction, the inclusion of bi-LSTM layers enforces temporal consistency, leading to fewer artifacts and improved perceptual quality.

This suggests that, in the ultrasound-to-speech setting, hybrid architectures combining MHSA-based local/global feature modeling with recurrent temporal aggregation may be especially valuable when the evaluation focus is perceptual rather than solely objective.

7. Discussion and Implications

The Conformer with bi-LSTM outperforms both the Conformer Base and 2D-CNN baselines in subjective measures without incurring excessive training cost. A plausible implication is that further perceptual gains may be achievable by combining attention-based and recurrent modules in sequence-to-sequence acoustic mapping domains characterized by sparse or ambiguous local structure, such as silent speech interfaces.

Objective metrics may be insufficient for detecting perceptual quality improvements introduced by temporally global modeling, highlighting the necessity for human-in-the-loop evaluation in such systems. Future work could investigate architectural variants, alternative regularizations, or domain-specific pretraining within this paradigm (Ibrahimov et al., 4 Jun 2025).

References (1)
