Hardness of Joint Fine-Tuning the eGeMAPS Estimator with Enhancement Models

Determine whether jointly fine-tuning the eGeMAPS estimator together with the speech enhancement model (such as Demucs or FullSubNet) creates a harder optimization problem than fine-tuning the enhancement model while keeping the eGeMAPS estimator fixed, and clarify whether joint fine-tuning can add robustness to enhanced speech inputs.

Background

In the ablation study, the authors examined different configurations for pairing their differentiable eGeMAPS estimator with speech enhancement models. They pre-trained the VAE encoder for the estimator and explored whether to pre-train or jointly train the final linear layers during enhancement-model fine-tuning. Attempts to train these layers while fine-tuning the enhancement model did not converge.

Initial experiments indicated that fine-tuning the estimator parameters together with the enhancement model led to worse performance than keeping the estimator fixed. The authors hypothesized that joint fine-tuning could add robustness to enhanced speech inputs but conjectured that it introduces a more difficult optimization landscape. This leaves open the question of whether joint fine-tuning is intrinsically harder and under what conditions it may succeed or improve robustness.
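The two configurations can be contrasted with a deliberately simplified linear stand-in for the pipeline. This is an illustrative sketch, not the paper's actual setup: the "enhancement model" and "estimator" here are single matrices standing in for Demucs/FullSubNet and the VAE-based eGeMAPS estimator, and all shapes, names, and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy inputs, surrogate "clean" targets, surrogate eGeMAPS features.
x = rng.normal(size=(256, 8))            # noisy speech stand-in
clean = x @ rng.normal(size=(8, 8))      # clean-speech stand-in
v_true = rng.normal(size=(8, 4))
feats = clean @ v_true                   # target features of the clean signal

def train(freeze_estimator, steps=200, lr=1e-2):
    """Gradient descent on the feature-space loss ||x W V - feats||^2.

    freeze_estimator=True  -> update only W (enhancement model), V fixed,
                              mirroring the paper's better-performing setup.
    freeze_estimator=False -> joint fine-tuning of W and V.
    """
    w = np.eye(8) + 0.01 * rng.normal(size=(8, 8))   # enhancement params
    v = v_true + 0.5 * rng.normal(size=(8, 4))       # imperfect pre-trained estimator
    for _ in range(steps):
        enhanced = x @ w
        err = enhanced @ v - feats                   # feature-space residual
        w -= lr * (x.T @ (err @ v.T)) / len(x)       # always update enhancement
        if not freeze_estimator:
            v -= lr * (enhanced.T @ err) / len(x)    # joint update of estimator
    return float(np.mean((x @ w @ v - feats) ** 2))

frozen_loss = train(freeze_estimator=True)
joint_loss = train(freeze_estimator=False)
```

Even in this convex toy the two settings follow different trajectories: with the estimator frozen, the loss surface seen by the enhancement model is a fixed quadratic, whereas joint training makes the objective bilinear in (W, V), which is the kind of interaction the harder-optimization conjecture points at. In the real non-convex setting the difficulty is expected to be much more pronounced.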

References

We hypothesized that fine-tuning the estimator could add robustness to enhanced speech as input, but we conjecture that it creates a harder optimization problem.

Improving Speech Enhancement through Fine-Grained Speech Characteristics  (2207.00237 - Yang et al., 2022) in Section 4.4 (Ablation Study of eGeMAPS Estimator)