Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations (2206.08174v1)
Abstract: Target speech extraction is a technique for extracting a target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the target speaker's voice. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., mismatch between the characteristics of the target speech and those of an enrollment utterance. While most conventional approaches focus on improving average performance given a set of enrollment utterances, here we propose to guarantee the worst-case performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure robustness to enrollment variations. We also introduce a novel training scheme that aims to directly optimize the worst-case performance by focusing training on difficult enrollment cases where extraction does not perform well. In addition, we investigate the effectiveness of an auxiliary speaker identification loss (SI-loss) as another way to improve robustness to enrollment variations. Experiments show that both worst-enrollment target training and SI-loss training improve robustness to enrollment variations by increasing speaker discriminability.
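To make the contrast between average and worst-case evaluation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that worst-enrollment SDR for one mixture is the minimum SDR over the extraction results obtained with each available enrollment utterance, and it uses a simplified SDR (signal energy over error energy) rather than a full BSS-eval SDR. The function names `sdr_db`, `worst_enrollment_sdr`, and `average_enrollment_sdr` are hypothetical, and the signals are toy values chosen only for illustration.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: reference energy divided by error energy.

    Assumption: this is a plain energy ratio, not the BSS-eval SDR.
    """
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / error)


def worst_enrollment_sdr(reference, estimates_per_enrollment):
    """Minimum SDR over the estimates produced with each enrollment utterance.

    Assumption: 'worst-enrollment' is read as the minimum over enrollments.
    """
    return min(sdr_db(reference, est) for est in estimates_per_enrollment)


def average_enrollment_sdr(reference, estimates_per_enrollment):
    """Mean SDR over the same estimates, for comparison."""
    scores = [sdr_db(reference, est) for est in estimates_per_enrollment]
    return sum(scores) / len(scores)


# Toy example: one target signal, extracted with two different enrollments.
reference = [1.0, -1.0, 1.0, -1.0]
estimates = [
    [0.9, -0.9, 0.9, -0.9],  # extraction with a well-matched enrollment
    [0.5, -0.5, 0.5, -0.5],  # extraction with a poorly matched enrollment
]

print("average SDR (dB):", average_enrollment_sdr(reference, estimates))
print("worst-enrollment SDR (dB):", worst_enrollment_sdr(reference, estimates))
```

In this toy case the average is pulled up by the well-matched enrollment, while the worst-enrollment score reflects only the poorly matched one, which is the failure case the abstract argues matters in practice.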