Serialized Output Training by Learned Dominance (2407.03966v1)

Published 4 Jul 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either Permutation Invariant Training (PIT) or a time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, which autonomously identifies the factors that determine the order of speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
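To make the serialization idea concrete, here is a minimal, hedged sketch (not the authors' code) of how SOT-style training targets are built: per-speaker transcripts are joined with a speaker-change token, and the paper's learned dominance score decides the order. The `dominance_score` function below is a hypothetical stand-in that ranks by loudness, one of the factors the paper's analysis identifies; the actual module learns this ordering jointly with the recognizer.

```python
# Hedged illustration of SOT target serialization. The dominance score here
# is a hand-crafted proxy (loudness); in the paper it is produced by a
# learned auxiliary module inside the Attention Encoder-Decoder model.

def dominance_score(speaker):
    """Toy stand-in for the learned dominance score: louder speakers rank higher."""
    return speaker["loudness_db"]

def serialize_targets(speakers, sep="<sc>"):
    """Order transcripts by descending dominance and join them with a
    speaker-change token, yielding a single SOT-style target sequence."""
    ordered = sorted(speakers, key=dominance_score, reverse=True)
    return f" {sep} ".join(s["text"] for s in ordered)

# Example 2-mix: the louder speaker's transcript is emitted first.
mixture = [
    {"text": "hello there", "loudness_db": -23.0},
    {"text": "good morning", "loudness_db": -18.5},
]
print(serialize_targets(mixture))  # good morning <sc> hello there
```

This contrasts with the FIFO baseline, which would order transcripts by each speaker's start time regardless of dominance, and with PIT, which evaluates all permutations of the targets during training.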

Authors (5)
  1. Ying Shi (33 papers)
  2. Lantian Li (74 papers)
  3. Shi Yin (28 papers)
  4. Dong Wang (628 papers)
  5. Jiqing Han (26 papers)
Citations (3)