Effective Modeling of Critical Contextual Information for TDNN-based Speaker Verification
Abstract: The Time Delay Neural Network (TDNN) has become the mainstream architecture for speaker verification, and ECAPA-TDNN is one of the state-of-the-art models of this type. Existing work on improving TDNN focuses primarily on addressing its limitations in modeling global information and on bridging the gap between TDNN and 2-dimensional convolutions. However, the hierarchical convolutional structure in the SE-Res2Block proposed in ECAPA-TDNN cannot make full use of contextual information, which weakens ECAPA-TDNN's ability to model effective context dependencies. To this end, three improved architectures based on ECAPA-TDNN are proposed to fully and effectively extract multi-scale features with context dependence and then aggregate these features. Experimental results on VoxCeleb and CN-Celeb verify the effectiveness of the three proposed architectures. One of them achieves an Equal Error Rate nearly 23% lower than that of ECAPA-TDNN on the VoxCeleb1-O dataset, demonstrating competitive performance among current TDNN architectures at a comparable parameter count.
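The abstract reports its headline result as a reduction in Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a minimal illustration of how EER and a reduction relative to a baseline are commonly computed from verification trial scores; it is not the paper's evaluation code. The function names, the threshold sweep, and all scores and numbers are hypothetical and chosen only for illustration. In particular, reading the reported reduction as relative to the baseline is an assumption, since the abstract does not say.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection rate: genuine trials wrongly rejected.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to its baseline value."""
    return (baseline - improved) / baseline


# Hypothetical similarity scores, not taken from the paper.
target = [0.9, 0.8, 0.7, 0.4]      # same-speaker trials
nontarget = [0.6, 0.3, 0.2, 0.1]   # different-speaker trials
print(equal_error_rate(target, nontarget))

# Hypothetical EER values (in percent), not taken from the paper.
print(relative_reduction(2.0, 1.5))
```

Production toolkits usually interpolate between adjacent thresholds on the ROC curve rather than picking the nearest crossing, so their values can differ slightly from this sweep on small trial lists.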