Contextualized inputs for the auxiliary speaker encoder

Determine whether feeding contextualized intermediate encoder representations (e.g., the first Zipformer block output) to the auxiliary speaker encoder improves token-synchronous speaker labeling in SURT relative to using shallow, non-contextual embeddings.

Background

Ablations varying the input layer to the auxiliary speaker encoder showed the best cpWER and WDER when using an early contextualized representation rather than raw convolutional embeddings or deeper layers. The authors hypothesize that contextualized input is needed to synchronize speaker labels with ASR tokens across branches.

Verifying this conjecture would guide how to tap the main encoder for optimal speaker-branch inputs.
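The tapping choice under discussion can be illustrated with a minimal sketch. This is not the SURT or Zipformer implementation: the block function, layer sizes, and the moving-average "context" mixing are all hypothetical stand-ins, chosen only to show the auxiliary speaker branch consuming the first block's contextualized output rather than the non-contextual front-end embeddings or a deeper layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_block(x, w):
    # Hypothetical stand-in for an encoder block: adjacent-frame mixing
    # (a crude proxy for self-attention context) followed by a linear map.
    ctx = (x + np.roll(x, 1, axis=0)) / 2.0
    return np.tanh(ctx @ w)

T, D, S = 10, 8, 4                              # frames, model dim, speakers (toy sizes)
conv_embed = rng.normal(size=(T, D))            # non-contextual conv front-end output
w1 = rng.normal(size=(D, D))
w2 = rng.normal(size=(D, D))

h1 = toy_block(conv_embed, w1)                  # first block: early contextualized repr.
h2 = toy_block(h1, w2)                          # deeper layer

# Per the ablation's finding, the auxiliary speaker encoder taps h1
# (early, contextualized) instead of conv_embed (no context) or h2 (deeper).
w_spk = rng.normal(size=(D, S))
spk_logits = h1 @ w_spk                         # frame-wise speaker logits (toy)
print(spk_logits.shape)                         # (10, 4)
```

In a real system the tap point would be a forward hook or an explicit return of the chosen block's output, and the speaker logits would be aligned to ASR token emissions rather than raw frames.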

References

We conjecture that the input to the auxiliary encoder needs contextualized representations since speaker labels need to be synchronized across the two branches.

Listening to Multi-talker Conversations: Modular and End-to-end Perspectives (2402.08932 - Raj, 14 Feb 2024) in Chapter 7 (Speaker Attribution in the SURT Framework), Section “Auxiliary encoder position”