Contextualized inputs for the auxiliary speaker encoder
Determine whether feeding contextualized intermediate encoder representations (e.g., the first Zipformer block output) to the auxiliary speaker encoder improves token-synchronous speaker labeling in SURT relative to using shallow, non-contextual embeddings.
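To make the two candidate inputs concrete, here is a toy sketch in plain Python. It is not the SURT or Zipformer implementation: `frontend`, `context_block`, and `aux_inputs` are hypothetical names, and a moving average stands in for the first encoder block purely as an assumption for illustration. It shows only what "contextualized" means for the tap point: a change in one input frame stays local in the shallow embedding but spreads to neighboring frames after a context-mixing block.

```python
# Toy illustration only: hypothetical stand-ins, not the SURT/Zipformer code.

def frontend(frames):
    """Shallow embedding: each output depends only on its own input frame."""
    return [2.0 * x + 1.0 for x in frames]

def context_block(embeddings, radius=1):
    """Stand-in for a first encoder block (assumption): average each frame
    with its neighbours so that outputs carry local context."""
    n = len(embeddings)
    out = []
    for t in range(n):
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        window = embeddings[lo:hi]
        out.append(sum(window) / len(window))
    return out

def aux_inputs(frames, tap):
    """Return what the auxiliary speaker encoder would receive at a tap point."""
    shallow = frontend(frames)
    if tap == "shallow":
        return shallow
    if tap == "contextual":
        return context_block(shallow)
    raise ValueError(f"unknown tap: {tap}")

def changed_positions(a, b):
    """Indices at which two equal-length sequences differ."""
    return [t for t, (x, y) in enumerate(zip(a, b)) if x != y]

# Perturb a single frame and see which auxiliary inputs are affected.
base = [0.0, 0.0, 0.0, 0.0, 0.0]
perturbed = [0.0, 0.0, 1.0, 0.0, 0.0]

for tap in ("shallow", "contextual"):
    diff = changed_positions(aux_inputs(base, tap), aux_inputs(perturbed, tap))
    print(tap, diff)
```

With the shallow tap only the perturbed frame changes; with the contextual tap its neighbors change as well. Whether that extra context actually improves token-synchronous speaker labeling is the question posed here, and the sketch does not answer it.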
References
We conjecture that the input to the auxiliary encoder needs contextualized representations since speaker labels need to be synchronized across the two branches.
Source: Listening to Multi-talker Conversations: Modular and End-to-end Perspectives (arXiv:2402.08932, Raj, 14 Feb 2024), Chapter 7 (Speaker Attribution in the SURT Framework), Section “Auxiliary encoder position”