Stateless prediction networks for quick turn-taking in SURT
Determine whether replacing an LSTM-based prediction network with a stateless Conv1D prediction network in the SURT recognition module improves handling of rapid speaker turn-taking (short inter-utterance pauses), thereby reducing word error rates under short-silence multi-talker conditions.
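To make the contrast concrete, here is a minimal sketch of what "stateless" means for a prediction network. It is a toy illustration, not the SURT implementation: the context size of two tokens, the embedding size, the vocabulary size, the random weights, and every name (`stateless_predict`, `EMBED`, `KERNEL`) are assumptions made for this example. The predictor applies a depthwise 1-D convolution to the embeddings of the last few emitted tokens, so nothing older than that window can influence its output.

```python
import random

CONTEXT_SIZE = 2  # assumed: number of previous tokens the predictor sees
EMBED_DIM = 4     # assumed toy embedding size
VOCAB_SIZE = 10   # assumed toy vocabulary; token 0 doubles as blank/padding

rng = random.Random(0)
# Toy embedding table: one vector per token id (random, untrained).
EMBED = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]
# Depthwise Conv1D kernel of width CONTEXT_SIZE: one weight per
# (context position, channel).
KERNEL = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
          for _ in range(CONTEXT_SIZE)]


def stateless_predict(history):
    """Return the predictor output given the tokens emitted so far.

    Only the last CONTEXT_SIZE tokens are used. Shorter histories are
    left-padded with token 0. No hidden state is carried between calls.
    """
    context = ([0] * CONTEXT_SIZE + list(history))[-CONTEXT_SIZE:]
    out = [0.0] * EMBED_DIM
    for pos, token in enumerate(context):
        for ch in range(EMBED_DIM):
            out[ch] += KERNEL[pos][ch] * EMBED[token][ch]
    return [max(0.0, v) for v in out]  # ReLU


# Tokens 3, 7, 1 stand in for the tail of a previous speaker's utterance.
after_turn_change = stateless_predict([3, 7, 1, 5, 2])
fresh_start = stateless_predict([5, 2])
assert after_turn_change == fresh_start
```

The final assertion shows the property behind the conjecture: once a new utterance has produced `CONTEXT_SIZE` tokens, the predictor output is identical whether or not another speaker's words came first. An LSTM-based predictor would instead carry a hidden state across the turn boundary. Whether this property actually lowers word error rates under short-silence conditions is the open question stated above.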
References
In addition to improving computational efficiency, we conjecture that a stateless network should also be better suited to handle quick turn-taking (sub-task (3) described earlier).
— Listening to Multi-talker Conversations: Modular and End-to-end Perspectives
(2402.08932 - Raj, 14 Feb 2024) in Chapter 6 (Streaming Unmixing and Recognition Transducers), Section “Network architecture” (Recognition module)