Stateless prediction networks for quick turn-taking in SURT

Determine whether replacing the LSTM-based prediction network in the SURT recognition module with a stateless Conv1D prediction network improves handling of rapid speaker turn-taking (short inter-utterance pauses), and thereby reduces word error rate in multi-talker conditions with short silences.

Background

The SURT recognition module can use either an LSTM-based or a stateless Conv1D prediction network. The authors hypothesize that a stateless decoder better accommodates the frequent context switching inherent in quick turn-taking.
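To make the term "stateless" concrete, the following is a minimal pure-Python sketch, not the SURT implementation. It assumes a stateless predictor built from a token embedding followed by a depthwise 1D convolution over the last few emitted tokens. The class name `StatelessPredictor`, the parameters `context_size` and `blank_id`, and the random weights are all hypothetical and chosen only for illustration. The point it shows is that the output depends only on the most recent `context_size` tokens, so anything emitted before that window (for example, the previous speaker's words) cannot influence the next prediction, whereas an LSTM carries a hidden state across the entire history.

```python
import random


class StatelessPredictor:
    """Toy stateless prediction network (illustrative assumption, not SURT code).

    Embeds the last `context_size` tokens and combines them with a depthwise
    1D convolution, i.e. one kernel of length `context_size` per channel.
    """

    def __init__(self, vocab_size, embed_dim, context_size=2, blank_id=0, seed=0):
        if context_size < 1:
            raise ValueError("context_size must be at least 1")
        rng = random.Random(seed)
        self.context_size = context_size
        self.blank_id = blank_id
        # Randomly initialised embedding table: vocab_size x embed_dim.
        self.embedding = [
            [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
            for _ in range(vocab_size)
        ]
        # Depthwise kernel: one weight per (channel, context position).
        self.kernel = [
            [rng.uniform(-1.0, 1.0) for _ in range(context_size)]
            for _ in range(embed_dim)
        ]

    def __call__(self, history):
        # Keep only the most recent tokens; left-pad short histories with blank.
        ctx = list(history)[-self.context_size:]
        ctx = [self.blank_id] * (self.context_size - len(ctx)) + ctx
        rows = [self.embedding[token] for token in ctx]
        # Each output channel is a weighted sum over the context positions.
        return [
            sum(w * rows[k][d] for k, w in enumerate(channel_kernel))
            for d, channel_kernel in enumerate(self.kernel)
        ]


predictor = StatelessPredictor(vocab_size=10, embed_dim=4, context_size=2)
# Two histories that differ only outside the two-token context window.
same_tail_a = predictor([5, 7, 3, 4])
same_tail_b = predictor([9, 1, 3, 4])
print(same_tail_a == same_tail_b)  # prints True
```

In this sketch the two histories end in the same two tokens, so the predictor returns identical outputs for both. An LSTM predictor fed the same two histories would generally produce different outputs, because its hidden state summarises every earlier token.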

Establishing whether stateless prediction networks systematically outperform LSTM predictors for rapid turn-taking would inform SURT architecture choices for conversational speech.

References

In addition to improving computational efficiency, we conjecture that a stateless network should also be better suited to handle quick turn-taking (sub-task (3) described earlier).

— Listening to Multi-talker Conversations: Modular and End-to-end Perspectives (2402.08932 - Raj, 2024) in Chapter 6 (Streaming Unmixing and Recognition Transducers), Section “Network architecture” (Recognition module)
