Streamable causal architectures for discrete audio tokenizers

Develop causal, streamable architectures for discrete audio tokenizers that can operate in real time while maintaining high perceptual quality and computational efficiency, overcoming the current reliance of many self-supervised learning–based tokenizers on non-causal encoders.

Background

The paper defines streamability as the ability to process and generate audio in real time with minimal latency. It highlights two constraints: algorithmic latency (the look-ahead, or future context, that a model requires) and computational complexity, which matters for deployment on resource-constrained devices.
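To make the two constraints concrete, here is a minimal sketch, not taken from the paper, that estimates the look-ahead of a stack of 1D convolution layers and converts it to milliseconds. The `ConvLayer` type, the helper names, and the assumption that non-causal layers use symmetric padding are all hypothetical choices made for illustration. The estimate counts look-ahead only and ignores frame-buffering delay.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    """Hypothetical description of one 1D convolution layer."""
    kernel: int
    stride: int = 1
    dilation: int = 1
    causal: bool = True

    def lookahead_steps(self) -> int:
        """Future steps needed by this layer, measured at its own input rate."""
        if self.causal:
            return 0
        # Assumption: symmetric padding, so half the receptive span is in the future.
        return (self.kernel - 1) * self.dilation // 2


def algorithmic_latency_ms(layers, sample_rate):
    """Total look-ahead of the stack, in milliseconds of audio."""
    lookahead_samples = 0
    hop = 1  # raw audio samples per step at the current layer's input
    for layer in layers:
        lookahead_samples += layer.lookahead_steps() * hop
        hop *= layer.stride
    return 1000.0 * lookahead_samples / sample_rate


def real_time_factor(processing_seconds, audio_seconds):
    """Compute cost relative to audio duration; below 1.0 keeps up with real time."""
    return processing_seconds / audio_seconds


non_causal = [ConvLayer(7, stride=2, causal=False), ConvLayer(5, stride=2, causal=False)]
causal = [ConvLayer(7, stride=2), ConvLayer(5, stride=2)]
print(algorithmic_latency_ms(non_causal, 16000))
print(algorithmic_latency_ms(causal, 16000))
```

In this toy setting the fully causal stack has zero look-ahead, so any remaining delay comes from buffering and compute rather than from future context.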

It notes that many SSL-based tokenizers use non-causal encoders, which hinders real-time applications. Achieving low-latency, high-quality, and efficient causal designs is therefore identified as a key unmet need.
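The difference between causal and non-causal processing can be illustrated with a single convolution. The sketch below is an illustration under simplifying assumptions (one layer, zero padding, pure Python), not the architecture of any tokenizer discussed in the paper; `conv1d` and `StreamingCausalConv` are hypothetical names. It shows that left-only padding removes the dependence on future samples, and that a causal layer can run chunk by chunk by carrying a short history buffer.

```python
def conv1d(x, w, causal=True):
    """1D cross-correlation with output length equal to input length.

    causal=True pads only on the left, so y[t] depends on x[t-k+1..t].
    causal=False pads on both sides, so y[t] also depends on future samples.
    """
    k = len(w)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(w[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


class StreamingCausalConv:
    """Chunk-wise causal convolution that keeps the last k-1 samples as state."""

    def __init__(self, w):
        self.w = list(w)
        self.history = [0.0] * (len(w) - 1)

    def process(self, chunk):
        k = len(self.w)
        buf = self.history + list(chunk)
        out = [sum(self.w[j] * buf[t + j] for j in range(k)) for t in range(len(chunk))]
        # Keep only the trailing k-1 samples for the next call.
        self.history = buf[len(buf) - (k - 1):]
        return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.5, 0.25, 0.25]

offline = conv1d(x, w, causal=True)
stream = StreamingCausalConv(w)
streamed = []
for start in range(0, len(x), 2):  # feed two samples at a time
    streamed += stream.process(x[start:start + 2])
print(offline == streamed)
```

The streamed output matches the offline causal output because each output sample needs only the current chunk plus the stored history. The non-causal variant cannot be computed this way without waiting for future samples.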

References

Thus, achieving streamability with high-quality and efficient causal architectures remains an open research challenge.

Mousavi et al., "Discrete Audio Tokens: More Than a Survey!", arXiv:2506.10274, 12 Jun 2025. Section 2.5 (Streamability and Domain Categorization), Streamability paragraph.