
Bridging discrete–continuous representation gaps for multimodal LLM integration

Develop methods that bridge the representational and performance differences between discrete audio tokenizers and continuous audio features, so that discrete audio tokens can be integrated effectively into multimodal large language models that require semantic richness.


Background

The paper observes that continuous features often outperform discrete tokens on speech-language understanding tasks that rely on fine-grained acoustic cues, while discrete tokens can be advantageous for autoregressive or masked generative modeling.

It emphasizes that these modality- and objective-dependent differences impede seamless use of discrete tokens in multimodal LLMs, motivating techniques to reconcile or unify discrete and continuous representations.
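To make the discrete–continuous gap concrete, the sketch below shows the core step of a discrete audio tokenizer: nearest-neighbor vector quantization, which maps a continuous frame-level feature onto the index of the closest codebook entry. This is a minimal, illustrative example with toy values, not code from the survey; the codebook, feature, and function names are assumptions. The quantization error between the feature and its reconstruction is exactly the fine-grained acoustic information that discrete tokens can lose.

```python
# Minimal sketch (illustrative, not from the survey): nearest-neighbor
# vector quantization, the step that turns a continuous audio feature
# into a discrete token id. Codebook and feature values are toy data.

def quantize(feature, codebook):
    """Return the index of the codeword nearest to the continuous feature."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4-entry toy codebook
feature = [0.9, 0.2]                      # continuous frame-level feature
token = quantize(feature, codebook)       # discrete token id usable by an LLM
reconstruction = codebook[token]          # all that survives quantization
```

Here `token` is `1` and the reconstruction `[1.0, 0.0]` differs from the original `[0.9, 0.2]`; that residual error illustrates why tasks relying on fine-grained acoustic cues tend to favor continuous features, while the discrete ids suit autoregressive modeling.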

References

Bridging these differences remains an open challenge for integrating audio tokenizers into multimodal LLMs that require semantic richness.

Discrete Audio Tokens: More Than a Survey! (2506.10274 - Mousavi et al., 12 Jun 2025) in Conclusion and Future Directions – Bullet point “Discrete vs. Continuous Representations”