Continuous Speech Tokens Enhance Robustness in LLM Multi-Modality Learning
The paper introduces Flow-Omni, a multimodal LLM that uses continuous speech tokens to address limitations of current discrete speech-token models. The work departs from conventional methods that rely on large-scale audio codecs and discrete tokenization. The authors aim to improve the robustness and efficiency of LLMs capable of real-time speech interaction with low streaming latency.
Core Contributions and Methodology
Flow-Omni incorporates several innovations that mark a departure from traditional discrete token methodologies:
- Continuous Mel-Spectrogram Representation: The model uses mel-spectrograms as an intermediate representation for audio, rather than discrete audio codes produced by an audio codec. This shift aims to reduce the representation loss inherent in discrete tokenization, particularly for unseen examples or varying input conditions such as high pitch or emotional intensity.
- Flow Matching Loss Integration: To predict continuous mel-spectrograms, Flow-Omni combines a flow matching loss with a pretrained autoregressive LLM, using a small MLP network on top of the LLM's hidden states. This design is meant to mitigate the shortcomings of discrete speech-token methods, which struggle with representation fidelity and generalization to novel conditions.
- Modality Integration and Training: The model merges discrete text tokens with continuous speech tokens, allowing for joint multimodal training. This results in more robust speech-to-speech interactions, as the system adapts better to the intrinsic quality variations within audio data.
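To make the flow-matching idea above concrete, here is a minimal sketch of a conditional flow matching loss over mel-spectrogram frames. The shapes, the linear interpolation path, and the `predict_velocity` callable (standing in for the paper's small MLP conditioned on LLM hidden states) are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(predict_velocity, x1, cond):
    """Conditional flow matching loss for one batch of mel frames.

    x1:   (batch, mel_dim) target mel-spectrogram frames
    cond: (batch, cond_dim) conditioning vectors (e.g. LLM hidden states)
    predict_velocity: callable (x_t, t, cond) -> (batch, mel_dim),
        standing in for the small MLP head in the paper.
    """
    batch = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(batch, 1))          # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight-line path
    v_target = x1 - x0                        # constant velocity along that path
    v_pred = predict_velocity(x_t, t, cond)
    return np.mean((v_pred - v_target) ** 2)  # simple MSE regression

# Toy usage: 4 frames of an 80-bin mel-spectrogram, 16-dim conditioning.
x1 = rng.standard_normal((4, 80))
cond = rng.standard_normal((4, 16))
loss = flow_matching_loss(lambda x, t, c: np.zeros_like(x), x1, cond)
```

At inference, the learned velocity field would be integrated from noise toward a mel frame (e.g. with a few Euler steps), which is how continuous frames replace discrete codec tokens in the decoding loop.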
Experimental Evaluation
The authors provide a comprehensive evaluation using datasets such as Aishell-1, WenetSpeech, and LCCC-base, configured to test Flow-Omni on speech understanding and TTS tasks. Comparative results against a Mini-Omni baseline show Flow-Omni's advantage in speech recognition accuracy, with a Word Error Rate (WER) of 8.81%, versus 10.84% for the discrete token-based Mini-Omni.
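For reference, WER as reported above is edit distance over word (or, for Mandarin sets like Aishell-1, typically character) sequences, divided by reference length. A minimal implementation, not taken from the paper:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance between word sequences,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

One substitution in a three-word reference yields a WER of 1/3, matching how the percentage figures above are computed.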
Implications and Future Direction
The advancements presented through Flow-Omni have significant implications for the development of multimodal LLMs. The continuous speech token framework enhances the flexibility of speech generation and reduces dependency on the intricate preprocessing pipelines typical of discrete token systems. From a practical standpoint, this can lead to more intuitive and natural user interactions across diverse linguistic and acoustic situations.
Theoretically, adopting continuous representations aligns with ongoing trends in AI that favor flexible, end-to-end systems capable of seamless interaction between modalities such as text, speech, and potentially vision. Moving forward, exploring continuous tokenization across broader datasets, incorporating more complex environmental variables, and refining the generative framework could further elevate the quality and applicability of multimodal LLMs. Integrating such models with existing smart assistant technologies, or applying them in novel domains such as real-time translation, could be areas of substantial development.
In conclusion, the presented work on Flow-Omni marks a significant stride in overcoming limitations inherent in discrete speech token-based multimodal models, underscoring the potential for refined speech processing capabilities in contemporary LLMs.