Insightful Overview of "EMOVA: Empowering LLMs to See, Hear, and Speak with Vivid Emotions"
The paper "EMOVA: Empowering LLMs to See, Hear, and Speak with Vivid Emotions" addresses the complex task of developing an end-to-end omni-modal LLM capable of processing and generating visual, textual, and speech data simultaneously. Named EMOVA, the model aims to bridge existing gaps in multi-modal understanding and interaction by incorporating emotional nuances in its spoken dialogues, a feature often overlooked in current models.
Introduction and Motivation
The paper tackles an inherent limitation of existing models: vision-language and speech-language LLMs each excel only in their own domain, while current omni-modal models typically rely on external text-to-speech tools for speech generation, which limits control over vocal style and emotion. Neither captures the full spectrum of human-like interaction, which involves seeing, hearing, and speaking with emotional depth. To this end, the authors introduce EMOVA, which couples a continuous vision encoder with a semantic-acoustic disentangled speech tokenizer, enabling the model to understand and generate highly contextualized, emotive spoken dialogue end to end.
Architectures and Methods
- Semantic-Acoustic Disentanglement: EMOVA employs a speech tokenizer that separates semantic content (what is said) from acoustic style (how it is said). This disentanglement yields a cleaner alignment between the speech and text modalities and improves performance across modalities (see the pipeline sketch after this list).
- Continuous Vision Encoder and Discrete Speech Tokenizer: The model uses InternViT-6B as its vision encoder, which extracts continuous visual features. For speech processing, a Speech-to-Unit (S2U) tokenizer coupled with a Unit-to-Speech (U2S) detokenizer is used. The S2U tokenizer converts speech into discrete units suitable for LLM input, whereas the U2S detokenizer reconstructs speech with emotional nuances based on style embeddings.
- Text-Centric Omni-Modal Alignment: By leveraging publicly available bi-modal (image-text and speech-text) datasets, EMOVA achieves omni-modal alignment without scarce omni-modal data. The text modality serves as the bridge connecting vision and speech, so the model learns to handle all three modalities without needing image-speech-text triples (a minimal data-construction sketch also follows this list).
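To make the pipeline concrete, below is a minimal PyTorch sketch of the semantic-acoustic split described above. The module names, dimensions, and frame layout are illustrative assumptions, not the authors' implementation: a stand-in S2U tokenizer quantizes speech frames into discrete semantic units, and a stand-in U2S detokenizer resynthesizes frames from those units plus a separate style embedding carrying the acoustic/emotional information.

```python
# Hypothetical sketch of a semantic-acoustic disentangled speech pipeline.
# Module names and dimensions are illustrative assumptions, not EMOVA's code.
import torch
import torch.nn as nn

class SpeechToUnit(nn.Module):
    """Stand-in S2U tokenizer: maps speech frames to discrete semantic units."""
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.encoder = nn.Linear(160, dim)  # toy frame encoder (160-sample frames)

    def forward(self, wav):                       # wav: (frames, 160)
        feats = self.encoder(wav)                 # (frames, dim)
        dists = torch.cdist(feats, self.codebook) # distance to each codebook entry
        return dists.argmin(dim=-1)               # discrete unit ids, (frames,)

class UnitToSpeech(nn.Module):
    """Stand-in U2S detokenizer: units + style embedding -> waveform frames."""
    def __init__(self, codebook_size=1024, dim=64, style_dim=16):
        super().__init__()
        self.unit_emb = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Linear(dim + style_dim, 160)

    def forward(self, units, style):              # style: (style_dim,), e.g. emotion
        x = self.unit_emb(units)                  # (frames, dim)
        style = style.expand(x.size(0), -1)       # broadcast style over frames
        return self.decoder(torch.cat([x, style], dim=-1))  # (frames, 160)

# Toy usage: semantic content and acoustic style travel on separate paths.
s2u, u2s = SpeechToUnit(), UnitToSpeech()
wav = torch.randn(50, 160)              # ~0.5 s of 16 kHz audio, framed
units = s2u(wav)                        # semantic content ("what is said")
happy_style = torch.randn(16)           # acoustic style ("how it is said")
resynth = u2s(units, happy_style)
print(units.shape, resynth.shape)
```

In the full model, the LLM sits between these two ends: it consumes the discrete speech units alongside continuous vision features and text tokens, and emits output units plus a style label that conditions the detokenizer.

The text-centric alignment can likewise be illustrated with a small, hypothetical data-construction sketch. The placeholder tokens (`<image>`, `<speech>`) and field names below are assumptions chosen for illustration; the point is that image-text and speech-text pairs are both cast into the same text-token interface, so no tri-modal data is required.

```python
# Hypothetical illustration of text-centric omni-modal alignment: bi-modal
# image-text and speech-text pairs are both expressed as text-token sequences
# around modality placeholders. Placeholder tokens and field names are assumed.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    prompt_tokens: List[str]   # tokens the LLM conditions on
    target_tokens: List[str]   # text tokens the LLM learns to predict

def from_image_text(caption: str) -> Sample:
    # continuous vision features are injected where the <image> placeholder sits
    return Sample(["<image>", "Describe", "the", "image."], caption.split())

def from_speech_text(transcript: str) -> Sample:
    # discrete speech units are injected where the <speech> placeholder sits
    return Sample(["<speech>", "Transcribe", "the", "audio."], transcript.split())

# Both corpora share the same text interface, so one training mixture aligns
# vision->text and speech->text without any image+speech+text triples.
mixture = [from_image_text("a dog on a beach"),
           from_speech_text("hello how are you")]
for s in mixture:
    print(s.prompt_tokens, "->", s.target_tokens)
```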
Experimental Results and Numerical Findings
The paper presents extensive evaluations highlighting EMOVA's strong performance across both vision-language and speech benchmarks. Three observations stand out:
- Mutual Enhancement Across Modalities: Integrating speech data with vision-language data proved beneficial, with joint training outperforming sequential, stage-by-stage training. This suggests that the modalities complement each other and improve the model's generalization (a schematic comparison of the two schedules follows this list).
- Effectiveness of Semantic-Acoustic Disentanglement: Models using disentangled representations showed notable improvements over those using entangled ones, both in vision-language and speech tasks.
- State-of-the-Art Performance: EMOVA sets new state-of-the-art results on both vision-language and speech benchmarks. For instance, on LibriSpeech ASR (Automatic Speech Recognition) it achieves a Word Error Rate (WER) of 4.0, compared to 8.1 for the closest omni-modal competitor, VITA (a minimal WER computation follows this list).
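To illustrate the joint-versus-sequential observation above, the following sketch contrasts the two training schedules under simplifying assumptions; the loader iterators and the `train_step` function are stand-ins, not the paper's training code.

```python
# Hypothetical contrast between sequential and joint omni-modal training
# schedules. Loaders are iterators of batches; train_step is a stand-in.
import itertools
import random

def train_step(batch):
    # placeholder for one optimizer update on a (prompt, target) batch
    pass

def sequential(vision_text_loader, speech_text_loader, steps_per_stage):
    # stage 1: vision-language data only, then stage 2: speech-language data only
    for batch in itertools.islice(vision_text_loader, steps_per_stage):
        train_step(batch)
    for batch in itertools.islice(speech_text_loader, steps_per_stage):
        train_step(batch)

def joint(vision_text_loader, speech_text_loader, steps):
    # single stage: each step samples a batch from a randomly chosen modality,
    # letting vision-language and speech-language data regularize each other
    loaders = [vision_text_loader, speech_text_loader]
    for _ in range(steps):
        train_step(next(random.choice(loaders)))

# toy usage with iterators of dummy batches
sequential(iter(range(100)), iter(range(100)), steps_per_stage=10)
joint(iter(range(100)), iter(range(100)), steps=10)
```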
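For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and reference transcripts, divided by the reference length and reported as a percentage; a WER of 4.0 therefore means roughly 4 errors per 100 reference words. A self-contained computation:

```python
# Word error rate: word-level edit distance divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 33.3
```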
Implications and Future Directions
The development of EMOVA has substantial implications for the future of AI-driven human-computer interaction. By effectively integrating and aligning multiple modalities with emotional capabilities, EMOVA paves the way for more natural and interactive AI systems. Its ability to understand and generate emotionally nuanced spoken dialogue can significantly enhance applications ranging from virtual assistants to interactive educational tools and beyond.
Future Developments:
- Direct Unit-to-Unit Generation: Future work could explore direct speech unit generation without text mediation, further streamlining speech synthesis and enhancing real-time interaction capabilities.
- Duplex Modeling: Incorporating duplex communication, where the model can simultaneously process incoming data while generating output, can improve real-world applicability, particularly in environments requiring real-time responses.
- Extended Vision Configurations: EMOVA could also benefit from integrating multiple vision encoders learned through diverse pre-training objectives. Extending its visual generation capabilities would further enrich its interaction potential.
In conclusion, this paper presents a significant advancement in the field of omni-modal LLMs by seamlessly integrating visual, textual, and emotional speech understanding and generation within a single model. EMOVA's ability to align and leverage multimodal data presents new opportunities for developing AI systems that are more attuned to human interaction paradigms.