The Unveiling of SPIRIT-LM: A Multimodal Leap for LLMs
The field of LLMs has long been dominated by text-centric architectures that rely on the written word for understanding and generation. SPIRIT-LM marks a notable shift: it is a foundation multimodal LLM that integrates spoken and written language in both its training and its applications. Developed by a collaborative team from Meta AI, Inria (Paris), EHESS, ENS-PSL, and CNRS, the model represents a significant advance in how we approach language processing tasks.
Bridging Speech and Text
At the core of SPIRIT-LM's innovation is its ability to interleave speech and text data during training. This approach allows the model not just to understand but also to generate content across modalities, moving from text to speech and back within a single sequence. It is a leap forward from previous systems that treated speech and text separately, often stitching together piecemeal pipelines for tasks such as text-to-speech (TTS) synthesis and automatic speech recognition (ASR).
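To make the idea concrete, here is a minimal sketch of how interleaved training sequences could be assembled: aligned speech-text data is split at word boundaries, and the sequence switches between text tokens and discrete speech units, with each span preceded by a modality marker. The marker strings, unit format, and switching probability here are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: building one interleaved speech-text training sequence.
# Marker strings, unit formatting, and p_switch are hypothetical.
import random

TEXT_MARK = "[TEXT]"      # assumed modality marker for text spans
SPEECH_MARK = "[SPEECH]"  # assumed modality marker for speech spans

def interleave(words, speech_units_per_word, p_switch=0.3):
    """Alternate between text tokens and discrete speech units,
    switching modality at word boundaries with probability p_switch."""
    modality = random.choice(["text", "speech"])
    sequence = [TEXT_MARK if modality == "text" else SPEECH_MARK]
    for word, units in zip(words, speech_units_per_word):
        if random.random() < p_switch:
            modality = "speech" if modality == "text" else "text"
            sequence.append(SPEECH_MARK if modality == "speech" else TEXT_MARK)
        if modality == "text":
            sequence.append(word)  # would be BPE pieces in practice
        else:
            sequence.extend(f"[Hu{u}]" for u in units)  # discrete speech units
    return sequence

print(interleave(["the", "cat", "sat"], [[12, 87], [43], [5, 5, 91]]))
```

Training on such mixed sequences is what lets a single decoder continue a text prompt in speech, or a speech prompt in text, without a separate conversion stage.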
SPIRIT-LM comes in two versions: a base and an expressive variant. Both use subword Byte Pair Encoding (BPE) tokens for text and represent speech as discrete units obtained by clustering HuBERT representations (the HuBERT tokenizer). The expressive variant goes further by adding pitch and style tokens, allowing a finer degree of nuance in speech generation.
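The speech side of the tokenizer can be pictured as follows: frame-level representations from a pretrained HuBERT encoder are clustered with k-means, and each frame is then replaced by its cluster index. The sketch below stubs out the encoder with random features and uses a toy cluster count; it illustrates the clustering step only, not the released tokenizer.

```python
# Sketch: turning continuous speech features into discrete "HuBERT" units.
# The feature extractor is a placeholder for a real pretrained HuBERT model.
import numpy as np
from sklearn.cluster import KMeans

def extract_hubert_features(waveform):
    # Placeholder: a real pipeline would return one encoder vector
    # per ~20 ms frame of audio.
    rng = np.random.default_rng(len(waveform))
    return rng.normal(size=(len(waveform) // 320, 64))

# Fit the unit vocabulary on features pooled from a (toy) corpus;
# real systems typically use a few hundred clusters.
corpus = np.concatenate([extract_hubert_features(np.zeros(48000)) for _ in range(4)])
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(corpus)

# Tokenize an utterance: each frame becomes a cluster index, and
# consecutive duplicates are collapsed, as is common for unit LMs.
units = kmeans.predict(extract_hubert_features(np.zeros(16000)))
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```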
A Performance Overview
SPIRIT-LM's performance is commendable across a variety of comprehension and generation tasks. When evaluated against established benchmarks, it not only competes strongly with its predecessors but also sets new standards in some areas. Specifically, it excels in preserving the sentiment of prompts across modalities, a critical capability for maintaining coherence in generated content.
For instance, in the Speech-Text Sentiment Preservation task, the expressive version of SPIRIT-LM showed a marked ability to maintain the emotional tone of input prompts in its output, irrespective of the modality switch. This capacity for cross-modal sentiment preservation is a testament to the model’s nuanced understanding of language.
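A simplified version of that evaluation logic fits in a few lines: classify the sentiment of the prompt and of the generated continuation (transcribing speech first where needed) and report how often the two agree. The lexicon-based classifier below is a stand-in for the trained speech and text sentiment classifiers used in the actual benchmark.

```python
# Sketch of a sentiment-preservation check in the spirit of the
# Speech-Text Sentiment Preservation task. The classifier is a toy
# placeholder, not the paper's setup.
def sentiment(text):
    positive, negative = {"great", "happy", "love"}, {"awful", "sad", "hate"}
    words = set(text.lower().split())
    if words & positive:
        return "positive"
    if words & negative:
        return "negative"
    return "neutral"

def sentiment_preservation_rate(pairs):
    """pairs: list of (prompt_text, continuation_text)."""
    matches = sum(sentiment(p) == sentiment(c) for p, c in pairs)
    return matches / len(pairs)

demo = [("I love this sunny day", "It makes me so happy"),
        ("This is awful news", "I feel great anyway")]
print(sentiment_preservation_rate(demo))  # 0.5 on these toy pairs
```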
Addressing Added Toxicity
In line with responsible AI development practices, the paper also explores added toxicity detection. It's an essential consideration since LLMs can inadvertently amplify biases present in their training data. While SPIRIT-LM exhibits some degree of added toxicity, primarily when generating speech from speech prompts, its overall performance remains within acceptable bounds. Addressing this will be a focus of future improvements, underscoring the team's commitment to ethical AI development.
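One common way to operationalize "added toxicity" is to score both the prompt and the continuation with a toxicity classifier and flag cases where the continuation scores meaningfully higher. The keyword-based scorer and threshold below are placeholders for illustration; the actual evaluation relies on dedicated toxicity classifiers.

```python
# Sketch: flagging continuations that are more toxic than their prompts.
# toxicity_score and the threshold are illustrative stand-ins.
def toxicity_score(text):
    # Placeholder: a real system would call a trained toxicity classifier.
    toxic_terms = {"idiot", "stupid", "hate"}
    words = text.lower().split()
    return sum(w in toxic_terms for w in words) / max(len(words), 1)

def added_toxicity(prompt, continuation, threshold=0.1):
    delta = toxicity_score(continuation) - toxicity_score(prompt)
    return delta > threshold  # True if the model introduced toxicity

print(added_toxicity("tell me about your day", "what a stupid question"))  # True
```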
Future Directions and Impact
The introduction of SPIRIT-LM paves the way for a new generation of LLMs that understand and generate human language more holistically, accounting for both its spoken and written forms. This advancement holds promise for a variety of applications, from enhanced conversational AI and more accessible user interfaces to richer, more context-aware content generation.
As the model scales and is refined, its capabilities should continue to expand. The team behind SPIRIT-LM has also identified current limitations to address, such as optimizing its training procedure and extending its language coverage beyond English.
In summary, SPIRIT-LM represents a significant milestone in the pursuit of truly multimodal LLMs. Its ability to understand and bridge the nuances of spoken and written language offers exciting possibilities for the future of natural language processing and artificial intelligence at large.