Overview of LLaSM: Large Language and Speech Model
The paper presents LLaSM, a large language and speech model designed to address a limitation of current multi-modal models: the absence of speech as a first-class interaction modality. By training the model to follow combined speech and language instructions, the authors aim for a more natural way of interacting with artificial intelligence systems.
Methodology
The authors introduce LLaSM as an end-to-end trained, multi-modal model with cross-modal conversational abilities. The architecture pairs a pre-trained speech encoder (Whisper) with an LLM. A modal adaptor projects the speech embeddings into the LLM's text-embedding space so that speech and text can be processed together as a single interleaved sequence, which is what gives the model its cross-modal capabilities.
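A minimal sketch can make this pipeline concrete. The class name, the choice of a single linear layer for the adaptor, and the embedding dimensions (1280 for a Whisper-style encoder, 4096 for a LLaMA-style LLM) are illustrative assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn as nn

class SpeechLLM(nn.Module):
    """Minimal sketch: frozen speech encoder, trainable modal adaptor,
    and an LLM that consumes interleaved speech/text embeddings."""

    def __init__(self, speech_encoder, llm, speech_dim=1280, text_dim=4096):
        super().__init__()
        self.speech_encoder = speech_encoder            # e.g. a Whisper encoder (kept frozen)
        self.adaptor = nn.Linear(speech_dim, text_dim)  # modal adaptor, assumed here to be linear
        self.llm = llm                                  # decoder-only language model

    def forward(self, audio_features, prefix_embeds, suffix_embeds):
        # Encode the speech without updating the encoder.
        with torch.no_grad():
            # Assumes an encoder whose output exposes .last_hidden_state,
            # as Hugging Face Whisper encoders do.
            speech_states = self.speech_encoder(audio_features).last_hidden_state
        # Project speech states into the LLM's text-embedding space.
        speech_embeds = self.adaptor(speech_states)
        # Interleave: [instruction prefix][speech embeddings][instruction suffix].
        inputs_embeds = torch.cat([prefix_embeds, speech_embeds, suffix_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Because the adaptor is the only new component, aligning the two modalities amounts to learning a single projection, which is what keeps the first training stage lightweight.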
The training process is divided into two stages:
- Modality Adaptation Pre-training: This phase involves freezing the speech encoder and LLM, focusing on training the modal adaptor using automatic speech recognition (ASR) datasets. The objective is to align text and audio embeddings without intensive resource consumption, as only the adaptor's parameters are updated.
- Cross-modal Instruction Fine-tuning: During this stage, the modal adaptor and the LLM are fine-tuned on cross-modal instruction data, equipping the model to follow instructions and hold conversations that mix speech and text (a parameter-freezing sketch covering both stages follows this list).
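The freezing pattern implied by the two stages can be expressed as a small helper. It assumes a model object exposing `speech_encoder`, `adaptor`, and `llm` attributes as in the sketch above; the function itself is hypothetical:

```python
def set_training_stage(model, stage: int):
    """Toggle which parameters train in each of the two stages."""
    # The speech encoder stays frozen in both stages.
    for p in model.speech_encoder.parameters():
        p.requires_grad = False

    if stage == 1:
        # Modality adaptation pre-training: only the adaptor learns,
        # supervised by ASR data so audio embeddings line up with text.
        for p in model.llm.parameters():
            p.requires_grad = False
        for p in model.adaptor.parameters():
            p.requires_grad = True
    elif stage == 2:
        # Cross-modal instruction fine-tuning: adaptor and LLM both update.
        for p in model.adaptor.parameters():
            p.requires_grad = True
        for p in model.llm.parameters():
            p.requires_grad = True
    else:
        raise ValueError("stage must be 1 or 2")
```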
Data Collection
A significant contribution of the paper is the LLaSM-Audio-Instructions dataset, comprising 199,000 conversations and 508,000 samples. Built from publicly available ASR datasets and text-to-speech synthesis, it pairs text with generated speech and thereby enables cross-modal instruction tuning.
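One way such data can be assembled is to synthesize speech for the human turns of existing text conversations while keeping the assistant turns as text targets. The sketch below is illustrative only; the `tts` callable, the field names, and the record layout are assumptions, not the paper's actual format:

```python
def build_audio_instruction_sample(conversation, tts):
    """Turn one text conversation into a cross-modal sample by generating
    speech for each human turn. `tts(text) -> wav_path` is a placeholder
    for whatever text-to-speech backend is used."""
    sample = {"conversations": []}
    for turn in conversation:
        if turn["from"] == "human":
            wav_path = tts(turn["value"])  # hypothetical TTS call
            sample["conversations"].append(
                {"from": "human", "audio": str(wav_path), "text": turn["value"]}
            )
        else:
            # Assistant turns stay as plain text targets.
            sample["conversations"].append({"from": "gpt", "text": turn["value"]})
    return sample
```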
Key Results
The experiments demonstrate LLaSM's ability to follow speech and language instructions in both English and Chinese. Consuming speech directly, rather than converting it to text in a separate preprocessing step, keeps the pipeline simple and lets a single model serve multiple languages and usage scenarios.
Implications and Future Work
The introduction of LLaSM addresses a critical gap in the existing landscape of multi-modal models by demonstrating the feasibility and benefits of integrating speech as a core interaction modality. It not only provides a more natural interaction framework for users but also sets the stage for further advancements in AI communication models.
The release of the LLaSM-Audio-Instructions dataset paves the way for future research in more sophisticated, cross-modal instruction-following systems. The paper suggests potential expansions in combining vision and audio modalities, indicating a path towards more holistic multi-modal AI systems.
By leveraging existing LLM and encoder technologies, LLaSM delivers a resource-efficient solution with practical applications in AI–human interaction. Further exploration of combining multiple sensory inputs could enhance the capabilities of general-purpose AI systems, marking a step toward more advanced human-AI interaction models.