- The paper introduces the SSR-Connector, which uses speech-text alignment to segment and compress speech features so that they match the granularity of text embeddings.
- It employs a two-stage training pipeline, comprising distillation and multitask fine-tuning, to mitigate catastrophic forgetting.
- Experiments show significant gains on SLU and Speech-MMLU benchmarks, outperforming prior SpeechLM baselines.
SSR: Alignment-Aware Modality Connector for Speech LLMs
The paper "SSR: Alignment-Aware Modality Connector for Speech LLMs" proposes an innovative approach for integrative Speech-LLMs (SpeechLMs), addressing significant challenges previously encountered in the field. Specifically, the SSR-Connector (Segmented Speech Representation Connector) is introduced as an efficient and effective solution for modality fusion between speech and pre-trained LLMs.
Introduction and Motivation
The research addresses the inefficiencies and limitations of current SpeechLMs, focusing on two primary pain points: the inefficient encoding of long-form speech and catastrophic forgetting of the pre-trained text capabilities. Two main strategies are commonly employed for modality integration in SpeechLMs: connector-based methods and unit-based fusion. However, both have drawbacks in terms of efficiency and preservation of the LLM's pre-trained abilities.
Proposed Method: SSR-Connector
The SSR-Connector leverages speech-text alignments to segment and compress speech features, matching them to the granularity of text embeddings. To mitigate catastrophic forgetting, training proceeds in two stages. The first stage is distillation, in which speech features are mapped into the semantic space of the LLM's text embeddings. The second stage is fine-tuning, which teaches the LLM to process the adapted features without significantly degrading its original text capabilities.
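To make the stage-1 idea concrete, below is a minimal sketch of a distillation objective. The L2-plus-cosine combination, the tensor shapes, and the function name are illustrative assumptions, not the paper's exact loss:

```python
import torch.nn.functional as F

def distillation_loss(speech_reprs, text_embeds):
    """Stage-1 objective sketch: pull the connector's compressed speech
    representations toward the frozen LLM's text embeddings for the
    transcript. Because segmentation is alignment-aware, both tensors
    hold one vector per text token, shape (batch, seq, dim).
    """
    # Match the target embeddings in magnitude...
    l2 = F.mse_loss(speech_reprs, text_embeds)
    # ...and in direction, which matters for LLM input spaces.
    cos = 1.0 - F.cosine_similarity(speech_reprs, text_embeds, dim=-1).mean()
    return l2 + cos
```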
Detailed Mechanism
The SSR-Connector comprises a speech-text aligner and a feature compressor. The aligner maps speech features to their corresponding transcription, enabling segmentation into chunks that match the granularity of text tokens. A Transformer decoder then compresses each segmented chunk of speech features into a representation akin to a text embedding.
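Here is one plausible PyTorch realization of such a decoder-based compressor; the learned-query design, layer sizes, and per-segment loop are illustrative choices rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SegmentCompressor(nn.Module):
    """Sketch of the compressor: for each aligned segment of speech
    frames, a Transformer decoder cross-attends from a single learned
    query into that segment's frames, yielding one vector per text
    token. Dimensions are illustrative, not the paper's config.
    """
    def __init__(self, dim=1024, n_heads=16, n_layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned query

    def forward(self, frames, boundaries):
        # frames: (T, dim) speech features; boundaries: list of (start, end)
        # frame-index pairs from the speech-text aligner, one per text token.
        out = []
        for start, end in boundaries:
            segment = frames[start:end].unsqueeze(0)    # (1, seg_len, dim)
            pooled = self.decoder(self.query, segment)  # (1, 1, dim)
            out.append(pooled.squeeze(0))
        return torch.cat(out, dim=0)                    # (n_tokens, dim)
```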
Alignment Techniques
The paper experiments with several alignment techniques:
- UnitY2 Aligner
- CTC-based Aligners (Char-CTC and Sub-CTC)
- Continuous Integrate-and-Fire (CIF)
Among these, the UnitY2 aligner and Char-CTC demonstrated superior performance in aligning speech features with text embeddings, as evidenced by low word error rates (WER) on ASR tasks.
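As a rough illustration of how a character-level CTC model yields segment boundaries, the greedy sketch below reads spans off the frame-level argmax path. A real pipeline would force-align against the reference transcript rather than decode greedily; this approximation is only meant to show the frames-to-spans step:

```python
def ctc_boundaries(log_probs, blank=0):
    """Sketch of Char-CTC segmentation: convert runs of non-blank
    frame-level argmax labels into (start, end) frame spans, one per
    emitted character. log_probs: (T, vocab) CTC emissions.
    """
    labels = log_probs.argmax(dim=-1)      # (T,) frame-level labels
    spans, prev, start = [], blank, 0
    for t, lab in enumerate(labels.tolist()):
        if lab != prev:
            if prev != blank:
                spans.append((start, t))   # close the previous character span
            start = t
        prev = lab
    if prev != blank:
        spans.append((start, len(labels))) # close a span that runs to the end
    return spans
```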
Experimental Results
Spoken Language Understanding (SLU):
The SSR-Connector outperforms previous models across several SLU tasks, including sWUGGY, sBLIMP, and StoryCloze (in both speech and text modalities). The alignment-aware method delivers significant improvements in speech understanding while outperforming prior SpeechLMs such as SpiritLM and VoxtLM.
Speech-MMLU and Prompt-based ASR:
The SSR-Connector excels on Speech-MMLU tasks, which require cross-modal understanding, achieving a +20 improvement in accuracy. The model also transcribes effectively in both zero-shot and few-shot prompt-based ASR settings, outperforming baseline models.
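To illustrate what prompt-based ASR looks like at inference time, here is a hypothetical prompt template; the wording and the `<SPEECH>` placeholder are assumptions, with the compressed speech representations spliced in at the placeholder positions:

```python
def asr_prompt(few_shot_transcripts=()):
    """Hypothetical prompt layout for zero-/few-shot prompt-based ASR.
    The template wording is assumed, not quoted from the paper."""
    header = "Transcribe the speech into text.\n"
    shots = "".join(
        f"Speech: <SPEECH>\nTranscript: {t}\n" for t in few_shot_transcripts
    )
    return header + shots + "Speech: <SPEECH>\nTranscript:"

# Zero-shot: asr_prompt(); one-shot: asr_prompt(["hello world"])
```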
Fine-tuning and Catastrophic Forgetting
To address catastrophic forgetting during modality fusion, the paper explores various fine-tuning strategies:
- Vanilla Fine-tuning
- LoRA Fine-tuning
- Multitask Fine-tuning
Multitask fine-tuning proved to be the most effective strategy, balancing improvements in speech understanding with minimal degradation of text capabilities.
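To illustrate the multitask idea, here is a sketch of a training step that mixes a speech-conditioned LM loss with a text-only LM loss so the model retains its pre-trained text ability while learning the new modality. The Hugging Face-style model interface, the batch fields, and the mixing weight `alpha` are assumptions, not the paper's exact recipe:

```python
def multitask_step(llm, connector, speech_batch, text_batch, alpha=0.5):
    """Stage-2 multitask fine-tuning sketch: average a speech-input LM
    loss with an ordinary text LM loss to limit catastrophic forgetting.
    """
    # Speech path: compressed speech representations replace text embeddings.
    speech_inputs = connector(speech_batch["features"], speech_batch["boundaries"])
    speech_loss = llm(inputs_embeds=speech_inputs.unsqueeze(0),
                      labels=speech_batch["labels"]).loss
    # Text path: a standard next-token prediction batch keeps text skills intact.
    text_loss = llm(input_ids=text_batch["input_ids"],
                    labels=text_batch["labels"]).loss
    return alpha * speech_loss + (1 - alpha) * text_loss
```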
Implications and Future Directions
The SSR-Connector sets a new precedent for multimodal LLMs, demonstrating that alignment-aware, segmented feature compression can substantially improve both the performance and the efficiency of SpeechLMs. This research has practical implications for developing more robust and efficient multimodal NLP systems, especially in applications that require seamless integration of speech and text modalities.
Future research could explore further optimization of alignment techniques, more sophisticated compression strategies, and the integration of additional modalities like vision or structured data, extending the versatility and applicability of multimodal LLMs.
Conclusion
The SSR-Connector represents a significant advancement in the field of SpeechLMs, aligning speech features precisely with text embeddings and addressing the persistent issue of catastrophic forgetting through a well-structured training pipeline. While it achieves remarkable results across several benchmarks, continued research and refinement could unlock even greater potential for multimodal LLMs.