Soundwave: Less is More for Speech-Text Alignment in LLMs (2502.12900v1)

Published 18 Feb 2025 in cs.CL, cs.AI, and cs.SD

Abstract: Existing end-to-end speech LLMs usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.

The paper presents a comprehensive framework for data‐efficient speech–text alignment in LLMs through a carefully designed three‐stage training strategy. It addresses two central challenges when interfacing speech with text: (1) the spatial misalignment between representations produced by pretrained audio encoders and those expected by text models, and (2) the discrepancy in sequence lengths between audio (frame‐level) inputs and sub-word text tokens. The proposed solution integrates specialized modules and training stages that jointly mitigate these mismatches while significantly reducing the amount of required training data.

Three-Stage Training Strategy

  • Stage I: Alignment
    • An auxiliary connectionist temporal classification (CTC) loss is applied to an alignment adapter that bridges the gap between the audio encoder’s output and the LLM’s embedding space.
    • High-quality speech transcription data (selected with a word error rate below 10%) is used to accelerate convergence.
    • This stage decouples representation alignment from the end-to-end task, allowing fast training of only a few additional parameters (a minimal CTC-loss sketch follows this list).
  • Stage II: Shrinking
    • A shrinking adapter compresses the long audio sequence while retaining critical semantic and paralinguistic information.
    • The method uses the peak probabilities from the CTC output to determine token boundaries and applies a cross-attention mechanism to fuse auxiliary speech cues (e.g., tone, pitch) with the selected content features (see the shrinking sketch after this list).
    • Additionally, a dynamic data mixture strategy guided by temperature-based sampling balances learning between audio tasks with widely imbalanced data scales and text-based tasks.
  • Stage III: Supervised Fine-Tuning (SFT)
    • In the final stage, only low-rank adaptation (LoRA) parameters are tuned while freezing most of the model, thereby refining the system for diverse downstream tasks and enabling direct responses based on speech input.
    • The authors incorporate both speech and text instructions, along with chain-of-thought (CoT) reasoning, to help the model handle complex queries and preserve the rich knowledge encoded in LLMs.
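
As noted in the Stage I item above, the sketch below illustrates the auxiliary CTC objective on top of an alignment adapter in PyTorch. The dimensions, the vocabulary and blank handling, and the exact adapter composition are illustrative assumptions, not the paper's reported configuration.

```python
import torch.nn as nn

# Stage I sketch: the frozen audio encoder's outputs pass through the alignment
# adapter, and an auxiliary CTC head over the text vocabulary ties the adapted
# features to the transcript. Dimensions and blank handling are assumptions.
d_audio, d_text, vocab_size = 1280, 4096, 32000
adapter = nn.Sequential(
    nn.Linear(d_audio, d_text),
    nn.TransformerEncoderLayer(d_model=d_text, nhead=8, batch_first=True),
)
ctc_head = nn.Linear(d_text, vocab_size + 1)          # +1 for the CTC blank symbol
ctc_loss = nn.CTCLoss(blank=vocab_size, zero_infinity=True)

def alignment_step(audio_feats, transcript_ids, feat_lens, text_lens):
    # audio_feats: (B, T, d_audio) frozen encoder outputs; transcript_ids: (B, L) token ids
    hidden = adapter(audio_feats)                      # (B, T, d_text)
    log_probs = ctc_head(hidden).log_softmax(-1)       # (B, T, vocab_size + 1)
    # nn.CTCLoss expects time-major log-probabilities of shape (T, B, C)
    return ctc_loss(log_probs.transpose(0, 1), transcript_ids, feat_lens, text_lens)
```

In this reading only the adapter and the CTC head receive gradients, consistent with the summary's note that Stage I trains just a few additional parameters.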
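
For the Stage II shrinking adapter, the following is a minimal sketch of one plausible reading of CTC-peak-based token selection followed by cross-attention fusion. The greedy peak rule, the single attention layer, and the module name are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class ShrinkingAdapterSketch(nn.Module):
    """Sketch only: keep frames at CTC peaks as content tokens, then let them
    attend over all frames so auxiliary cues (tone, pitch) are fused back in."""

    def __init__(self, d_model: int, n_heads: int = 8, blank_id: int = 0):
        super().__init__()
        self.blank_id = blank_id
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frames: torch.Tensor, ctc_logits: torch.Tensor) -> torch.Tensor:
        # frames: (T, d) adapted features for one utterance; ctc_logits: (T, V) per-frame logits
        labels = ctc_logits.argmax(dim=-1)                       # greedy CTC path, shape (T,)
        prev = torch.cat([labels.new_full((1,), self.blank_id), labels[:-1]])
        keep = (labels != self.blank_id) & (labels != prev)      # a peak starts a new non-blank token
        content = frames[keep] if keep.any() else frames         # (T', d) with T' much smaller than T
        # Queries are the shortened content sequence; keys/values are all frames.
        fused, _ = self.cross_attn(content.unsqueeze(0), frames.unsqueeze(0), frames.unsqueeze(0))
        return fused.squeeze(0)                                  # (T', d) tokens handed to the LLM
```

The selected sequence, not the full frame sequence, is what the LLM consumes, which is where the reported sequence-length compression and inference speed-ups come from.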

Model Architecture and Implementation

  • The framework employs Whisper Large V3 as the pretrained audio encoder and Llama-3.1-8B-Instruct as the base LLM.
  • Two adapters are introduced: the alignment adapter (a projection layer followed by a Transformer layer) and the shrinking adapter (which performs token selection and auxiliary information fusion via cross-attention).
  • The use of LoRA during fine-tuning enables efficient adaptation with a minimal number of additional trainable parameters (an assembly sketch follows this list).
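
The snippet below sketches how these backbone components might be wired together with Hugging Face transformers and peft, including the Stage III LoRA setup. The LoRA rank, alpha, target modules, and the adapter's attention-head count are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel
from peft import LoraConfig, get_peft_model

# Load the pretrained backbones named in the paper (Llama-3.1 weights require
# the usual Hugging Face access approval).
audio_encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

d_audio = audio_encoder.config.d_model        # 1280 for Whisper Large V3
d_text = llm.config.hidden_size               # 4096 for Llama-3.1-8B

# Alignment adapter as described above: a projection followed by one Transformer layer.
alignment_adapter = nn.Sequential(
    nn.Linear(d_audio, d_text),
    nn.TransformerEncoderLayer(d_model=d_text, nhead=8, batch_first=True),
)

# Stage III: freeze the backbone and tune only LoRA parameters (rank and targets assumed).
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()              # confirms only the LoRA weights are trainable
```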

Experimental Evaluation and Numerical Results

  • The model is trained on approximately 10k hours of filtered speech data drawn from high-quality ASR and sound datasets, yet uses only a fraction of the data required by previous systems such as Qwen2-Audio.
  • On foundational tasks, the paper reports:
    • Speech Translation: a BLEU score of 30.6 on CoVoST2 En-De, outperforming advanced baselines.
    • Speech Emotion Recognition: accuracy improves from 55.3% to 63.5% relative to established models.
    • Zero-Shot Performance: notable gains in multilingual speech translation, e.g., BLEU rises from 20.7 to 27.0 on En-Nl.
  • On the AIR-Bench speech foundation tasks, the aggregated performance of the proposed approach exceeds competing models by substantial margins on several subtasks (e.g., gender and entity recognition), although it underperforms slightly on pure ASR, which the authors attribute to the limited size of the training data.

Ablation Studies and Analysis

  • Convergence curves illustrate that incorporating the alignment stage leads to a rapid loss decrease within the first hundred training steps.
  • Comparisons of feature similarity between audio and text representations show that the method substantially narrows the representation space gap compared with alternative approaches.
  • An analysis of different shrinking strategies confirms that the proposed approach maintains performance at compression ratios as low as 2.5%, yielding inference speed-ups of up to 25% over certain baselines.

Data Engineering and Dynamic Mixture Strategy

  • The system benefits from careful data engineering, including cleaning of the ASR and sound data and the use of SpecAugment, which together mitigate training instability.
  • The dynamic data mixture strategy employs temperature-based re-sampling to balance data-rich and data-scarce tasks during Stage II, so that both heavily resourced and sparsely labeled tasks contribute effectively to training (a sketch of this re-sampling scheme follows).
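
A common formulation of temperature-based re-sampling is sketched below: each task is drawn with probability proportional to its dataset size raised to 1/T, so larger temperatures flatten the mixture toward scarce tasks. The task names and sizes are hypothetical, and the paper's exact scheme may differ.

```python
import random

def temperature_sampling_weights(dataset_sizes: dict, temperature: float = 3.0) -> dict:
    # p_i is proportional to n_i ** (1 / T); T = 1 reproduces size-proportional
    # sampling, while larger T upsamples scarce tasks relative to abundant ones.
    weights = {task: n ** (1.0 / temperature) for task, n in dataset_sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

# Hypothetical task mix (hours of data); values are illustrative only.
sizes = {"asr": 8000, "sound_events": 1500, "emotion": 300, "text_instructions": 5000}
probs = temperature_sampling_weights(sizes)
task_for_next_batch = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```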

Limitations and Future Work

  • The paper acknowledges the need to validate the approach on larger-scale models to assess scalability.
  • Current results indicate challenges in music understanding and multilingual support, suggesting that extending annotated sound data and diverse multilingual datasets is a necessary direction for future research.
  • Finally, the authors plan to explore music and cross-lingual capabilities to further generalize the framework.

Overall, the contributions of the paper lie in its innovative modular training strategy that effectively addresses cross-modal representation and sequence length inconsistencies, yielding competitive performance with significantly reduced training data compared to existing large speech LLMs.

Authors
  1. Yuhao Zhang
  2. Zhiheng Liu
  3. Fan Bu
  4. Ruiyu Zhang
  5. Benyou Wang
  6. Haizhou Li