Linear-Complexity Self-Supervised Learning for Speech Processing (2407.13377v1)

Published 18 Jul 2024 in cs.CL, cs.AI, and eess.AS

Abstract: Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not been explored for SSL yet. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance for the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of wav2vec 2.0 model by 18% and by 23%, respectively, leading to the pre-training of a 155M wav2vec 2.0 model finished within one week with 4 Tesla A100 GPUs. Code is available at https://github.com/SamsungLabs/SummaryMixing.

Summary

  • The paper introduces SummaryMixing, a linear-complexity context encoder that replaces the quadratic-complexity multi-headed self-attention in self-supervised learning models for speech processing.
  • SummaryMixing uses a two-branch architecture to achieve linear complexity, resulting in up to an 18% reduction in pre-training time and 23% less VRAM usage compared to the standard attention mechanism.
  • Evaluations show SummaryMixing matches or outperforms the standard attention mechanism on downstream tasks such as automatic speech recognition (ASR), intent classification (IC), emotion recognition (ER), and automatic speaker verification (ASV), making self-supervised learning models more accessible and sustainable.

Linear-Complexity Self-Supervised Learning for Speech Processing

The paper "Linear-Complexity Self-Supervised Learning for Speech Processing" introduces an innovative approach to addressing the computational inefficiencies associated with self-supervised learning (SSL) models for speech processing. The authors focus on reducing the pre-training costs, both in terms of time and hardware resources, by replacing the traditional multi-headed self-attention (MHSA) mechanism with a more efficient alternative named SummaryMixing.

Background and Problem Statement

SSL models have excelled in speech processing tasks by leveraging vast amounts of unlabeled data to learn informative representations. However, the quadratic complexity of MHSA, which serves as the context encoder in such models, results in significant computational demand, requiring extensive GPU resources and time for pre-training. This not only poses financial costs but also environmental concerns due to high energy consumption.
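
For concreteness, the quadratic cost comes from the attention score matrix. For an utterance of T frames with query, key, and value projections Q, K, V of dimension d, standard scaled dot-product attention computes

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad \text{cost } \mathcal{O}(T^{2} d),
\]

since the score matrix QK^T alone has T x T entries. For long speech utterances this term dominates both the compute and the peak memory of pre-training.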

SummaryMixing: A Linear-Complexity Context Encoder

To address this, the paper explores SummaryMixing as a linear-complexity context encoder capable of matching MHSA in SSL settings. Each SummaryMixing layer splits its computation into two branches: a local branch that captures fine-grained, per-frame information through a point-wise feed-forward network, and a summary branch that captures global context by averaging transformed input vectors over time. This reduces time and memory complexity from quadratic to linear in the sequence length while matching or exceeding MHSA performance on a range of speech processing tasks.
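
As a concrete illustration, the following is a minimal PyTorch sketch of such a two-branch mixing layer. The hidden sizes, activation, and concatenation-based combiner are illustrative assumptions rather than the authors' released implementation; the linked SpeechBrain repository is the reference.

```python
# Minimal sketch of a two-branch, linear-complexity mixing layer in the spirit
# of SummaryMixing. Layer sizes and the combiner are illustrative assumptions;
# the released SpeechBrain code is the authoritative implementation.
import torch
import torch.nn as nn


class SummaryMixingSketch(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 512):
        super().__init__()
        # Local branch: point-wise feed-forward network applied to every frame.
        self.local = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_hidden)
        )
        # Summary branch: point-wise transformation whose outputs are averaged
        # over time into a single global summary vector.
        self.summary = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_hidden)
        )
        # Combiner: merges each frame's local features with the global summary.
        self.combine = nn.Sequential(
            nn.Linear(2 * d_hidden, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                                # (B, T, H)
        summary = self.summary(x).mean(dim=1, keepdim=True)  # (B, 1, H): O(T), no T x T matrix
        summary = summary.expand(-1, x.size(1), -1)          # broadcast the summary to all frames
        return self.combine(torch.cat([local, summary], dim=-1))


if __name__ == "__main__":
    layer = SummaryMixingSketch()
    out = layer(torch.randn(2, 100, 768))  # two utterances, 100 frames each
    print(out.shape)  # torch.Size([2, 100, 768])
```

Because the only cross-frame operation is the time average, compute and memory grow linearly with the number of frames rather than quadratically.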

Experimental Results

The wav2vec 2.0 (w2v2) model with SummaryMixing was evaluated against its MHSA counterpart. Pre-training used the Libri-Light Medium subset, with Mel filterbanks followed by a shallow 1D CNN replacing the original convolutional feature extractor to further reduce front-end cost. The SummaryMixing-based w2v2 reduced pre-training time by 18% and peak VRAM by 23% compared to the MHSA-based w2v2.
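
For orientation, a front end of this kind can be sketched as below. The Mel-bin count, CNN width, kernel size, and stride are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a Mel-filterbank + shallow 1D CNN front end standing in for the
# original wav2vec 2.0 raw-waveform convolutional feature extractor.
# Bin count, channel width, and stride are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio


class MelCNNFrontEnd(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 768):
        super().__init__()
        # 25 ms windows with a 10 ms hop at 16 kHz.
        self.fbank = torchaudio.transforms.MelSpectrogram(
            sample_rate=16_000, n_fft=400, hop_length=160, n_mels=n_mels
        )
        # A single strided 1D convolution is far cheaper than the original
        # multi-layer CNN operating directly on the raw waveform.
        self.cnn = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) at 16 kHz
        feats = self.fbank(wav).clamp(min=1e-10).log()  # (B, n_mels, frames)
        return self.cnn(feats).transpose(1, 2)          # (B, frames', d_model)


if __name__ == "__main__":
    frontend = MelCNNFrontEnd()
    print(frontend(torch.randn(2, 16_000)).shape)  # one second of audio per utterance
```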

Downstream task evaluations on ASR, IC, ER, and ASV, conducted through the MP3S benchmark, showed that SummaryMixing matches or outperforms MHSA. Notably, the SummaryMixing w2v2 achieved lower word error rates than the MHSA baseline for ASR on both high-resource (LibriSpeech) and low-resource (Welsh and Basque) datasets. SummaryMixing also generalized better for IC and ASV while remaining competitive for ER.

Implications and Future Directions

This work provides a pivotal step towards more efficient SSL models, highlighting SummaryMixing's potential to significantly cut the computational burden of pre-training and to make SSL approaches more accessible and environmentally sustainable. The framework opens several avenues for future exploration, notably its application to SSL models beyond wav2vec 2.0, fine-tuning strategies with SummaryMixing, and further architectural optimization of the context encoder.

By reducing the computational requirements of SSL, the approach promises to democratize access to high-performance speech models, facilitating broader research and industrial applications without necessitating extensive hardware resources.