- The paper proposes a two-stage framework using FedLoRA to reduce communication overhead by up to 96.5% on speech-to-text tasks.
- It integrates a kNN-based FedMem component to capture client-specific speech data and enhance model personalization despite data heterogeneity.
- Experimental results on CoVoST and GigaSpeech demonstrate competitive performance compared to centralized training.
Overview of Personalized Federated Learning for S2T
Federated learning (FL) is highly relevant to speech-to-text (S2T) tasks because it preserves privacy and helps satisfy data-protection requirements. FL enables collaborative training of a global model without sharing private client data, which is valuable in applications such as automatic speech recognition (ASR) and speech translation (ST). However, FL faces two challenges: heavy communication overhead and performance degradation caused by data heterogeneity across clients. To address these issues, the paper introduces an efficient, personalized FL framework for S2T tasks built on two strategies: FedLoRA and FedMem.
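For context, here is a minimal sketch of the generic FedAvg-style round that frameworks like this build on (not the paper's exact protocol); the function names and weighting scheme are illustrative assumptions:

```python
# Generic FedAvg-style round (illustrative, not the paper's protocol):
# clients train locally and only model parameters -- never raw speech
# data -- are uploaded for weighted averaging on the server.
import numpy as np

def local_update(global_params, client_data):
    """Hypothetical client-side step: start from the global parameters and
    run local training on private data (training loop omitted here)."""
    local_params = {name: p.copy() for name, p in global_params.items()}
    # ... gradient steps on client_data would update local_params ...
    return local_params

def fedavg_aggregate(client_params, client_sizes):
    """Server-side step: average client parameters, weighted by local data size."""
    total = float(sum(client_sizes))
    return {
        name: sum((n / total) * p[name]
                  for n, p in zip(client_sizes, client_params))
        for name in client_params[0]
    }
```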
FedLoRA and FedMem: The Two-Stage Solution
The proposed FL framework operates in two stages. The first stage, FedLoRA, reduces communication overhead by freezing the large pre-trained model and fine-tuning only a lightweight Low-Rank Adaptation (LoRA) module. Because clients exchange only these small LoRA parameters with the server, communication and computation costs drop substantially.
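A minimal sketch of how the FedLoRA idea can be realized, assuming a standard LoRA parameterization; the dimensions, rank, and initialization below are illustrative, not the paper's settings:

```python
# Illustrative FedLoRA-style layer: the backbone weight W stays frozen on
# every client, and only the small low-rank factors A and B are trained
# and exchanged with the server. Shapes and rank are assumptions.
import numpy as np

d_out, d_in, rank = 1024, 1024, 8           # rank << d_in, d_out

W = np.random.randn(d_out, d_in)            # frozen backbone weight (never uploaded)
A = np.random.randn(rank, d_in) * 0.01      # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection (zero init: no change at start)

def lora_forward(x):
    """y = W x + B (A x): only A and B receive gradients and are sent to the server."""
    return W @ x + B @ (A @ x)

# Communication payload per round is just A and B.
uploaded_fraction = (A.size + B.size) / W.size
print(f"uploaded fraction of the layer: {uploaded_fraction:.3%}")  # ~1.6% in this toy setup
```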
The second stage, FedMem, equips the global model with a k-nearest-neighbor (kNN) classifier that captures client-specific speech characteristics. Each client memorizes key representations of its own data; during inference the global model retrieves these memorized entries and blends them into its predictions, personalizing the output and mitigating the performance loss caused by data heterogeneity across clients.
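A hedged sketch of the memorization-retrieval step in the spirit of FedMem, following the common kNN-augmented decoding recipe; the interpolation weight, temperature, and k below are assumed hyper-parameters, not values from the paper:

```python
# kNN memorization-retrieval sketch (assumed formulation, not the paper's exact
# method): each client stores (hidden-state, target-token) pairs from its own
# data, and at inference the model's token distribution is interpolated with a
# distribution built from the k nearest stored keys.
import numpy as np

def knn_augmented_probs(hidden, model_probs, keys, values, vocab_size,
                        k=8, lam=0.3, tau=10.0):
    """Blend the model's distribution with a kNN distribution over memorized tokens."""
    dists = np.linalg.norm(keys - hidden, axis=1)      # L2 distance to every stored key
    nn_idx = np.argsort(dists)[:k]                     # indices of the k nearest entries
    weights = np.exp(-dists[nn_idx] / tau)             # closer neighbours count more
    weights /= weights.sum()

    knn_probs = np.zeros(vocab_size)
    for w, token in zip(weights, values[nn_idx]):      # scatter weights onto retrieved tokens
        knn_probs[token] += w

    return lam * knn_probs + (1.0 - lam) * model_probs
```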
Experimental Validation
The framework's efficacy is demonstrated on two benchmark datasets: CoVoST, reflecting dialect variation, and GigaSpeech, representing multi-domain speech. The results show that FedLoRA reduces communication overhead by up to 96.5% while maintaining competitive or superior performance compared to centralized training. Incorporating FedMem further enhances personalization, improving robustness to differences in data distribution across clients.
Conclusion and Future Work
This paper makes a compelling case for a personalized FL framework tailored to S2T tasks. By combining parameter-efficient training with a personalization mechanism, it cuts bandwidth requirements significantly without sacrificing accuracy, which makes the method an attractive option for federated S2T. Future work could explore alternative memorization-retrieval methods to further improve inference speed and performance.