In the field of Federated Learning (FL), where the goal is to build a robust global model using data distributed across multiple clients without centralizing the raw data, researchers have recently introduced an innovative framework named FLea. This framework tackles some of the critical challenges facing FL, particularly when the available datasets are small and exhibit significant disparities in label distribution, a common scenario in real-world applications like edge computing and geographically bound data collection.
The central issue with these limited, label-skewed datasets is that they cause local models to overfit and develop a bias towards the labels they see most frequently. When these biased models are aggregated, the resulting global model inherits the bias and generalizes poorly. Existing methods, whether loss-based or data-augmentation-based, are either insufficient to address this dual challenge of data scarcity and label skew, or compromise privacy in doing so.
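To make the label-skew setting concrete, a common way to simulate it in FL experiments is to partition a dataset across clients using per-class Dirichlet proportions, where a small concentration parameter yields highly skewed local label distributions. The sketch below illustrates that simulation in general; it is not FLea's own data pipeline, and the parameter names are illustrative.

```python
import numpy as np

def dirichlet_label_skew(labels, num_clients, alpha=0.1, seed=0):
    """Partition sample indices across clients with label skew.

    Smaller alpha -> more skewed per-client label distributions.
    A generic simulation of label skew, not FLea's data pipeline.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        # Draw this class's share per client from a Dirichlet prior.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, part in enumerate(np.split(cls_idx, splits)):
            client_indices[client].extend(part.tolist())
    return client_indices
```

With `alpha=0.1`, most clients end up dominated by one or two classes, reproducing the bias-inducing conditions described above; `alpha=100` approaches an i.i.d. split.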
FLea offers a solution by allowing clients to share features, rather than raw data, to enrich local training while preserving privacy. The features in question come from an intermediate layer of the neural network and are intentionally obscured before being shared to ensure sensitive information from the original data is not exposed. This exchange of obfuscated features creates a global proxy that enhances local models by providing a more diverse, representative sample of data.
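One simple way to obscure intermediate-layer features before sharing is to interpolate each feature (and its label) with another randomly chosen local sample, so that no single raw activation leaves the client. The sketch below shows this mixup-style obfuscation under that assumption; FLea's exact obfuscation mechanism may differ, and `mix_ratio` is an illustrative parameter.

```python
import numpy as np

def obfuscate_features(features, labels_onehot, mix_ratio=0.6, seed=0):
    """Mixup-style obfuscation of intermediate-layer features.

    Each shared feature is a convex combination of two local samples'
    features, so the exact activation of any one sample is never
    exposed. A simplified sketch of the idea, not FLea's exact method.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(features))
    mixed_f = mix_ratio * features + (1 - mix_ratio) * features[perm]
    mixed_y = mix_ratio * labels_onehot + (1 - mix_ratio) * labels_onehot[perm]
    return mixed_f, mixed_y
```

The mixed pairs `(mixed_f, mixed_y)` are what a client would contribute to the global feature proxy, which other clients then use to diversify their local training data.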
Empirical results demonstrate that FLea significantly outperforms both loss-based and data-augmentation-based baselines across varied levels of data scarcity and label skew, with improvements of up to 17.6% over state-of-the-art FL methods. Importantly, these gains do not come at the expense of privacy: FLea is designed to protect data privacy better than methods that share raw data directly or share data averaged over mini-batches.
The key components of FLea include a feature augmentation strategy that leverages both local and global features, and a knowledge distillation technique to prevent local model bias. The feature-level augmentation is a novel approach that sidesteps the need for sharing raw data entirely, instead sharing features that have been processed to maintain their usefulness for classification while reducing privacy risks.
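The two components can be combined into a single local training objective: a classification loss on the (locally and globally) augmented features, plus a distillation term that pulls the local model's predictions toward the global model's softened outputs to curb label bias. The sketch below is a minimal numpy illustration of that combination; the hyperparameter names (`kd_weight`, `temperature`) are illustrative, not FLea's exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def flea_style_loss(local_logits, global_logits, targets_onehot,
                    kd_weight=0.5, temperature=2.0):
    """Cross-entropy on augmented features plus a KL distillation term
    toward the global model's softened predictions. An illustrative
    sketch of the combined objective, not FLea's exact loss."""
    probs = softmax(local_logits)
    ce = -np.mean(np.sum(targets_onehot * np.log(probs + 1e-12), axis=1))
    p_t = softmax(global_logits / temperature)  # teacher (global model)
    q_t = softmax(local_logits / temperature)   # student (local model)
    kd = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(q_t + 1e-12)),
                        axis=1))
    return ce + kd_weight * kd
```

Because the KL term is zero when local and global predictions agree, the distillation component only penalizes the local model when it drifts from the global model, which is exactly the drift that label skew induces.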
Although FLea introduces some additional communication and storage overhead due to feature sharing, its gains in performance and privacy preservation make it a promising approach for FL systems dealing with skewed and scarce data. However, there are trade-offs to consider in choosing which layer's features to share: earlier layers retain more information useful for classification but also more of the original input, while later layers are safer to share but less informative.
Looking forward, challenges remain in improving FLea's efficiency for real-world deployment and in further enhancing the privacy of shared features. The hope is that FLea can pave the way for more sophisticated FL systems that can operate effectively even in the most challenging data environments.