Federated Learning on Non-IID Data: A Survey (2106.06843v1)

Published 12 Jun 2021 in cs.LG and cs.DC

Abstract: Federated learning is an emerging distributed machine learning framework for privacy preservation. However, models trained in federated learning usually have worse performance than those trained in the standard centralized learning mode, especially when the training data are not independent and identically distributed (Non-IID) on the local devices. In this survey, we provide a detailed analysis of the influence of Non-IID data on both parametric and non-parametric machine learning models in both horizontal and vertical federated learning. In addition, current research work on handling challenges of Non-IID data in federated learning are reviewed, and both advantages and disadvantages of these approaches are discussed. Finally, we suggest several future research directions before concluding the paper.

Authors (4)
  1. Hangyu Zhu (12 papers)
  2. Jinjin Xu (8 papers)
  3. Shiqing Liu (7 papers)
  4. Yaochu Jin (108 papers)
Citations (630)

Summary

Federated Learning on Non-IID Data: A Survey

The paper "Federated Learning on Non-IID Data: A Survey" by Hangyu Zhu et al. provides a thorough examination of the challenges and strategies involved in Federated Learning (FL) when confronted with Non-IID (Non-Independent and Identically Distributed) data. This survey critically analyzes the implications of data heterogeneity on FL models and offers a comprehensive categorization of Non-IID scenarios, extending the current understanding in this burgeoning field.

Federated Learning is a pivotal approach in distributed machine learning, particularly in scenarios necessitating privacy preservation. Traditionally, centralized learning mandates data collation at a central location, raising privacy concerns. Conversely, FL decentralizes the learning process, enabling local model training and periodic aggregation at a central server without exposing raw data. However, a significant challenge arises when data distributed across clients exhibits Non-IID characteristics, often resulting in suboptimal model performance compared to a centrally trained model on IID data.
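
To make the aggregation step concrete, below is a minimal FedAvg-style sketch of one server-side averaging round, assuming each client returns its trained parameters as a dict of NumPy arrays together with its local dataset size. The function and variable names are illustrative, not code from the paper.

```python
# Minimal FedAvg-style aggregation sketch (illustrative; not code from the survey).
# Clients upload only model parameters, never raw data; the server averages them
# weighted by each client's local dataset size.
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average of client parameter dicts (name -> np.ndarray)."""
    total = float(sum(client_sizes))
    return {
        name: sum(w[name] * (n / total) for w, n in zip(client_weights, client_sizes))
        for name in client_weights[0]
    }

# One communication round (pseudocode): broadcast, local training, aggregation.
# global_w = fedavg_aggregate(
#     [local_train(global_w, data_k) for data_k in client_datasets],
#     [len(data_k) for data_k in client_datasets],
# )
```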

Key Contributions

The paper categorizes Non-IID data into several types, each impacting machine learning models differently:

  1. Attribute Skew: Differences in the feature space across clients, whose feature sets can be non-overlapping (vertical FL), partially overlapping, or fully overlapping (horizontal FL).
  2. Label Skew: Differences in label distributions across clients, further divided into label distribution skew and label preference skew; a common way to simulate this skew is sketched after the list.
  3. Temporal Skew: Changes in data distribution over time, affecting time-series and spatio-temporal data.
  4. Other Scenarios: Remaining cases such as combined attribute and label skew, as well as quantity skew (imbalanced data volumes across clients).
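
A common way to reproduce label distribution skew in experiments (referenced from the label-skew item above) is to partition a dataset across clients with per-class Dirichlet proportions, where a smaller concentration parameter yields more skewed clients. This is a standard simulation device in the FL literature; the sketch below uses illustrative names and is not an algorithm taken from the survey.

```python
# Sketch of simulating label distribution skew via a Dirichlet(alpha) split of each
# class across clients (smaller alpha = stronger skew). Illustrative only.
import numpy as np

def dirichlet_label_skew(labels: np.ndarray, num_clients: int, alpha: float = 0.5, seed: int = 0):
    """Return one index array per client, with Dirichlet-skewed class proportions."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        splits = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx, splits)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# Example: parts = dirichlet_label_skew(labels, num_clients=100, alpha=0.1)  # highly skewed
```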

The survey details how these heterogeneous data distributions challenge the federated learning process, particularly for parametric models like deep neural networks, which are sensitive to local data variances. This often leads to model divergence and slower convergence, necessitating specialized strategies to mitigate these effects.
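
One simple way to observe this effect is to track how far locally trained parameters drift from the aggregated global model. The relative-norm diagnostic below is an illustrative sketch under that assumption, not a formula quoted from the survey.

```python
# Illustrative measure of client drift under Non-IID data: mean relative L2 distance
# between each client's locally trained weights and the aggregated global model.
# Larger values typically accompany slower or less stable convergence.
import numpy as np

def weight_divergence(client_weights, global_weights):
    """Mean relative L2 distance between client parameter dicts and the global model."""
    divs = []
    for w in client_weights:
        num = np.sqrt(sum(np.sum((w[k] - global_weights[k]) ** 2) for k in global_weights))
        den = np.sqrt(sum(np.sum(global_weights[k] ** 2) for k in global_weights))
        divs.append(num / (den + 1e-12))
    return float(np.mean(divs))
```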

Approaches to Address Non-IID Challenges

Zhu et al. categorize existing strategies into three main approaches:

  • Data-Based Approaches: Including data sharing and augmentation techniques aimed at balancing label distributions across clients.
  • Algorithm-Based Approaches: Encompassing personalized models, robust aggregation methods, and novel optimization techniques. Methods such as local fine-tuning, meta-learning, and knowledge distillation are prevalent, enabling models to adapt to client-specific data (a minimal fine-tuning sketch follows this list).
  • System-Based Approaches: Implementation of client clustering and system-level optimizations to better manage data variability and resource allocation.
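
As a concrete illustration of the local fine-tuning idea referenced in the algorithm-based item above, the sketch below adapts a shared global model to one client's data with a few epochs of SGD. It assumes PyTorch-style model and data-loader objects; all names are placeholders rather than code from the paper.

```python
# Minimal sketch of personalization by local fine-tuning: each client starts from the
# aggregated global model and runs a few local gradient steps on its own data.
import copy
import torch

def personalize(global_model, local_loader, epochs: int = 1, lr: float = 0.01):
    """Return a client-specific model fine-tuned from the shared global model."""
    model = copy.deepcopy(global_model)          # keep the shared global model untouched
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in local_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```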

Each approach is critically evaluated for its effectiveness in improving model accuracy and convergence speed while maintaining privacy standards. However, the paper also notes trade-offs, such as the increased communication costs and privacy risks associated with data-sharing approaches.

Implications and Future Directions

While substantial progress has been made in addressing Non-IID challenges in FL, the paper acknowledges that many open questions remain. These include establishing rigorous privacy metrics, creating standardized benchmarks for algorithm comparison, and expanding the exploration of non-standard Non-IID scenarios, especially in vertical FL settings.

A promising avenue for future research lies in Federated Neural Architecture Search (FNAS), which seeks to optimize neural network architectures specifically for federated environments, while mitigating Non-IID effects.

In conclusion, this survey offers a detailed overview of the current landscape in handling Non-IID data in federated learning. It serves as a crucial reference for researchers aiming to improve the robustness and effectiveness of FL systems in real-world applications where data heterogeneity is the norm rather than the exception.