A Principled Approach to Data Valuation for Federated Learning (2009.06192v1)

Published 14 Sep 2020 in cs.LG, cs.CY, and stat.ML

Abstract: Federated learning (FL) is a popular technique to train ML models on decentralized data sources. In order to sustain long-term participation of data owners, it is important to fairly appraise each data source and compensate data owners for their contribution to the training process. The Shapley value (SV) defines a unique payoff scheme that satisfies many desiderata for a data value notion. It has been increasingly used for valuing training data in centralized learning. However, computing the SV requires exhaustively evaluating the model performance on every subset of data sources, which incurs prohibitive communication cost in the federated setting. Besides, the canonical SV ignores the order of data sources during training, which conflicts with the sequential nature of FL. This paper proposes a variant of the SV amenable to FL, which we call the federated Shapley value. The federated SV preserves the desirable properties of the canonical SV while it can be calculated without incurring extra communication cost and is also able to capture the effect of participation order on data value. We conduct a thorough empirical study of the federated SV on a range of tasks, including noisy label detection, adversarial participant detection, and data summarization on different benchmark datasets, and demonstrate that it can reflect the real utility of data sources for FL and has the potential to enhance system robustness, security, and efficiency. We also report and analyze "failure cases" and hope to stimulate future research.

Citations (175)

Summary

  • The paper introduces a federated Shapley value that adapts cooperative game theory to FL, enhancing fairness and addressing communication costs.
  • The paper develops efficient estimation methods using permutation sampling and group testing to approximate data values with minimal overhead.
  • The paper validates the approach on benchmarks like MNIST and CIFAR10, demonstrating its effectiveness in detecting noisy labels and adversarial contributions.

A Principled Approach to Data Valuation for Federated Learning

The paper "A Principled Approach to Data Valuation for Federated Learning" addresses the challenge of equitably appraising data sources within federated learning (FL) systems. FL is a machine learning paradigm in which models are trained on decentralized data sources, preserving data privacy and avoiding the legal constraints associated with data aggregation. The Shapley value (SV), a concept from cooperative game theory, provides a principled way to distribute rewards fairly among data contributors, but it has traditionally been applied in centralized training settings. This paper introduces a variant of the SV tailored to FL, accounting for the high communication costs and the sequential, round-based nature of decentralized training.
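For reference, the canonical SV of a data source is its average marginal contribution over all coalitions, and the federated variant (paraphrased here from the abstract's description, not a verbatim reproduction of the paper's notation) computes a per-round value among that round's participants and sums across rounds:

```latex
% Canonical Shapley value of source i in a set N of n sources,
% with utility function U evaluated on subsets of sources:
s_i = \frac{1}{n} \sum_{S \subseteq N \setminus \{i\}}
      \binom{n-1}{|S|}^{-1} \left[ U(S \cup \{i\}) - U(S) \right]

% Federated Shapley value (sketch): a round-t value among that
% round's participants D_t under the round-t utility U_t, summed
% over rounds; s_i^{(t)} = 0 when i does not participate in round t.
s_i^{(t)} = \frac{1}{|D_t|} \sum_{S \subseteq D_t \setminus \{i\}}
            \binom{|D_t| - 1}{|S|}^{-1}
            \left[ U_t(S \cup \{i\}) - U_t(S) \right],
\qquad
s_i = \sum_{t=1}^{T} s_i^{(t)}
```

Because each round's utility depends only on the updates submitted in that round, the per-round values can be computed server-side without extra communication.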

Key Contributions

  1. Federated Shapley Value: The authors propose the federated SV, a modification of the canonical SV suitable for FL. This value retains the core properties of the original SV—group rationality, fairness, and additivity—while adapting to the FL setting: it captures the effect of participation order and can be computed from per-round model updates without the extra communication cost that exact SV computation would require.
  2. Efficient Estimation: Given the computational intensity of determining SV, approximate methodologies are developed, including permutation sampling and group testing techniques. These strategies significantly reduce the computational overhead while providing robust approximations of data values for large numbers of participants in FL scenarios.
  3. Empirical Evaluation: Through experiments on benchmark datasets like MNIST and CIFAR10, the federated SV demonstrates effectiveness in tasks such as detecting noisy labels, identifying adversarial participants, and facilitating data summarization within federated models. The evaluation underscores federated SV's capability to faithfully represent data utility in FL tasks.
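The permutation-sampling estimator mentioned in item 2 can be sketched generically. The snippet below is a minimal Monte Carlo approximation of Shapley values for an arbitrary black-box `utility` callable; the callable and the toy additive game in the usage note are illustrative assumptions, not the paper's FL utility (which would evaluate model performance round by round).

```python
import random

def permutation_shapley(players, utility, num_samples=200, seed=0):
    """Monte Carlo estimate of Shapley values via permutation sampling.

    players:  list of player identifiers.
    utility:  callable mapping a list of players (a coalition) to a score.
    Returns a dict mapping each player to its estimated Shapley value.
    """
    rng = random.Random(seed)
    estimates = {p: 0.0 for p in players}
    for _ in range(num_samples):
        # Draw a random ordering of players and accumulate each
        # player's marginal contribution in that ordering.
        perm = players[:]
        rng.shuffle(perm)
        coalition = []
        prev_utility = utility(coalition)
        for p in perm:
            coalition.append(p)
            curr_utility = utility(coalition)
            estimates[p] += curr_utility - prev_utility
            prev_utility = curr_utility
    return {p: total / num_samples for p, total in estimates.items()}
```

For an additive utility (each player contributes a fixed amount regardless of coalition), every marginal contribution equals that player's amount, so the estimate recovers the contributions exactly; this makes a convenient sanity check.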

Implications and Future Directions

The practical implications of this research are profound. By implementing fair and accurate data valuation measures, it incentivizes continued participation from data owners, ensuring robust and efficient federated systems. The federated SV could potentially enhance FL security and efficiency by identifying and addressing low-quality or malicious data contributions.
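As a concrete illustration of the robustness use case, low-valued sources can be flagged for inspection. This is a hypothetical post-processing helper, not a procedure specified by the paper: given a mapping from participant IDs to estimated data values, it returns the lowest-valued fraction as candidates for noisy-label or adversarial-data review.

```python
def flag_low_value_participants(values, frac=0.1):
    """Return the lowest-valued participants (hypothetical helper).

    values: dict mapping participant id -> estimated data value.
    frac:   fraction of participants to flag (at least one is flagged).
    """
    k = max(1, int(len(values) * frac))
    # Sort participant ids by ascending value; lowest values come first.
    return sorted(values, key=values.get)[:k]
```

Flagged participants would then be inspected or down-weighted rather than excluded automatically, since a low value can also reflect redundant (not malicious) data.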

Looking forward, this research opens avenues for exploring nuanced data valuation approaches that consider various data modalities and constraints encountered in real-world FL applications. There could be further development in approximation techniques to enhance scalability and efficiency, particularly where participant numbers are considerably large or where data heterogeneity significantly impacts model performance.

Overall, the proposed federated Shapley value not only innovates within the field of data valuation but also aligns federated learning strategies with equitable data contribution appraisal, promising a new standard in collaborative and decentralized model training approaches.