- The paper introduces CAFE, a novel attack that can infer private data from shared gradients in Vertical Federated Learning (VFL), demonstrating that large batch sizes do not prevent leakage.
- CAFE leverages data index alignment and internal representation alignment, backed by theoretical guarantees, to recover private data even from large batches, outperforming existing attack methods.
- The research highlights significant VFL vulnerabilities and proposes a practical defense using fake gradients that preserves model accuracy while mitigating attacks.
An Analytical Overview of "CAFE: Catastrophic Data Leakage in Vertical Federated Learning"
The paper "CAFE: Catastrophic Data Leakage in Vertical Federated Learning" addresses the critical issue of data privacy in Vertical Federated Learning (VFL) systems. VFL, as opposed to Horizontal Federated Learning (HFL), involves multiple data custodians that collaboratively train machine learning models on vertically partitioned datasets, which have the same set of subjects but different features. This setup is commonly found in sectors like finance and healthcare where institutions aim to enhance predictive models by leveraging diverse feature sets, without exposing their raw data.
Problem Context and Key Contributions
Recent studies have exposed vulnerabilities in the federated learning paradigm, notably scenarios in which shared gradients leak private training data. Increasing the batch size has traditionally been regarded as an effective mitigation, on the assumption that aggregation over many samples obscures individual records. The paper challenges this assumption with a novel data leakage attack named CAFE (Catastrophic Data Leakage in Vertical Federated Learning). The authors provide strong evidence that CAFE can effectively and efficiently recover private data from shared aggregated gradients, even for large batches, invalidating the belief that batch size alone provides protection.
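To ground the threat model, the following is a minimal gradient-matching sketch in the spirit of the earlier "deep leakage from gradients" attacks that CAFE builds on; the toy linear model, known labels, and hyperparameters are simplifying assumptions, not the paper's setup:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(20, 2)   # toy stand-in for a shared model

# The victim's private batch and the aggregated gradients an attacker observes.
x_true = torch.randn(4, 20)
y_true = torch.randint(0, 2, (4,))
observed = torch.autograd.grad(F.cross_entropy(model(x_true), y_true),
                               model.parameters())

# The attacker optimizes dummy inputs so their gradients match the observed ones.
x_dummy = torch.randn(4, 20, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.05)
for _ in range(1000):
    opt.zero_grad()
    g = torch.autograd.grad(F.cross_entropy(model(x_dummy), y_true),
                            model.parameters(), create_graph=True)
    sum(((a - b) ** 2).sum() for a, b in zip(g, observed)).backward()
    opt.step()

print("mean reconstruction error:", (x_dummy - x_true).abs().mean().item())
```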
The contributions of the paper are significant and can be summarized as follows:
- Advanced Attack Mechanism: CAFE introduces a robust algorithm that exploits data index alignment and internal representation alignment in VFL, enabling the leakage of large batches of data. The recovery is organized into three steps that first reconstruct the internal data representations, leveraging the known batch data indices for alignment, before recovering the original inputs (a loose sketch follows this list).
- Theoretical Foundation: The authors provide rigorous theoretical guarantees for the attack's performance, showing that under specific conditions the recovery subproblems have strongly convex optimization landscapes, which ensures convergence to the correct gradients and data representations (the standard strong-convexity condition is recalled after this list).
- Mitigation Strategies: Recognizing the implications of their findings, the authors propose a defense mechanism based on fake gradients. The manipulated gradients preserve model accuracy while misleading the attack's optimization, offering a practical countermeasure to CAFE (a hypothetical instantiation is sketched after this list).
- Comprehensive Evaluation: Extensive empirical results validate the efficacy of CAFE, demonstrating not only its superiority over existing methods but also the robustness of the proposed defense in mitigating data leakage without degrading model training performance.
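As referenced in the first bullet, the following loose sketch illustrates the staged idea of recovering internal representations before recovering the data. It is not the paper's exact three-step algorithm: the linear sub-networks, known labels, and plain squared-error objectives are all simplifying assumptions (the paper additionally regularizes the data-recovery step).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder = torch.nn.Linear(20, 16)   # a party's local sub-network (assumed linear here)
head = torch.nn.Linear(16, 2)       # top model producing the final prediction

# Gradients the attacker observes for one batch with known indices.
x_true = torch.randn(4, 20)
y_true = torch.randint(0, 2, (4,))
loss = F.cross_entropy(head(encoder(x_true)), y_true)
observed = torch.autograd.grad(loss, head.parameters())

# Stage 1: recover internal representations by matching the head's gradients
# (labels are treated as known here purely to keep the sketch short).
h_dummy = torch.randn(4, 16, requires_grad=True)
opt = torch.optim.Adam([h_dummy], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    g = torch.autograd.grad(F.cross_entropy(head(h_dummy), y_true),
                            head.parameters(), create_graph=True)
    sum(((a - b) ** 2).sum() for a, b in zip(g, observed)).backward()
    opt.step()

# Stage 2: recover inputs whose encodings match the recovered representations.
x_dummy = torch.randn(4, 20, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    ((encoder(x_dummy) - h_dummy.detach()) ** 2).mean().backward()
    opt.step()
```

Intuitively, splitting the inversion into stages keeps each subproblem better conditioned than direct input-space matching over a large batch, which is consistent with the paper's finding that batch size alone is not protective.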
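For reference on the second bullet, a differentiable objective $f$ is $\mu$-strongly convex when

```math
f(y) \;\ge\; f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\,\lVert y - x \rVert_2^2 \qquad \text{for all } x, y,\ \text{with } \mu > 0,
```

a property that guarantees a unique minimizer and linear convergence of gradient descent; this is the sense in which the paper's recovery subproblems are guaranteed to converge under its stated conditions.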
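Finally, for the third bullet, here is one hypothetical instantiation of the fake-gradient idea; the rank-preserving substitution below is an assumption chosen for illustration, not necessarily the construction used in the paper:

```python
import torch

def fake_gradient(true_grad: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Replace gradient values with fresh noise while preserving the true
    gradient's elementwise signs and magnitude ranking, so the released
    gradient is decoupled from the data but still a usable descent signal."""
    flat = true_grad.flatten()
    noise = torch.randn_like(flat).abs() * scale   # fresh fake magnitudes
    sorted_noise, _ = noise.sort()                 # ascending fake magnitudes
    ranks = flat.abs().argsort().argsort()         # rank of each true |g_i|
    fake = sorted_noise[ranks] * flat.sign()       # reimpose ordering and signs
    return fake.view_as(true_grad)

# Usage: share the fake gradient with other parties instead of the true one.
g = torch.randn(3, 4)
print(fake_gradient(g))
```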
Implications and Future Directions
This research carries significant theoretical and practical implications. Theoretically, the work exposes fundamental vulnerabilities in the VFL framework and calls into question the reliability of batch size as a sole defense mechanism. Practically, it underscores the urgent need for secure protocols against data leakage and challenges designers of federated systems to reconsider conventional security assumptions.
Future research could build on this work in several ways. First, extending the theoretical analysis to more complex network architectures and diverse data types would deepen the understanding of attack vectors in federated systems. Second, exploring the interplay between federated learning and privacy-preserving technologies such as differential privacy or homomorphic encryption could yield more holistic security frameworks.
The paper serves as a critical resource for researchers and practitioners aiming to fortify the security frameworks of VFL systems. As federated learning continues to evolve and gain traction across sensitive domains, ensuring its robustness against such data leakage threats will be paramount to its broader adoption and success.