Privacy Preserving Vertical Federated Learning for Tree-based Models (2008.06170v1)

Published 14 Aug 2020 in cs.CR and cs.LG

Abstract: Federated learning (FL) is an emerging paradigm that enables multiple organizations to jointly train a model without revealing their private data to each other. This paper studies *vertical* federated learning, which tackles the scenarios where (i) collaborating organizations own data of the same set of users but with disjoint features, and (ii) only one organization holds the labels. We propose Pivot, a novel solution for privacy preserving vertical decision tree training and prediction, ensuring that no intermediate information is disclosed other than those the clients have agreed to release (i.e., the final tree model and the prediction output). Pivot does not rely on any trusted third party and provides protection against a semi-honest adversary that may compromise $m-1$ out of $m$ clients. We further identify two privacy leakages when the trained decision tree model is released in plaintext and propose an enhanced protocol to mitigate them. The proposed solution can also be extended to tree ensemble models, e.g., random forest (RF) and gradient boosting decision tree (GBDT) by treating single decision trees as building blocks. Theoretical and experimental analysis suggest that Pivot is efficient for the privacy achieved.

Citations (193)

Summary

  • The paper introduces Pivot, a protocol that combines TPHE with MPC to address privacy challenges in vertical federated learning for tree-based models.
  • The paper demonstrates that its method attains predictive accuracy on par with non-private models while reducing communication overhead through parallel threshold decryption.
  • The paper outlines protocol variants that balance model transparency with enhanced privacy, offering practical insights for regulated sectors like finance and healthcare.

Privacy Preserving Vertical Federated Learning for Tree-based Models

The paper "Privacy Preserving Vertical Federated Learning for Tree-based Models" addresses the challenges of ensuring data privacy in the context of vertical federated learning (VFL) where collaborating entities possess distinct sets of features for the same user group, yet only one organization owns the labeling information. It introduces Pivot, a protocol designed to facilitate privacy-preserving training and predictions using decision trees and tree ensemble methods such as random forests (RF) and gradient boosting decision trees (GBDT), without the need for a trusted third party.

Overview and Methodology

Pivot stands out by preserving privacy under a semi-honest adversary model in which any $m-1$ out of $m$ participants may collude. The protocol ensures that only the final tree model and the prediction output are disclosed, with no exposure of intermediate computations during training or inference. The core strategy combines threshold partially homomorphic encryption (TPHE) with secure multiparty computation (MPC): TPHE handles local computations, significantly reducing communication overhead by avoiding the transfer of large volumes of data, while MPC is reserved for operations that are infeasible under TPHE alone, such as the comparisons at the heart of decision tree training.
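To make this division of labor concrete, the following minimal sketch (not the paper's implementation) uses standard, non-threshold Paillier encryption from the python-paillier (`phe`) library as a stand-in for TPHE. The label holder encrypts per-sample class indicator vectors; a feature-holding client then accumulates encrypted class counts for a candidate split entirely locally, without decrypting anything. In Pivot, such encrypted statistics would feed an MPC comparison to select the best split, which is omitted here.

```python
# Minimal sketch of the homomorphic-aggregation idea behind Pivot, using
# standard (non-threshold) Paillier from python-paillier ("phe") as a
# stand-in for TPHE. Names and data are illustrative, not the paper's API.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Label holder: encrypt a one-hot indicator vector per sample (2 classes here).
labels = [0, 1, 1, 0, 1]
enc_indicators = [
    [public_key.encrypt(1 if y == c else 0) for c in (0, 1)] for y in labels
]

# Feature holder: for a candidate split "feature < threshold", sum the
# encrypted indicators of samples falling into the left branch. Ciphertext
# addition happens locally, with no decryption and no label exposure.
feature = [3.2, 1.5, 4.8, 2.1, 0.9]
threshold = 2.5
left_counts = [public_key.encrypt(0), public_key.encrypt(0)]
for x, ind in zip(feature, enc_indicators):
    if x < threshold:
        left_counts = [acc + e for acc, e in zip(left_counts, ind)]

# In Pivot these encrypted statistics would enter an MPC comparison to pick
# the best split; here we decrypt only to show what was accumulated.
print([private_key.decrypt(c) for c in left_counts])  # [1, 2]
```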

The authors delineate two main protocols: a basic protocol, which releases the entire decision tree in plaintext after training for operational transparency, and an enhanced protocol, which conceals certain model-specific details to prevent the reconstruction of sensitive information from the trained model. The latter mitigates the two identified leakages, training label leakage and feature value disclosure, thereby hardening the released model against inference attacks.
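As a rough illustration of the difference, the sketch below (a hypothetical structure, not the paper's data format) contrasts what a released tree node might contain under each variant: the basic protocol exposes split thresholds and leaf labels in plaintext, while the enhanced protocol keeps them encrypted or secret-shared among the clients.

```python
# Hypothetical node structure contrasting the two release modes; this is an
# illustration of the privacy difference, not the paper's actual format.
from dataclasses import dataclass
from typing import Union

Ciphertext = bytes  # placeholder for a TPHE ciphertext / secret share

@dataclass
class ReleasedNode:
    feature_id: int                      # split feature owner (public in both)
    threshold: Union[float, Ciphertext]  # basic: plaintext; enhanced: hidden
    label: Union[int, Ciphertext, None]  # leaf label; hidden in enhanced mode

# Basic protocol: the whole tree is readable by every client.
basic = ReleasedNode(feature_id=3, threshold=2.5, label=None)

# Enhanced protocol: thresholds and labels stay encrypted, so one client
# cannot infer another client's feature values or the holder's labels.
enhanced = ReleasedNode(feature_id=3, threshold=b"\x8f...", label=None)
```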

Experimental and Theoretical Evaluation

Experimental evaluations demonstrated that Pivot attains predictive accuracy comparable to non-private baseline models (NP-DT, NP-RF, NP-GBDT). The efficiency analysis showed that the basic protocol achieves speedups over purely MPC-based solutions, with particular gains from parallelizing tasks such as threshold decryption. The enhanced protocol, in turn, trades some computational efficiency for stronger privacy protection relative to the basic variant. Since the parallel improvements apply to both training and prediction, the approach is practical for scalable deployment in federated learning environments.
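Because threshold decryption applies independently to each ciphertext, it parallelizes naturally across workers, which is the kind of parallelism behind the reported gains. The following minimal sketch is illustrative only: plain Paillier decryption stands in for computing one party's threshold decryption share, and the per-ciphertext work is spread over a process pool.

```python
# Illustrative per-ciphertext parallelism; plain Paillier decryption stands
# in for computing a threshold decryption share (python-paillier has no
# threshold variant).
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from phe import paillier

def decrypt_one(private_key, ct):
    # Stand-in for one party's share computation on a single ciphertext.
    return private_key.decrypt(ct)

if __name__ == "__main__":
    public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)
    ciphertexts = [public_key.encrypt(i) for i in range(32)]
    with ProcessPoolExecutor() as pool:
        plaintexts = list(pool.map(partial(decrypt_one, private_key), ciphertexts))
    assert plaintexts == list(range(32))
```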

Implications and Future Work

The theoretical significance of this paper lies in its demonstration that privacy-preserving computation can be effectively integrated into tree-based learning algorithms in a federated setting without sacrificing efficiency. Its contributions may influence future designs of privacy-focused learning frameworks, not only for tree models but potentially for other machine learning paradigms that operate over vertically partitioned data.

Practically, the proposed method is relevant to industries such as finance and healthcare, where feature-rich but sensitive datasets are common and data privacy regulations (e.g., GDPR) are stringent. The paper's discussed extensions, incorporating differential privacy and handling a malicious adversary model, further strengthen its applicability.

In future work, integrating more advanced cryptosystems and further reducing computational overhead could refine the balance between privacy, efficiency, and usability. Additionally, expanding support to more complex models and exploring interoperability with other privacy approaches, such as homomorphic encryption or trusted execution environments, could build on the advances made in this paper.

Overall, the work of Wu et al. stands as a meaningful milestone in federated learning, demonstrating that privacy preservation need not come at the cost of model quality or prohibitive computational demands.