VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks (2307.02040v3)

Published 5 Jul 2023 in cs.LG and cs.AI

Abstract: Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.
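
The two split factors named in the abstract, feature importance and feature correlation, can be made concrete with a small self-contained sketch. The code below is illustrative only and assumes simple proxies: a random vertical (column-wise) split, summed random-forest importances per party, and mean absolute Spearman correlation between the parties' features. These are not the paper's actual metrics or splitting algorithms, and all function names are hypothetical.

```python
# Hypothetical sketch of a synthetic vertical split and two party-level measures
# (importance and correlation). Proxies chosen for illustration, not VertiBench's
# actual definitions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def random_vertical_split(X, num_parties=2, seed=0):
    """Randomly partition the columns of X among num_parties parties."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[1]), num_parties)
    return [np.sort(p) for p in parts]


def party_importance(X, y, party_cols):
    """Proxy for per-party importance: sum of a global model's feature
    importances over the columns each party holds."""
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return [float(model.feature_importances_[cols].sum()) for cols in party_cols]


def inter_party_correlation(X, cols_a, cols_b):
    """Proxy for inter-party correlation: mean absolute Spearman correlation
    between party A's and party B's features."""
    rho, _ = spearmanr(X[:, cols_a], X[:, cols_b])
    cross_block = np.asarray(rho)[:len(cols_a), len(cols_a):]
    return float(np.abs(cross_block).mean())


if __name__ == "__main__":
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    party_cols = random_vertical_split(X, num_parties=2)
    print("party importance:", party_importance(X, y, party_cols))
    print("inter-party |Spearman|:", inter_party_correlation(X, *party_cols))
```

Varying how columns are assigned (rather than assigning them uniformly at random) is what lets a benchmark sweep across different importance and correlation regimes; the sketch only fixes the measurement side.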
