Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities (2410.24015v1)
Abstract: Synthetic data generation is increasingly popular across computer vision applications. Existing state-of-the-art face recognition models are trained on large-scale face datasets, which are crawled from the Internet and raise privacy and ethical concerns. To address these concerns, several works have proposed generating synthetic face datasets to train face recognition models. However, these methods depend on generative models, which are themselves trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study whether any existing synthetic face recognition dataset leaks information from the real data used to train the generator model. We provide an extensive study of six state-of-the-art synthetic face recognition datasets and show that, in all of them, several samples from the original real dataset are leaked. To our knowledge, this paper is the first work to show leakage from the training data of generator models into the generated synthetic face recognition datasets. Our study demonstrates privacy pitfalls in synthetic face recognition datasets and paves the way for future studies on generating responsible synthetic face datasets.
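The abstract does not spell out how the membership inference attack works, so the following is only a minimal sketch of one plausible form such a check could take, not the authors' implementation. It assumes that face embeddings for the synthetic images and for the generator's real training images have already been extracted with some face recognition model, and it ranks synthetic/real pairs by cosine similarity so that the closest pairs can be inspected as candidate leaks. The function names `l2_normalize` and `top_k_similar_pairs` and the parameter `k` are hypothetical, introduced here for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_similar_pairs(synthetic_emb, real_emb, k=5):
    """Return the k most similar (synthetic_idx, real_idx, cosine) triples.

    Assumption: both inputs are 2-D arrays of precomputed face embeddings,
    one row per image, with the same embedding dimension.
    """
    s = l2_normalize(np.asarray(synthetic_emb, dtype=np.float64))
    r = l2_normalize(np.asarray(real_emb, dtype=np.float64))

    # Full similarity matrix: rows are synthetic images, columns are real ones.
    sim = s @ r.T

    # Sort all entries in descending order and keep the k largest.
    k = min(k, sim.size)
    flat = np.argsort(sim, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, sim.shape)
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]


if __name__ == "__main__":
    # Toy data standing in for real embeddings (purely illustrative).
    rng = np.random.default_rng(0)
    real = rng.normal(size=(50, 64))
    synthetic = rng.normal(size=(20, 64))
    # Simulate one leaked sample: synthetic image 7 nearly copies real image 3.
    synthetic[7] = real[3] + 0.01 * rng.normal(size=64)

    for syn_idx, real_idx, score in top_k_similar_pairs(synthetic, real, k=3):
        print(syn_idx, real_idx, round(score, 4))
```

The full similarity matrix grows with the product of the two dataset sizes, so for datasets of realistic scale this comparison would need to be computed in chunks or with an approximate nearest-neighbour index. A high similarity score alone does not prove leakage; the top-ranked pairs are candidates for closer inspection.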