Differentially Private Synthetic Data via Foundation Model APIs 1: Images (2305.15560v3)

Published 24 May 2023 in cs.CV, cs.CR, and cs.LG

Abstract: Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID ≤ 7.9 with privacy cost ε = 0.67, significantly improving the previous SOTA from ε = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. The code and data are released at https://github.com/microsoft/DPSDA.

Overview of: Differentially Private Synthetic Data via Foundation Model APIs 1: Images

The paper introduces Private Evolution (PE), an efficient framework for generating differentially private (DP) synthetic data using blackbox foundation model APIs, with a focus on images. PE addresses privacy concerns by leveraging foundation models for synthetic data generation without requiring access to model weights or training procedures. The task is challenging due to restrictive model access and the need to protect privacy from the API provider. Nonetheless, PE generates synthetic data with competitive or even superior privacy-utility trade-offs compared to state-of-the-art (SOTA) training-based methods.

Framework and Methodology

The proposed PE framework iteratively improves synthetic data generation using evolutionary algorithms. The framework includes several key steps:

  1. Initial Population Generation: Random samples are generated using APIs such as DALL·E 2 or Stable Diffusion, without training the models, to form an initial population.
  2. Population Evolution: In each iteration, a fitness function, known as the DP Nearest Neighbors Histogram, evaluates sample similarity to private data. This metric is computed in an embedding space using techniques such as inception embeddings from pre-trained networks.
  3. Privacy Preservation: Gaussian noise addition ensures differential privacy in the histogram, with an algorithm-wide sensitivity analysis guaranteeing privacy protection.
  4. Offspring Generation: Samples are selected and modified using APIs to form new generations, iteratively concentrating samples within the private data distribution.
  5. Conditional Generation: PE is extended to handle labeled datasets by running the algorithm per class, effectively supporting conditional image generation.
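The steps above can be condensed into a minimal, runnable sketch of the PE loop. Here `random_api` and `variation_api` are hypothetical stand-ins for the foundation-model inference APIs (unconditional generation and sample variation), and the 2-D toy data, iteration count, and noise scale are illustrative assumptions, not the paper's actual settings or implementation.

```python
# Toy sketch of the Private Evolution (PE) loop over 2-D points.
import numpy as np

rng = np.random.default_rng(0)

def random_api(n, dim=2):
    # Step 1: initial population from an unconditional generation API.
    return rng.normal(size=(n, dim))

def variation_api(samples, scale=0.3):
    # Step 4: a "variation" API that perturbs selected parent samples.
    return samples + rng.normal(scale=scale, size=samples.shape)

def dp_nn_histogram(private, synthetic, sigma):
    # Step 2: each private point votes for its nearest synthetic sample.
    # Step 3: Gaussian noise on the vote histogram; each private record
    # changes exactly one count, which bounds the histogram's sensitivity.
    d = ((private[:, None, :] - synthetic[None, :, :]) ** 2).sum(-1)
    votes = np.bincount(d.argmin(axis=1), minlength=len(synthetic))
    return votes + rng.normal(scale=sigma, size=votes.shape)

def private_evolution(private, n_syn=50, iters=10, sigma=1.0):
    population = random_api(n_syn)
    for _ in range(iters):
        hist = np.clip(dp_nn_histogram(private, population, sigma), 0, None)
        if hist.sum() == 0:
            probs = np.full(len(population), 1.0 / len(population))
        else:
            probs = hist / hist.sum()
        # Resample parents in proportion to their noisy votes, then vary them,
        # concentrating the population within the private distribution.
        parents = population[rng.choice(len(population), n_syn, p=probs)]
        population = variation_api(parents)
    return population

private = rng.normal(loc=3.0, size=(200, 2))  # toy "private" dataset
synthetic = private_evolution(private)
print(np.round(synthetic.mean(axis=0), 1))  # mean should drift toward the private data
```

For conditional generation (step 5), this loop would simply be run once per class label on that class's private subset.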

Empirical Results

The experimental results highlight PE's competence across several datasets, with remarkable findings:

  • CIFAR10: PE achieves a Fréchet Inception Distance (FID) ≤ 7.9 at a privacy cost of ε = 0.67, improving on prior approaches such as DP-Diffusion in both privacy cost and FID.
  • Camelyon17: Despite the significant distribution gap from natural image datasets, PE achieves meaningful generation performance, maintaining an 80% classification accuracy, albeit trailing slightly behind the training-based methodologies.
  • Stable Diffusion Benchmarks: Preliminary experiments with high-resolution private datasets using Stable Diffusion validate PE's capacity to effectively utilize advanced foundation models for DP synthetic image generation.
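For context on the ε values above: PE's guarantee comes from Gaussian noise added to the nearest-neighbors histogram. As a rough illustration only — this is the classical single-release Gaussian-mechanism bound, not the paper's tighter multi-iteration accounting — and with an assumed, illustrative δ = 10⁻⁵, a sufficient noise scale for a sensitivity-1 query is:

```python
import math

def gaussian_sigma(eps: float, delta: float, sensitivity: float = 1.0) -> float:
    # Classical sufficient condition for (eps, delta)-DP with the Gaussian
    # mechanism (valid for eps <= 1): sigma >= sqrt(2 ln(1.25/delta)) * S / eps.
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / eps

# Noise scale at the reported CIFAR10 budget eps = 0.67 (delta assumed here).
print(round(gaussian_sigma(0.67, 1e-5), 2))
```

Because PE releases a noisy histogram at every iteration, the actual analysis must compose these releases, which is why tighter DP composition accounting is used in practice rather than this single-shot bound.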

Critical Insights and Directions

The paper highlights the potential of leveraging foundation model APIs for synthetic data generation, marking a departure from traditional model-training methodologies. The proposed PE framework could significantly democratize DP synthetic data deployment by reducing implementation complexity and resource requirements. However, the framework's success hinges on a suitable public data distribution, particularly when the private data deviates dramatically from the available public data. Future work could explore PE's application to additional APIs, investigate refinements that improve convergence rates, and extend the approach to datasets of varying modalities and complexities.

Moreover, the investigation raises interesting theoretical implications about DP clustering and its relation to the intrinsic dimension of data in high-dimensional privacy-preserving tasks. Lastly, the authors point out the potential for PE to be incorporated into broader privacy frameworks across institutions, providing a pragmatic pathway to safer data-sharing protocols.

In summary, the paper presents Private Evolution as a compelling strategy for extending the capabilities of foundation models into the area of differentially private synthetic data generation, demonstrating both theoretical novelty and empirical strength. The framework opens up new avenues for AI research and applications, especially in sectors where data privacy is paramount.

Authors (5)
  1. Zinan Lin
  2. Sivakanth Gopi
  3. Janardhan Kulkarni
  4. Harsha Nori
  5. Sergey Yekhanin