
Watermarking Generative Tabular Data (2405.14018v1)

Published 22 May 2024 in cs.CR, stat.AP, and cs.LG

Abstract: In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity, and that it also demonstrates appealing robustness against additive noise attacks. The general idea is to achieve watermarking through a strategic embedding based on simple data binning. Specifically, it divides the feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
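The binning idea described in the abstract can be sketched in code. In the sketch below, consecutive intervals are paired, a keyed pseudo-random draw marks one member of each pair as "green", embedding shifts each value into the green member of its pair, and detection runs a one-sided z-test on the green-interval count. The pairing rule, key, bin count, and all function names here are illustrative assumptions, not necessarily the paper's exact construction.

```python
import math
import numpy as np

def _green_mask(n_bins, key):
    """Keyed pseudo-random green-list selection: entry j is 1 if the
    even-indexed interval of pair j is green, else 0. (Illustrative
    choice of key and pairing, not the paper's exact rule.)"""
    rng = np.random.default_rng(key)
    return rng.integers(0, 2, size=n_bins // 2)

def embed_watermark(x, lo, hi, n_bins=1000, key=0):
    """Move each value into the green interval of its pair.

    A value landing in the red member of pair (2j, 2j+1) is shifted to
    the same relative position inside the green member, so the
    perturbation is at most one bin width (hi - lo) / n_bins."""
    green_even = _green_mask(n_bins, key)
    width = (hi - lo) / n_bins
    idx = np.clip(((x - lo) // width).astype(int), 0, n_bins - 1)
    pair = idx // 2
    offset = x - (lo + idx * width)                # position within the bin
    green_idx = 2 * pair + (1 - green_even[pair])  # index of the green member
    return lo + green_idx * width + offset

def detect_watermark(x, lo, hi, n_bins=1000, key=0):
    """One-sided z-test on the green-interval count.

    Under the null (no watermark, continuous density, fine bins), each
    value falls in a green interval with probability ~1/2. Returns the
    z-statistic and an approximate one-sided p-value."""
    green_even = _green_mask(n_bins, key)
    width = (hi - lo) / n_bins
    idx = np.clip(((x - lo) // width).astype(int), 0, n_bins - 1)
    in_green = (idx % 2) != green_even[idx // 2]
    n, count = len(x), int(in_green.sum())
    z = (count - n / 2) / math.sqrt(n / 4)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # normal-tail approximation
    return z, p
```

On watermarked data nearly every value lies in a green interval, so the z-statistic grows like the square root of the sample size, while unwatermarked data from any continuous density yields a green fraction near 1/2 and a non-significant test.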
