From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying (2310.04145v2)
Abstract: Safeguarding the Intellectual Property (IP) of data has become critically important as machine learning applications proliferate, and their success relies heavily on the quality of training data. While various mechanisms exist to secure data during storage, transmission, and consumption, far fewer address detecting whether data have already been leaked and used for model training without authorization. This problem is particularly challenging because the data owner has no information about, or control over, the training process conducted by potential attackers. In this paper, we concentrate on the domain of tabular data and introduce a novel methodology, Local Distribution Shifting Synthesis (\textsc{LDSS}), to detect leaked data that are used to train classification models. The core idea behind \textsc{LDSS} is to inject a small volume of synthetic data--characterized by local shifts in class distribution--into the owner's dataset. This enables the effective identification of models trained on leaked data through model querying alone, as the injected synthetic data produce a pronounced disparity between the predictions of models trained on leaked versus unmodified datasets. \textsc{LDSS} is \emph{model-oblivious} and hence compatible with a diverse range of classification models. We have conducted extensive experiments on seven types of classification models across five real-world datasets. The comprehensive results affirm the reliability, robustness, fidelity, security, and efficiency of \textsc{LDSS}. Extending \textsc{LDSS} to regression tasks further highlights its versatility and efficacy compared with baseline methods.
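The detection principle described above can be illustrated with a minimal sketch. This is not the authors' LDSS algorithm (which the paper develops for real tabular data); it is a toy scenario under assumed settings: two Gaussian blobs stand in for the owner's dataset, a handful of synthetic points placed in a locally sparse region carry a shifted class label, and a plain k-NN classifier stands in for an arbitrary suspect model queried in a black-box fashion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Owner's dataset: two Gaussian blobs (class 0 near the origin, class 1 shifted).
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),   # class 0
               rng.normal([4, 0], 1.0, (200, 2))])  # class 1
y = np.array([0] * 200 + [1] * 200)

# Synthetic injection: a few points in a locally sparse region, labeled with the
# class that is rare there -- a local shift in the class distribution.
X_syn = rng.normal([0, 4], 0.3, (20, 2))
y_syn = np.ones(20, dtype=int)
X_mod = np.vstack([X, X_syn])
y_mod = np.concatenate([y, y_syn])

def knn_predict(X_train, y_train, X_query, k=5):
    """Plain k-NN classifier standing in for an arbitrary suspect model."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

# Detection via model querying alone: a model trained on the leaked (modified)
# data agrees with the injected labels; a model trained on clean data does not.
rate_leaked = (knn_predict(X_mod, y_mod, X_syn) == 1).mean()
rate_clean = (knn_predict(X, y, X_syn) == 1).mean()
print(f"agreement with injected labels: leaked={rate_leaked:.2f} clean={rate_clean:.2f}")
```

Because the synthetic points sit where the owner's real data are sparse, the model trained on the modified dataset reproduces the shifted labels at query time, while an independently trained model predicts the locally dominant original class, yielding the pronounced prediction disparity the abstract describes.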