Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models

Published 12 Dec 2023 in cs.SE (arXiv:2312.07200v2)

Abstract: Code pre-trained language models (CPLMs) have received great attention since they can benefit various tasks that facilitate software development and maintenance. However, CPLMs are trained on massive open-source code, raising concerns about potential data infringement. This paper launches the study of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference (CMI) task. We design a framework, Buzzer, for different CMI settings. Buzzer deploys several inference techniques, including signal extraction from pre-training tasks, hard-to-learn sample calibration, and weighted inference, to identify code membership status accurately. Extensive experiments show that CMI can be achieved with high accuracy using Buzzer. Buzzer can therefore serve as a CMI tool and help protect intellectual property rights.
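
The abstract names Buzzer's two main ingredients: a membership signal read off the pre-training objective (e.g., masked-language-modeling loss on a candidate snippet) and a calibration step for hard-to-learn samples, which compares the target model against a reference model so that intrinsically easy code is not mistaken for training data. Below is a minimal Python sketch of that loss-plus-calibration idea; the model names (microsoft/codebert-base-mlm as the target CPLM, roberta-base as the reference), the masking rate, and the decision threshold are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of loss-based code membership inference with difficulty
# calibration. Model names, masking rate, and the decision threshold are
# illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def mlm_loss(model, tokenizer, code: str, mask_prob: float = 0.15) -> float:
    """Average masked-language-modeling loss of `code` under `model`."""
    enc = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    input_ids = enc["input_ids"].clone()
    labels = input_ids.clone()
    # Randomly mask a fraction of non-special tokens, mimicking MLM pre-training.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    )
    mask = (torch.rand(input_ids.shape[1]) < mask_prob) & ~special
    if not mask.any():  # guarantee at least one masked position
        mask[(~special).nonzero()[0]] = True
    input_ids[0, mask] = tokenizer.mask_token_id
    labels[0, ~mask] = -100  # loss is computed only on masked positions
    with torch.no_grad():
        out = model(input_ids=input_ids,
                    attention_mask=enc["attention_mask"],
                    labels=labels)
    return out.loss.item()

torch.manual_seed(0)
# CodeBERT shares RoBERTa's tokenizer, so one tokenizer serves both models.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
target = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm").eval()
reference = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

snippet = "def add(a, b):\n    return a + b\n"
# Difficulty calibration: an intrinsically easy sample has low loss under any
# model, so membership is suggested only when the target model's loss is low
# *relative* to the reference model's loss.
score = mlm_loss(reference, tok, snippet) - mlm_loss(target, tok, snippet)
print("member" if score > 0.5 else "non-member")  # threshold is illustrative

A real attack would average the loss over many random maskings and calibrate the threshold on samples with known membership status; this sketch takes a single masking draw for brevity.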
