Code Membership Inference for Detecting Unauthorized Data Use in Code Pre-trained Language Models
Abstract: Code pre-trained language models (CPLMs) have attracted considerable attention because they benefit a wide range of tasks that facilitate software development and maintenance. However, CPLMs are trained on massive amounts of open-source code, raising concerns about potential data infringement. This paper initiates the study of detecting unauthorized code use in CPLMs, i.e., the Code Membership Inference (CMI) task. We design a framework, Buzzer, for different settings of CMI. Buzzer deploys several inference techniques, including signal extraction from pre-training tasks, hard-to-learn sample calibration, and weighted inference, to identify code membership status accurately. Extensive experiments show that CMI can be achieved with high accuracy using Buzzer. Buzzer can therefore serve as a CMI tool and help protect intellectual property rights.
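To make the three ingredients of the abstract concrete, below is a minimal Python sketch of a generic calibrated membership-inference score, not the paper's exact Buzzer pipeline. It assumes per-sample losses have already been extracted from the target CPLM's pre-training tasks and from a reference model; the function names, the subtraction-based difficulty calibration (in the style of standard MIA calibration), and the fixed weight vector are all illustrative assumptions.

```python
import numpy as np

def calibrated_membership_signal(target_losses, reference_losses):
    """Difficulty calibration (illustrative): a sample that the target
    model fits much better than a reference model is more likely to be
    a training member, so higher (reference - target) suggests membership.
    Each argument is an array of per-sample, per-signal losses."""
    return np.asarray(reference_losses) - np.asarray(target_losses)

def weighted_inference(signal_matrix, weights, threshold=0.3):
    """Combine several calibrated signals (one column per pre-training
    task signal) with a weighted sum, then threshold into member /
    non-member. The weights and threshold here are hypothetical."""
    scores = np.asarray(signal_matrix) @ np.asarray(weights)
    return scores > threshold, scores

# Hypothetical per-sample losses: two signals for three code snippets.
target = np.array([[0.2, 0.5], [1.1, 1.4], [0.3, 0.6]])     # target CPLM
reference = np.array([[0.9, 1.0], [1.2, 1.5], [1.0, 1.1]])  # reference model

signals = calibrated_membership_signal(target, reference)
is_member, scores = weighted_inference(signals, weights=np.array([0.7, 0.3]))
print(is_member, scores)  # e.g. [ True False  True] [0.64 0.1  0.64]
```

In this toy run, the first and third snippets score well above the threshold after calibration and are flagged as likely members, while the second is not; in practice the weights would be learned and the signals would come from actual pre-training objectives such as masked language modeling loss.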