Learning Defect Prediction from Unrealistic Data (2311.00931v2)
Abstract: Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to perform well only on similar data, while underperforming on real-world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets, based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples by their distance to the nearest real-world sample. We show that training on only the representationally most similar samples, while discarding those whose representations resemble no real-world example, yields consistent improvements across two popular pretrained models of code on two code understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models to predict vulnerabilities and bugs in real-world applications.
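To make the filtering step concrete, below is a minimal sketch, not the authors' released code. It assumes `microsoft/codebert-base` as the encoder, mean-pooled last-layer hidden states as the embedding, Euclidean nearest-neighbor distance via scikit-learn, and a user-chosen `keep_fraction` cutoff; the paper's actual encoder, distance metric, and threshold may differ.

```python
# Sketch: filter a synthetic bug dataset by representational similarity
# to real-world examples. Assumed choices (not from the paper):
# CodeBERT encoder, mean pooling, Euclidean distance, keep_fraction cutoff.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import NearestNeighbors

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

def embed(functions: list[str]) -> np.ndarray:
    """Mean-pool the encoder's last hidden states into one vector per function."""
    vectors = []
    with torch.no_grad():
        for src in functions:
            inputs = tokenizer(src, truncation=True, max_length=512,
                               return_tensors="pt")
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
            vectors.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(vectors)

def filter_synthetic(real: list[str], synthetic: list[str],
                     keep_fraction: float = 0.5) -> list[str]:
    """Score each synthetic sample by its distance to the nearest real-world
    sample in embedding space; keep only the closest keep_fraction."""
    index = NearestNeighbors(n_neighbors=1).fit(embed(real))
    distances, _ = index.kneighbors(embed(synthetic))
    order = np.argsort(distances.ravel())  # closest to real-world data first
    keep = order[: int(keep_fraction * len(synthetic))]
    return [synthetic[i] for i in keep]
```

For large synthetic corpora, an approximate similarity index (e.g., FAISS) could stand in for the exact scikit-learn search; the filtering logic itself is unchanged.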
Authors: Kamel Alrashedy, Vincent J. Hellendoorn, Alessandro Orso