GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding (2402.15769v2)

Published 24 Feb 2024 in cs.SE and cs.AI

Abstract: Pre-trained code models lead the era of code intelligence, with many models designed to deliver impressive performance. However, one important problem, data augmentation for code data, which automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce GenCode, a generic data augmentation framework that enhances the training of code understanding models. Briefly, GenCode follows a generation-and-selection paradigm to prepare useful training code data: it first employs code transformation techniques to generate new code candidates and then selects the important ones as training data according to importance metrics. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection) and three pre-trained code models (e.g., CodeT5). Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% greater robustness on average.
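
The sketch below illustrates the generation-and-selection paradigm described in the abstract: generate transformed candidates from each training sample, score them with an importance metric, and keep the best candidate for training. The specific transformations (variable renaming, dead-code insertion), the importance metric (the model's loss on the candidate), and the model.loss interface are illustrative assumptions, not the paper's exact design.

import random

def rename_variable(code: str) -> str:
    # Hypothetical semantic-preserving transformation: rename a token.
    return code.replace("tmp", f"var_{random.randint(0, 99)}")

def insert_dead_code(code: str) -> str:
    # Hypothetical transformation: append an unreachable statement.
    return code + "\nif False:\n    pass"

TRANSFORMS = [rename_variable, insert_dead_code]

def importance(model, code: str, label) -> float:
    # Assumed importance metric: the current model's loss on the
    # candidate, so harder examples are preferred for training.
    # model.loss is a stand-in interface, not a real library call.
    return model.loss(code, label)

def gencode_augment(model, dataset, n_candidates: int = 4):
    """Generate transformed candidates per sample, keep the most important one."""
    augmented = []
    for code, label in dataset:
        candidates = [random.choice(TRANSFORMS)(code)
                      for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: importance(model, c, label))
        augmented.append((best, label))
    return augmented

In this reading, selection is what distinguishes the framework from plain transformation-based augmentation: rather than training on every generated variant, only candidates the metric deems useful enter the training set.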

References (40)
  1. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4):1–37.
  2. Self-supervised bug detection and repair. In Advances in Neural Information Processing Systems.
  3. Pavol Bielik and Martin Vechev. 2020. Adversarial robustness for code. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 896–907. PMLR.
  4. Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, pages 511–521, New York, NY, USA. Association for Computing Machinery.
  5. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703.
  6. Detecting cryptography misuses with machine learning: Graph embeddings, transfer learning and data augmentation in source code related tasks. IEEE Transactions on Reliability.
  7. BERT: Pre-training of deep bidirectional transformers for language understanding.
  8. Hoppity: Learning graph transformations to detect and fix bugs in programs. In International Conference on Learning Representations.
  9. MixCode: Enhancing code classification by mixup-based data augmentation. In SANER, pages 379–390.
  10. Boosting source code learning with data augmentation: An empirical study. arXiv preprint arXiv:2303.06808.
  11. On the effectiveness of graph data augmentation for source code learning. Knowledge-Based Systems, 285:111328.
  12. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.
  13. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547.
  14. Fuzz testing based data augmentation to improve robustness of deep neural networks. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, page 1147–1158, New York, NY, USA. Association for Computing Machinery.
  15. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366.
  16. Re-factoring based program repair applied to programming assignments. In 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 388–398. IEEE.
  17. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  18. Code smells and refactoring: A tertiary systematic review of challenges and observations. Journal of Systems and Software, 167:110610.
  19. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS).
  20. Andrzej Maćkiewicz and Waldemar Ratajczak. 1993. Principal components analysis (PCA). Computers & Geosciences, 19(3):303–342.
  21. Contrastive learning with keyword-based data augmentation for code search and code question answering. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3609–3619, Dubrovnik, Croatia. Association for Computational Linguistics.
  22. A search-based testing framework for deep neural networks of source code embedding. In 14th IEEE Conference on Software Testing, Verification and Validation (ICST), pages 36–46, Los Alamitos, CA, USA. IEEE Computer Society.
  23. CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks.
  24. Correlation coefficients: appropriate use and interpretation. Anesthesia & Analgesia, 126(5):1763–1768.
  25. On the importance of building high-quality training datasets for neural code search. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, page 1609–1620, New York, NY, USA. Association for Computing Machinery.
  26. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution, pages 476–480. IEEE.
  27. Bridging pre-trained models and downstream tasks for source code understanding. In Proceedings of the 44th International Conference on Software Engineering, pages 287–298, New York, NY, USA. Association for Computing Machinery.
  28. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 261–271. IEEE.
  29. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  30. Reinforcing adversarial robustness using model confidence induced by adversarial training. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5334–5342. PMLR.
  31. Unsupervised data augmentation for consistency training. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.
  32. ExploitGen: Template-augmented exploit code generation based on CodeBERT. Journal of Systems and Software, 197:111577.
  33. Natural attack for pre-trained models of code. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, page 1482–1493, New York, NY, USA. Association for Computing Machinery.
  34. A survey of automated data augmentation algorithms for deep learning-based image classification tasks. Knowledge and Information Systems, 65(7):2805–2861.
  35. Michihiro Yasunaga and Percy Liang. 2020. Graph-based, self-supervised program repair from diagnostic feedback. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org.
  36. Adversarial examples for models of code. Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–30.
  37. Data augmentation by program transformation. Journal of Systems and Software, 190:111304.
  38. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations (ICLR).
  39. Generating adversarial examples for holding robustness of source code processing models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1169–1176.
  40. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Association for Computing Machinery.
Authors (6)
  1. Zeming Dong (4 papers)
  2. Qiang Hu (149 papers)
  3. Xiaofei Xie (106 papers)
  4. Maxime Cordy (61 papers)
  5. Mike Papadakis (64 papers)
  6. Jianjun Zhao (63 papers)
