
GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

Published 24 Feb 2024 in cs.SE and cs.AI | (2402.15769v3)

Abstract: Pre-trained code models lead the era of code intelligence, with multiple models designed with impressive performance. However, one important problem, data augmentation for code, which automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. Simply put, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates and then identifies the important ones as training data via influence scores. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific LLMs (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.


Summary

  • The paper presents GenCode, a dual-stage framework that generates augmented code samples using both semantic-preserving and syntax-breaking transformations, followed by loss-based sample selection.
  • Experimental results show up to a 4.52% increase in accuracy and an 8.42% reduction in attack success rate across various tasks and models.
  • The framework accelerates convergence and provides robust data augmentation, setting new standards for improving deep learning-based code understanding.

GenCode: A Detailed Analysis of a Data Augmentation Framework for Code Understanding

This essay presents an in-depth examination of "GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding" (2402.15769). The paper introduces an innovative approach, GenCode, designed to enhance code understanding models through a structured, generation-and-selection mechanism. This analysis provides a thorough overview of the methodology, results, and implications of the proposed framework for experienced researchers in the field.

Introduction to GenCode Framework

GenCode is proposed to tackle the challenges associated with training data preparation for code models, which is often labor-intensive and costly. Traditional methods such as code refactoring have shown limited benefits in improving model performance. GenCode adopts a dual-stage approach: it first generates potential training data using various code transformation techniques, then selects the most informative samples using importance metrics based primarily on loss values. This methodology aims to improve model accuracy and robustness, creating a more versatile framework applicable across various programming tasks and models.

Methodology

Data Generation and Selection Paradigm

GenCode differentiates itself by integrating both semantic-preserving and syntax-breaking code transformations to augment datasets:

  • Semantic-Preserving Methods: These techniques, including traditional code refactoring, alter code structure without changing its functionality.
  • Syntax-Breaking Methods: Inspired by advances in natural language processing, these methods intentionally modify code syntax to challenge model generalization.
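To make the two families concrete, here is a minimal sketch of one transformation from each, assuming string-level code snippets. These toy functions are not drawn from the paper's 23 transformations; a real implementation would operate on the AST rather than raw text.

```python
import random
import re

def rename_variable(code: str, old: str, new: str) -> str:
    """Semantic-preserving: renaming an identifier keeps behavior intact.
    Word boundaries avoid touching substrings of other identifiers."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def delete_random_token(code: str) -> str:
    """Syntax-breaking: dropping a token can make the snippet invalid,
    pushing the model away from relying on surface syntax alone."""
    tokens = code.split()
    if len(tokens) > 1:
        tokens.pop(random.randrange(len(tokens)))
    return " ".join(tokens)

snippet = "def add(a, b): return a + b"
print(rename_variable(snippet, "a", "x"))  # → def add(x, b): return x + b
print(delete_random_token(snippet))        # one token removed at random
```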

The selection process computes the loss value for each generated sample, ranks the candidates, and chooses the top-K samples for model training, where K matches the original training dataset size.

Figure 1: Correlation between loss values and code model accuracy.
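The loss-based selection step can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the helper names are hypothetical, and it assumes the model's per-sample predicted class probabilities are already available.

```python
import numpy as np

def per_sample_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Cross-entropy of each sample, given predicted class probabilities."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def select_top_k(candidates, probs, labels, k):
    """GenCode-style selection: rank candidates by loss, keep the top-k."""
    losses = per_sample_loss(probs, labels)
    order = np.argsort(-losses)[:k]  # indices of the k highest-loss samples
    return [candidates[i] for i in order]

# Three candidates; the model is least confident on the second one.
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
labels = np.array([0, 0, 0])
print(select_top_k(["c1", "c2", "c3"], probs, labels, 2))  # → ['c2', 'c3']
```

In the actual framework, K equals the original training-set size, so each epoch trains on a dataset of constant size whose composition shifts toward harder samples.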

Search Space and Importance Metric

GenCode’s effectiveness relies on its expansive search space, currently supporting 18 semantic-preserving and 5 syntax-breaking transformations. Key to its approach is the assumption, corroborated by the paper’s preliminary study, that samples with higher loss values contribute more significantly to model refinement.

Figure 2: Workflow of GenCode in one training epoch.
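The per-epoch workflow can be condensed into a toy loop. The `augment` and `loss` stand-ins below are hypothetical placeholders (the real framework applies its 23 code transformations and scores candidates with the model's actual loss), so this only illustrates the control flow.

```python
# Toy stand-ins for GenCode's two stages (hypothetical helpers).
def augment(sample):
    """Stage 1: generate several candidates from one training sample."""
    return [sample + f"_aug{i}" for i in range(3)]

def loss(candidate):
    """Stage 2 scoring: the real metric is model loss; length here is a stub."""
    return len(candidate)

def gencode_epoch(train_set):
    """One GenCode epoch: generate candidates from every sample, then keep
    the K highest-loss ones, where K = len(train_set)."""
    candidates = [c for s in train_set for c in augment(s)]
    candidates.sort(key=loss, reverse=True)
    return candidates[:len(train_set)]  # selected data fed to the trainer

print(gencode_epoch(["foo", "ab"]))  # → ['foo_aug0', 'foo_aug1']
```

Because selection runs every epoch, the effective training set adapts as the model improves: samples that were hard early on may drop out once the model learns them.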

Experimental Results

GenCode was evaluated across multiple tasks (bug detection, authorship attribution, and problem classification) and models (CodeBERT, GraphCodeBERT, and CodeT5). Experimental results highlighted:

  • Accuracy Improvement: GenCode enhanced model accuracy by up to 4.52% over models trained without augmentation and surpassed existing methods like MixCode by a significant margin.
  • Robustness Gains: The framework reduced the attack success rate by 8.42% on average, underscoring its efficacy in bolstering model robustness against adversarial attacks.
  • Efficiency of Convergence: The methodology accelerated convergence across all training phases, as evidenced by consistent superiority throughout training epochs.

Figure 3: Convergence speed of CodeBERT using different code augmentation methods in each task.

Discussion and Implications

The implications of GenCode’s results are multifold:

  • Theoretical Contributions: By validating that high-loss samples enhance learning, GenCode challenges and expands existing narratives in data augmentation research.
  • Practical Applications: The implementation of GenCode can lead to the development of more robust coding assistants and automated bug detection systems, enhancing their resilience and accuracy.
  • Future Research: Potential expansions include integrating uncertainty metrics and applying GenCode to fine-tune LLMs, such as Llama, to enhance generalization across diverse programming languages.

Figure 4: Visualization of code embeddings after dimension reduction using Principal Component Analysis (PCA). Model: CodeBERT, dataset: Refactory, task: Bug detection.
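A PCA projection like the one in Figure 4 is straightforward to reproduce. The sketch below assumes you already have a matrix of code embeddings (e.g., CodeBERT [CLS] vectors); it uses an SVD-based PCA rather than any library the authors may have used.

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings onto their top-2 principal
    components for 2-D visualization."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# e.g., 100 embeddings of dimension 768 (CodeBERT-sized, random here)
emb = np.random.randn(100, 768)
points = pca_2d(emb)          # shape (100, 2), ready for a scatter plot
```

Plotting the two columns of `points`, colored by class label, shows how well the model separates (for example) buggy from non-buggy code in embedding space.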

Conclusion

GenCode represents a significant advancement in data augmentation for code understanding. It combines innovative methodologies to improve both the accuracy and robustness of deep learning models. GenCode sets a new standard in the field, demonstrating how structured data generation and selection can effectively refine neural models. Future work may build upon these findings to explore more complex code transformations and the integration of additional importance metrics.
