
GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

Published 24 Feb 2024 in cs.SE and cs.AI | (2402.15769v3)

Abstract: Pre-trained code models lead the era of code intelligence, with multiple models designed with impressive performance. However, one important problem, data augmentation for code, which automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. Simply put, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates and then identifies the important ones as training data via influence scores. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific LLMs (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.


Summary

  • The paper presents GenCode, a dual-stage framework that generates augmented code samples using both semantic-preserving and syntax-breaking transformations, followed by loss-based sample selection.
  • Experimental results show up to a 4.52% increase in accuracy and an 8.42% reduction in attack success rate across various tasks and models.
  • The framework accelerates convergence and provides robust data augmentation, setting new standards for improving deep learning-based code understanding.

GenCode: A Detailed Analysis of a Data Augmentation Framework for Code Understanding

This essay presents an in-depth examination of "GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding" (2402.15769). The paper introduces an innovative approach, GenCode, designed to enhance code understanding models through a structured, generation-and-selection mechanism. This analysis provides a thorough overview of the methodology, results, and implications of the proposed framework for experienced researchers in the field.

Introduction to GenCode Framework

GenCode is proposed to tackle the challenges associated with training data preparation for code models, which is often labor-intensive and costly. Traditional methods such as code refactoring have shown limited benefits in improving model performance. GenCode adopts a dual-stage approach: it first generates potential training data using various code transformation techniques, then selects the most informative samples using importance metrics based primarily on loss values. This methodology aims to improve model accuracy and robustness, creating a more versatile framework applicable across various programming tasks and models.

Methodology

Data Generation and Selection Paradigm

GenCode differentiates itself by integrating both semantic-preserving and syntax-breaking code transformations to augment datasets:

  • Semantic-Preserving Methods: These techniques, including traditional code refactoring, alter code structure without changing its functionality.
  • Syntax-Breaking Methods: Inspired by advances in natural language processing, these methods intentionally modify code syntax to challenge model generalization.
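To make the two families concrete, here is a minimal sketch of one transformation from each, assuming string-level code snippets. These toy functions are not drawn from the paper's 23 transformations; a real implementation would operate on the AST rather than raw text.

```python
import random
import re

def rename_variable(code: str, old: str, new: str) -> str:
    """Semantic-preserving: renaming an identifier keeps behavior intact.
    Word boundaries avoid touching substrings of other identifiers."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def delete_random_token(code: str) -> str:
    """Syntax-breaking: dropping a token can make the snippet invalid,
    pushing the model away from relying on surface syntax alone."""
    tokens = code.split()
    if len(tokens) > 1:
        tokens.pop(random.randrange(len(tokens)))
    return " ".join(tokens)

snippet = "def add(a, b): return a + b"
print(rename_variable(snippet, "a", "x"))  # → def add(x, b): return x + b
print(delete_random_token(snippet))        # one token removed at random
```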

The selection process computes the loss value for each generated sample, ranks the candidates, and chooses the top-K samples for model training, where K matches the original training dataset size.

Figure 1: Correlation between loss values and code model accuracy.
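The loss-based selection step can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the helper names are hypothetical, and it assumes the model's per-sample predicted class probabilities are already available.

```python
import numpy as np

def per_sample_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Cross-entropy of each sample, given predicted class probabilities."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def select_top_k(candidates, probs, labels, k):
    """GenCode-style selection: rank candidates by loss, keep the top-k."""
    losses = per_sample_loss(probs, labels)
    order = np.argsort(-losses)[:k]  # indices of the k highest-loss samples
    return [candidates[i] for i in order]

# Three candidates; the model is least confident on the second one.
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
labels = np.array([0, 0, 0])
print(select_top_k(["c1", "c2", "c3"], probs, labels, 2))  # → ['c2', 'c3']
```

In the actual framework, K equals the original training-set size, so each epoch trains on a dataset of constant size whose composition shifts toward harder samples.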

Search Space and Importance Metric

GenCode’s effectiveness relies on its expansive search space, currently supporting 18 semantic-preserving and 5 syntax-breaking transformations. Key to its approach is the assumption, corroborated by the paper’s preliminary study, that samples with higher loss values contribute more significantly to model refinement.

Figure 2: Workflow of GenCode in one training epoch.
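The per-epoch workflow can be condensed into a toy loop. The `augment` and `loss` stand-ins below are hypothetical placeholders (the real framework applies its 23 code transformations and scores candidates with the model's actual loss), so this only illustrates the control flow.

```python
# Toy stand-ins for GenCode's two stages (hypothetical helpers).
def augment(sample):
    """Stage 1: generate several candidates from one training sample."""
    return [sample + f"_aug{i}" for i in range(3)]

def loss(candidate):
    """Stage 2 scoring: the real metric is model loss; length here is a stub."""
    return len(candidate)

def gencode_epoch(train_set):
    """One GenCode epoch: generate candidates from every sample, then keep
    the K highest-loss ones, where K = len(train_set)."""
    candidates = [c for s in train_set for c in augment(s)]
    candidates.sort(key=loss, reverse=True)
    return candidates[:len(train_set)]  # selected data fed to the trainer

print(gencode_epoch(["foo", "ab"]))  # → ['foo_aug0', 'foo_aug1']
```

Because selection runs every epoch, the effective training set adapts as the model improves: samples that were hard early on may drop out once the model learns them.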

Experimental Results

GenCode was evaluated across multiple tasks (bug detection, authorship attribution, and problem classification) and models (CodeBERT, GraphCodeBERT, and CodeT5). Experimental results highlighted:

  • Accuracy Improvement: GenCode enhanced model accuracy by up to 4.52% over models trained without augmentation and surpassed existing methods like MixCode by a significant margin.
  • Robustness Gains: The framework reduced the attack success rate by 8.42% on average, underscoring its efficacy in bolstering model robustness against adversarial attacks.
  • Efficiency of Convergence: The methodology accelerated convergence across all training phases, as evidenced by consistent superiority throughout training epochs.

Figure 3: Convergence speed of CodeBERT using different code augmentation methods in each task.

Discussion and Implications

The implications of GenCode’s results are multifold:

  • Theoretical Contributions: By validating that high-loss samples enhance learning, GenCode challenges and expands existing narratives in data augmentation research.
  • Practical Applications: The implementation of GenCode can lead to the development of more robust coding assistants and automated bug detection systems, enhancing their resilience and accuracy.
  • Future Research: Potential expansions include integrating uncertainty metrics and applying GenCode to fine-tune LLMs, such as Llama, to enhance generalization across diverse programming languages.

Figure 4: Visualization of code embeddings after dimension reduction using Principal Component Analysis (PCA). Model: CodeBERT, dataset: Refactory, task: Bug detection.
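A PCA projection like the one in Figure 4 is straightforward to reproduce. The sketch below assumes you already have a matrix of code embeddings (e.g., CodeBERT [CLS] vectors); it uses an SVD-based PCA rather than any library the authors may have used.

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings onto their top-2 principal
    components for 2-D visualization."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# e.g., 100 embeddings of dimension 768 (CodeBERT-sized, random here)
emb = np.random.randn(100, 768)
points = pca_2d(emb)          # shape (100, 2), ready for a scatter plot
```

Plotting the two columns of `points`, colored by class label, shows how well the model separates (for example) buggy from non-buggy code in embedding space.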

Conclusion

GenCode represents a significant advancement in data augmentation for code understanding. It combines innovative methodologies to improve both the accuracy and robustness of deep learning models. GenCode sets a new standard in the field, demonstrating how structured data generation and selection can effectively refine neural models. Future work may build upon these findings to explore more complex code transformations and the integration of additional importance metrics.
