Heterogeneous Directed Hypergraph Neural Network over abstract syntax tree (AST) for Code Classification (2305.04228v3)
Abstract: Code classification is a difficult problem in program understanding and automatic coding. Due to the elusive syntax and complicated semantics of programs, most existing studies use techniques based on the abstract syntax tree (AST) and graph neural networks (GNN) to create code representations for code classification. These techniques exploit the structural and semantic information of the code, but they only consider pairwise associations and neglect the high-order correlations that already exist between nodes in the AST, which may result in the loss of code structural information. On the other hand, while a general hypergraph can encode high-order data correlations, it is homogeneous and undirected, so modeling an AST with it discards semantic and structural information such as node types, edge types, and the direction between child and parent nodes. In this study, we propose to represent the AST as a heterogeneous directed hypergraph (HDHG) and to process the graph with a heterogeneous directed hypergraph neural network (HDHGN) for code classification. Our method improves code understanding and can represent high-order data correlations beyond pairwise interactions. We evaluate the HDHGN on public datasets of Python and Java programs. Our method outperforms previous AST-based and GNN-based methods, which demonstrates the capability of our model.
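To make the HDHG idea concrete, the following is a minimal sketch (not the paper's exact construction) of turning a Python AST into a heterogeneous directed hypergraph: each hyperedge points from one parent node (the head) to all of its children at once (the tail), so a parent with k children forms a single high-order relation rather than k pairwise edges. Node types are taken from the AST class names and the hyperedge type from the parent's class; the representation and naming here are illustrative assumptions.

```python
import ast

def ast_to_hdhg(source):
    """Build a toy heterogeneous directed hypergraph (HDHG) from a Python AST.

    Returns:
      nodes:      dict mapping node id -> node type (heterogeneous node labels)
      hyperedges: list of (edge_type, head_id, tail_ids) directed hyperedges,
                  where one hyperedge links a parent to all of its children.
    """
    tree = ast.parse(source)
    nodes = {}
    hyperedges = []

    def visit(node):
        nid = len(nodes)
        nodes[nid] = type(node).__name__
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        if children:
            # One high-order relation per parent, typed by the parent's class.
            hyperedges.append((type(node).__name__, nid, children))
        return nid

    visit(tree)
    return nodes, hyperedges

nodes, edges = ast_to_hdhg("def add(a, b):\n    return a + b")
```

In this sketch the `BinOp` node of `a + b` yields one hyperedge covering its left operand, operator, and right operand together, which is exactly the kind of multi-node correlation a pairwise GNN edge list cannot express directly.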