
Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? (2312.00413v1)

Published 1 Dec 2023 in cs.SE, cs.AI, cs.CL, and cs.PL

Abstract: Programming language understanding and representation (a.k.a. code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of the source code features while preserving its semantics. These representations can be used for facilitating subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of the AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation in certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
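
To make the two input views named in the abstract concrete, the following minimal sketch derives a token sequence and a pre-order sequence of AST node types from the same snippet. It is an illustration only, not the paper's pipeline: it uses Python's standard `tokenize` and `ast` modules on a Python snippet, and the names `SOURCE`, `token_sequence`, and `ast_node_sequence` are hypothetical. Which parsers, preprocessing steps, and encoders the study actually evaluates is not stated in the abstract.

```python
import ast
import io
import tokenize

# Hypothetical example snippet; any small, valid Python function would do.
SOURCE = "def add(a, b):\n    return a + b\n"

# Token types that carry layout only and are dropped from the Token view.
_LAYOUT = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
    tokenize.COMMENT,
}


def token_sequence(source: str) -> list[str]:
    """Return the lexical tokens of `source` as a flat list of strings."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _LAYOUT]


def ast_node_sequence(source: str) -> list[str]:
    """Return AST node type names in pre-order (depth-first) order."""

    def visit(node: ast.AST):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)

    return list(visit(ast.parse(source)))


if __name__ == "__main__":
    print("Token view:", token_sequence(SOURCE))
    print("AST view:  ", ast_node_sequence(SOURCE))
```

The Token view keeps identifiers and operators in source order, while the AST view exposes syntactic categories such as `FunctionDef`, `Return`, and `BinOp` that never appear literally in the text. A flattened traversal is only one of several ways to feed a tree to a model; it is chosen here because it is the simplest to show.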
