
COMEX: A Tool for Generating Customized Source Code Representations (2307.04693v1)

Published 10 Jul 2023 in cs.SE and cs.AI

Abstract: Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, LLMs like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code data, achieving state of the art performance on several software engineering (SE) tasks. However, valid source code, unlike natural language, follows a strict structure and pattern governed by the underlying grammar of the programming language. Current LLMs do not exploit this property of the source code as they treat code like a sequence of tokens and overlook key structural and semantic properties of code that can be extracted from code-views like the Control Flow Graph (CFG), Data Flow Graph (DFG), Abstract Syntax Tree (AST), etc. Unfortunately, the process of generating and integrating code-views for every programming language is cumbersome and time consuming. To overcome this barrier, we propose our tool COMEX - a framework that allows researchers and developers to create and combine multiple code-views which can be used by ML models for various SE tasks. Some salient features of our tool are: (i) it works directly on source code (which need not be compilable), (ii) it currently supports Java and C#, (iii) it can analyze both method-level snippets and program-level snippets by using both intra-procedural and inter-procedural analysis, and (iv) it is easily extendable to other languages as it is built on tree-sitter - a widely used incremental parser that supports over 40 languages. We believe this easy-to-use code-view generation and customization tool will give impetus to research in source code representation learning methods and ML4SE. Tool: https://pypi.org/project/comex - GitHub: https://github.com/IBM/tree-sitter-codeviews - Demo: https://youtu.be/GER6U87FVbU


Summary

  • The paper introduces COMEX, which generates customizable code views that capture both syntactic and semantic structures for improved representation learning.
  • It builds code-views such as the CFG, DFG, and AST using intra-procedural and inter-procedural static analysis, and it works even on non-compilable code.
  • COMEX currently supports Java and C#. Because it is built on tree-sitter, it can be extended to other languages, and it has been tested on datasets such as CodeNet and CodeSearchNet.

COMEX: A Tool for Generating Customized Source Code Representations

The paper introduces COMEX, a tool that supports source code representation learning by generating and customizing code-views for software engineering tasks. Its motivation is a limitation of existing approaches, which mostly treat source code as plain text sequences and overlook its structure. Unlike natural language, valid source code follows strict syntactic and semantic rules governed by the grammar of its programming language. LLMs like Codex and CodeGen achieve strong performance while treating code as token sequences, but they do not exploit this structure.

COMEX lets users generate and customize multiple code-views, such as the Control Flow Graph (CFG), Data Flow Graph (DFG), and Abstract Syntax Tree (AST), so that ML models can consume structural information rather than raw tokens alone. The tool addresses three main shortcomings of existing solutions: dependence on compilable code, restriction to a single language, and lack of support for both intra-procedural and inter-procedural analysis.
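To make the idea of a code-view concrete, the following is a minimal sketch of a statement-level control-flow view. It is not COMEX's implementation or API: COMEX targets Java and C# through tree-sitter, whereas this illustration uses Python's standard `ast` module as a stand-in parser, and the helper name `cfg_edges` is hypothetical. It handles only straight-line code and if/else.

```python
import ast

SOURCE = """\
x = read()
if x > 0:
    y = 1
else:
    y = -1
print(y)
"""


def cfg_edges(stmts, preds, edges):
    """Append control-flow edges (from_line, to_line) for a statement list.

    `preds` holds the line numbers whose control flows into the first
    statement; the return value is the set that flows out of the last one.
    Only sequences and if/else are handled in this sketch (assumption).
    """
    for stmt in stmts:
        for p in sorted(preds):
            edges.append((p, stmt.lineno))
        if isinstance(stmt, ast.If):
            then_exits = cfg_edges(stmt.body, {stmt.lineno}, edges)
            else_exits = cfg_edges(stmt.orelse, {stmt.lineno}, edges)
            preds = then_exits | else_exits
        else:
            preds = {stmt.lineno}
    return preds


tree = ast.parse(SOURCE)

# Syntax view: the node types of the top-level statements.
top_level_types = [type(node).__name__ for node in tree.body]

# Control-flow view: edges between statement line numbers.
control_flow = []
cfg_edges(tree.body, set(), control_flow)

print(top_level_types)
print(sorted(control_flow))
```

The same source yields two different views: the syntax view records what each statement is, while the control-flow view records which statement can execute after which. Both branches of the conditional flow into the final `print`, which a token sequence does not make explicit.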

Key Features

COMEX offers the following distinctive features:

  • Direct Code Interaction: It processes both compilable and non-compilable source code directly, making it applicable to incomplete code snippets often found in research datasets.
  • Language Support: COMEX currently supports Java and C#. It can be extended to additional languages because it is built on tree-sitter, a widely used incremental parser that supports over 40 languages.
  • Multi-level Analysis: The tool performs both intra-procedural and inter-procedural analyses, enabling the generation of representations at both the method and program levels.
  • Customizable Code-Views: Users can generate combinations of code views and customize them to suit specific machine learning tasks. This flexibility allows for the creation of over 15 different customized representations.
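The features above can be illustrated with a small sketch that combines two views under a user-supplied configuration. Everything here is an assumption made for illustration: the names `ast_view`, `dfg_view`, `VIEW_BUILDERS`, `combine`, and the `config` layout are hypothetical and are not COMEX's API, and Python's standard `ast` module stands in for tree-sitter. The input fragment refers to names defined elsewhere, so it cannot run on its own, yet it can still be parsed and analyzed.

```python
import ast

# A method-level fragment: `fetch` and `limit` are defined elsewhere, so this
# cannot be executed in isolation, but it is still syntactically valid.
FRAGMENT = "rows = fetch(limit)\ncount = len(rows)\n"


def ast_view(tree, skip=()):
    """Parent/child edges between node-type names, omitting types in `skip`."""
    return [
        (type(parent).__name__, type(child).__name__)
        for parent in ast.walk(tree)
        for child in ast.iter_child_nodes(parent)
        if type(parent).__name__ not in skip
        and type(child).__name__ not in skip
    ]


def dfg_view(tree):
    """Def-use edges (name, def_line, use_line) for straight-line code."""
    last_def, edges = {}, []
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads in a statement see definitions from earlier statements.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((n.id, last_def[n.id], n.lineno))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = n.lineno
    return edges


VIEW_BUILDERS = {"ast": ast_view, "dfg": dfg_view}


def combine(code, config):
    """Build each requested view and tag its edges with the view name."""
    tree = ast.parse(code)
    combined = []
    for view, options in config.items():
        for edge in VIEW_BUILDERS[view](tree, **options):
            combined.append((view, edge))
    return combined


# Customization: request both views, but drop Load/Store context nodes
# from the syntax view to keep the graph smaller.
config = {"ast": {"skip": ("Load", "Store")}, "dfg": {}}
graph = combine(FRAGMENT, config)
```

Tagging each edge with its view name keeps the combined graph usable by models that treat edge types differently, and the per-view options show one simple way a representation could be tailored to a task.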

Implications and Evaluation

COMEX contributes to both static analysis and representation learning. Because it generates code-views without requiring compilation, it applies to cases where code snippets are incomplete or isolated at the method level. COMEX has also been tested on large software engineering datasets such as CodeNet and CodeSearchNet, which supports its robustness on real-world code.

The potential impact of COMEX on the ML4SE community is substantial. By lowering the barrier for researchers to utilize complex code representations, it encourages the development of new methods that exploit source code’s structural properties. This shift could lead to more efficient and accurate ML models that go beyond treating code as linear text.

Limitations and Future Directions

While COMEX provides a comprehensive framework for generating and analyzing code-views, there is room for improvement. One noted limitation is that its alias analysis is approximate, a consequence of having to handle non-compilable code. More precise alias analysis could improve the accuracy of data-flow relations, especially in inter-procedural analyses.
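As a rough illustration of why alias handling affects data-flow accuracy, the sketch below groups names that may refer to the same object. It is not COMEX's algorithm: it is a simple flow-insensitive approximation chosen for illustration, the function name `may_alias_groups` is hypothetical, and Python's standard `ast` module is used as a stand-in parser.

```python
import ast

# `b` aliases `a`, so the mutation through `b` is visible when `a` is read.
# A purely name-based def-use analysis would not connect lines 3 and 4.
CODE = "a = []\nb = a\nb.append(1)\nprint(a)\n"


def may_alias_groups(code):
    """Group names linked by plain `x = y` assignments (flow-insensitive)."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        ):
            target = node.targets[0].id
            find(target)  # register the assigned name
            if isinstance(node.value, ast.Name):
                # `target = other`: both names may refer to the same object.
                parent[find(target)] = find(node.value.id)

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


groups = may_alias_groups(CODE)
print(groups)
```

This approximation over-reports: once two names are linked they stay in the same group even if one is later reassigned. Such trade-offs are typical when precise information from a compiler is unavailable.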

Future work could focus on incorporating more complex code-view combinations and expanding language support to include other widely-used programming languages such as Python and C++. Additionally, integrating more nuanced aspects of code analysis, like precise alias tracking and interactive visualization tools, could broaden its utility in static code analysis and Machine Learning for Software Engineering.

In summary, COMEX represents a significant stride in the development of practical tools for source code representation, offering flexibility and support that could stimulate further innovations in programming language processing and software engineering research.
