Code Graph Extractor

Updated 10 December 2025
  • Code Graph Extractor is a tool that transforms raw source code into directed, labeled graphs representing syntactic, semantic, and dependency relationships.
  • It employs multi-stage pipelines—including parsing, semantic enrichment, and view construction—to accurately build and merge various graph views.
  • The extractor underpins advanced applications such as machine learning for code analysis, vulnerability detection, and interactive code visualization with scalable performance.

A code graph extractor is a computational pipeline, tool, or framework that transforms raw source code into a formal graph-based representation of the program’s syntactic, semantic, and dependency relationships. This transformation enables downstream tasks in software analysis, machine learning for software engineering, vulnerability detection, code search, and code visualization. Code graph extractors are implemented with diverse architectures, including static analyzers, workflows that combine parsing with semantic enrichment, graph rewrite systems, and configurable DSL-driven frameworks. They cover multiple programming languages and graph formalisms, and they interoperate with graph databases and neural computation libraries.

1. Formalisms and Graph Structures

Code graph extractors produce directed, labeled graphs with node and edge semantics specific to the domain. The standard form is $G = (V, E)$, where $V$ is a partitioned set of code entities and $E$ is a multiset of edges tagged by relation type.

  • RefExpo defines $V$ partitioned into $V_{\text{class}}$ (classes/types), $V_{\text{func}}$ (functions/methods), and $V_{\text{mod}}$ (modules/packages) (Haratian et al., 2 Jul 2024). Edge types $L$ include "calls", "imports", "extends"/"implements", "refs", and "instantiates". Formally, each edge is written $(u \to^{k} v)$ for $k \in L$.
  • COMEX outputs $G = (V, E)$ where $V$ carries node-type $\tau(v) \in T$ (e.g., "IfStmt", "CallExpr", "BasicBlock") and $E$ is labeled by $\Sigma$ ("cfg", "dfg", etc.) (Das et al., 2023).
  • Code Property Graph (CPG), as in AI4VA and QVoG, is a multigraph encompassing multiple views: AST, control-flow, data-flow, and dependence (Liu et al., 12 Jun 2024, Suneja et al., 2020). CPG merges nodes from all underlying structures with labeled edges such as IS_AST_CHILD, CFG_NEXT, DEF, USE, CONTROLS, and INVOKES.
  • Code Context Graph (CCG), as in GraphCoder, formalizes the graph as $G = (X, E, T, \lambda)$, where $T$ is a set of edge types: control-flow (CF), control-dependence (CD), data-dependence (DD) (Liu et al., 11 Jun 2024).
  • Program-Derived Semantics Graph (PSG) introduces a hierarchical, multi-level graph $G = (V, E, K, \tau, \lambda)$ with intra-level and cross-level edges to capture semantic abstractions (Iyer et al., 2020).
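
As a minimal sketch of the $G = (V, E)$ form described above, the following Python class stores a partitioned node set and a multiset of labeled edges. The class name, method names, and node identifiers are hypothetical and chosen for illustration; they are not the API of RefExpo or any other tool. The node kinds and edge labels follow the RefExpo bullet.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical vocabularies for illustration, modeled on the RefExpo description.
NODE_KINDS = {"class", "func", "mod"}
EDGE_LABELS = {"calls", "imports", "extends", "implements", "refs", "instantiates"}


@dataclass
class CodeGraph:
    """Directed, labeled graph G = (V, E) with V partitioned by node kind."""
    nodes: dict = field(default_factory=dict)        # node id -> kind
    edges: Counter = field(default_factory=Counter)  # (src, label, dst) -> multiplicity

    def add_node(self, node_id: str, kind: str) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, src: str, label: str, dst: str) -> None:
        if label not in EDGE_LABELS:
            raise ValueError(f"unknown edge label: {label}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        # E is a multiset, so repeated edges are counted rather than collapsed.
        self.edges[(src, label, dst)] += 1

    def partition(self, kind: str) -> set:
        """Return the subset of V with the given kind, e.g. V_func."""
        return {n for n, k in self.nodes.items() if k == kind}


g = CodeGraph()
g.add_node("app", "mod")
g.add_node("app.Service", "class")
g.add_node("app.Service.run", "func")
g.add_node("app.helper", "func")
g.add_edge("app.Service.run", "calls", "app.helper")
g.add_edge("app.Service.run", "calls", "app.helper")  # a second call site
```

Keeping edges in a `Counter` is one simple way to preserve multiplicity; a tool that needs per-edge attributes (such as source locations) would more likely store a list of edge records instead.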

Graph matching and evaluation employ recall, precision, and $F_1$ defined over edge sets, e.g.,

$$R = \frac{|E_{\mathrm{det}} \cap E_{\mathrm{gt}}|}{|E_{\mathrm{gt}}|}, \quad P = \frac{|E_{\mathrm{det}} \cap E_{\mathrm{gt}}|}{|E_{\mathrm{det}}|}$$

as in RefExpo (Haratian et al., 2 Jul 2024).
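
The recall and precision definitions above can be computed directly with set operations. The sketch below assumes edges are compared as sets of `(source, label, target)` triples, so multiplicities are ignored, and it uses the standard harmonic-mean definition of $F_1$, which the text names but does not spell out. The function name and sample edges are hypothetical.

```python
def edge_metrics(detected: set, ground_truth: set) -> tuple:
    """Return (recall, precision, f1) over two sets of edge triples."""
    true_positives = len(detected & ground_truth)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(detected) if detected else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall (standard definition).
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Hypothetical edge sets for illustration.
detected = {("a", "calls", "b"), ("a", "calls", "c"), ("b", "imports", "d")}
ground_truth = {
    ("a", "calls", "b"),
    ("b", "imports", "d"),
    ("c", "calls", "d"),
    ("c", "refs", "a"),
}
recall, precision, f1 = edge_metrics(detected, ground_truth)
```

Here two of the three detected edges are correct and two of the four ground-truth edges are found, so recall is 1/2, precision is 2/3, and $F_1$ is 4/7.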

2. Extraction Algorithms and Architectures

Code graph extraction workflows are typically multi-stage and modular.

  • Parsing stage: Generates a language-native AST (e.g., tree-sitter for COMEX and CodeLens, JavaParser or custom frontend for others).
  • Semantic enrichment: Resolves symbols, types, and scopes, populating symbol tables and cross-references (RefExpo, COMEX).
  • View construction: Traverses the enriched AST to emit graph nodes and edges representative of dependencies, flows, and structural relations (RefExpo, COMEX, GraphCoder).
  • Graph composition: Optionally merges multiple views (AST, CFG, DFG, PDG) into a unified graph, possibly with cross-view links (COMEX, CONCORD).
  • Reductions and transformations: Apply pruning heuristics, node fusion, or abstraction lifting (CONCORD reduction heuristics, deGraphCS graph optimizations).
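
To make the view-construction and graph-composition stages concrete, here is a deliberately small sketch that builds two views over the same statement nodes and merges them into one labeled edge list. It uses Python's standard `ast` module rather than tree-sitter, handles straight-line code only (no branches, loops, or augmented assignments), and is not how COMEX or any other tool implements these stages. The function name `build_views` is hypothetical; the "cfg" and "dfg" labels mirror those mentioned earlier.

```python
import ast


def build_views(source: str) -> tuple:
    """Return (statements, edges) for straight-line code.

    Edges are (src_index, label, dst_index) triples, where the label is
    "cfg" for sequential control flow or "dfg" for a def-use dependency.
    """
    stmts = ast.parse(source).body
    edges = []
    last_def = {}  # variable name -> index of the statement that last assigned it
    for i, stmt in enumerate(stmts):
        if i > 0:
            edges.append((i - 1, "cfg", i))  # control-flow view
        loads, stores = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Load):
                    loads.add(node.id)
                elif isinstance(node.ctx, ast.Store):
                    stores.add(node.id)
        for name in sorted(loads):
            if name in last_def:
                edges.append((last_def[name], "dfg", i))  # data-flow view
        for name in stores:
            last_def[name] = i
    return stmts, edges


code = "x = 1\ny = x + 2\nprint(y)\n"
stmts, edges = build_views(code)
```

Because both views share statement indices as node identifiers, composition reduces to concatenating labeled edges; real extractors need an explicit node-alignment step when views are built at different granularities.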

For example, RefExpo employs the following workflow (Haratian et al., 2 Jul 2024):

procedure ExtractDependencyGraph(source_folders):
    V ← ∅; E ← ∅
    for each file in source_folders:
        ast ← Parser.parse(file)
        typed_ast ← TypeAnalyzer.resolve(ast)
        (V_f, E_f) ← GraphBuilder.visit(typed_ast)
        V ← V ∪ V_f; E ← E ∪ E_f
    return DependencyGraph(V, E)

COMEX leverages tree-sitter, enabling rapid extraction for >40 languages, with further customizable composition of views via config-driven pipelines (Das et al., 2023).
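
The per-file loop in the RefExpo workflow above can be approximated for Python sources with the standard `ast` module. This is a rough analogue under strong assumptions, not RefExpo's implementation: it skips the type-resolution step entirely, so callee names are left unresolved, it only records calls to bare names, and nested functions are not treated specially. The function name and the in-memory `sources` mapping are hypothetical.

```python
import ast


def extract_dependency_graph(sources: dict) -> tuple:
    """Build (nodes, edges) from a mapping of module name -> source text."""
    nodes, edges = set(), set()
    for module, text in sources.items():
        tree = ast.parse(text)  # parsing stage
        nodes.add(module)
        for node in ast.walk(tree):  # view construction, without type resolution
            if isinstance(node, ast.Import):
                for alias in node.names:
                    nodes.add(alias.name)
                    edges.add((module, "imports", alias.name))
            elif isinstance(node, ast.ImportFrom) and node.module:
                nodes.add(node.module)
                edges.add((module, "imports", node.module))
            elif isinstance(node, ast.FunctionDef):
                caller = f"{module}.{node.name}"
                nodes.add(caller)
                for inner in ast.walk(node):
                    # Only calls to bare names are recorded; attribute calls
                    # such as os.getcwd() would need symbol resolution.
                    if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                        nodes.add(inner.func.id)
                        edges.add((caller, "calls", inner.func.id))
    return nodes, edges


sources = {
    "app": "import os\nfrom util import helper\n\ndef run():\n    helper()\n    os.getcwd()\n"
}
nodes, edges = extract_dependency_graph(sources)
```

The unresolved `helper` node illustrates why the semantic enrichment stage matters: without it, the extractor cannot tell that `helper` refers to `util.helper`.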

3. Language Coverage and Extensibility

Modern code graph extractors span multiple programming languages (Java, Python, C/C++, JavaScript, Scala, Go, C#, TypeScript, PHP):

  • RefExpo natively supports Java, Python, and JavaScript via IntelliJ plugin interfaces (Haratian et al., 2 Jul 2024).
  • COMEX is extendable via tree-sitter grammars; new languages are integrated by adding grammars and configuring AST filters (Das et al., 2023).
  • Fraunhofer CPG utilizes fuzzy parsing for incomplete/non-compilable code and offers frontends for Java, C/C++, Go, Python, TypeScript, LLVM-IR, extensible via JVM/JNI parser modules (Weiss et al., 2022).
  • scg-cli targets Java and Scala, serializing all entities and edges to language-agnostic protobuf (Borowski et al., 2023).
  • CONCORD leverages Joern's multi-language parsing infrastructure and declarative operation PEG grammar for graph customization (Saad et al., 31 Jan 2024).
  • GraphGen4Code, QVoG, and CodeGen4Code support large-scale extraction on Python, C, Java, and JavaScript (Abdelaziz et al., 2020, Liu et al., 12 Jun 2024).

4. Graph Reduction, Optimization, and Scalability

Scaling graph extraction to large codebases necessitates both computational efficiency and graph reduction mechanisms:

  • QVoG implements compressed CPG construction, replacing full AST graphs with statement-level nodes and dependency edges, yielding compression ratios of more than 20× over classic CPGs and enabling analysis of 1.5M LOC projects in about 15 minutes with 5.2 GB of memory (Liu et al., 12 Jun 2024).
  • CONCORD applies pruning (removal of simple assignments, prints) and edge augmentation (e.g., NextToken, ForCFG), reducing node/edge counts by up to 20% while maintaining 88% to 100% of code-smell detection performance (Saad et al., 31 Jan 2024).
  • deGraphCS eliminates temporaries and trivial opcodes, merges linear basic blocks, and fuses SSA variables, typically reducing variable-based flow graphs by 50% in node count, leading to faster GNN training/inference (Zeng et al., 2021).
  • RefExpo achieves high recall (92% for Python, 100% for Java on micro test suites) and outperforms prior tools by 31% and 7% in unique/shared result detection on macro-level benchmarks (Haratian et al., 2 Jul 2024). Its plugin model provides responsive integration into the IntelliJ IDE.
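
The pruning idea mentioned above (dropping low-value statements while keeping the graph connected) can be sketched as follows. This is a generic node-removal-with-bridging routine over a single edge type, assumed here for illustration; it is not CONCORD's or deGraphCS's actual heuristic, and the function name, predicate, and sample statements are hypothetical.

```python
def prune(nodes: dict, edges: set, should_drop) -> tuple:
    """Remove nodes whose text satisfies should_drop, bridging their edges.

    nodes maps a node id to its statement text; edges is a set of
    (src, dst) pairs. Each removed node's predecessors are connected
    directly to its successors so that paths through it are preserved.
    """
    kept_nodes = dict(nodes)
    kept_edges = set(edges)
    for node_id, text in nodes.items():
        if not should_drop(text):
            continue
        preds = {s for s, d in kept_edges if d == node_id and s != node_id}
        succs = {d for s, d in kept_edges if s == node_id and d != node_id}
        kept_edges = {(s, d) for s, d in kept_edges if node_id not in (s, d)}
        kept_edges |= {(p, s) for p in preds for s in succs}
        del kept_nodes[node_id]
    return kept_nodes, kept_edges


# Hypothetical statement-level flow graph.
nodes = {0: "x = read()", 1: "print('debug')", 2: "tmp = x", 3: "send(x)"}
edges = {(0, 1), (1, 2), (2, 3)}
reduced_nodes, reduced_edges = prune(
    nodes, edges, lambda text: text.startswith("print(")
)
```

Removing the `print` node leaves three nodes and rewires `0 → 1 → 2` into `0 → 2`, a 25% node reduction on this toy input. Whether such pruning preserves downstream accuracy is an empirical question that the cited evaluations address.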

5. Interfacing, Export, and Query Mechanisms

Code graph extractors provide APIs and formats for downstream consumption.
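
As one illustration of a language-agnostic export format, the sketch below serializes a labeled graph to plain JSON and reads it back. The document layout (`nodes` and `edges` lists with `id`, `kind`, `source`, `label`, `target` keys) is an assumption made for this example; it is not the protobuf schema of scg-cli nor the JSON-LD vocabulary of GraphGen4Code.

```python
import json


def to_node_link(nodes: dict, edges: list) -> str:
    """Serialize nodes (id -> kind) and (src, label, dst) edges to JSON text."""
    doc = {
        "nodes": [{"id": n, "kind": k} for n, k in sorted(nodes.items())],
        "edges": [{"source": s, "label": l, "target": t} for s, l, t in edges],
    }
    return json.dumps(doc, indent=2, sort_keys=True)


def from_node_link(text: str) -> tuple:
    """Inverse of to_node_link: rebuild the node mapping and edge list."""
    doc = json.loads(text)
    nodes = {n["id"]: n["kind"] for n in doc["nodes"]}
    edges = [(e["source"], e["label"], e["target"]) for e in doc["edges"]]
    return nodes, edges


# Hypothetical graph for illustration.
nodes = {"app": "mod", "app.run": "func", "app.helper": "func"}
edges = [("app.run", "calls", "app.helper")]
exported = to_node_link(nodes, edges)
restored_nodes, restored_edges = from_node_link(exported)
```

A round trip through the text form returns the same nodes and edges, which is the basic property any interchange format for code graphs needs before it can feed a graph database or an ML data loader.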

6. Downstream Applications and Research Impact

Code graph extractors are foundational in numerous software engineering and ML domains.

A plausible implication is that graph abstraction and configurable extraction workflows (as in CONCORD and PSG) represent emerging best practice, necessary to balance graph fidelity and tractability for large-scale ML-SE and repository analysis.

7. Future Directions and Challenges

Current challenges stem from scalability, incomplete code, semantic abstraction, and cross-language unification:

  • Incomplete/Non-Compilable Code: Fraunhofer CPG and COMEX utilize fuzzy parsers and flexible grammars to process partial, non-compilable, or even snippet-level code (Das et al., 2023, Weiss et al., 2022).
  • Semantic Lifting: The PSG approach advocates multi-level abstraction, automatically learning higher-order semantics from corpora, beyond rigid AST/CFG forms (Iyer et al., 2020).
  • Query Generalization: The integration of LLM agents for code analysis (CodexGraph), ML-based source/sink classification (QVoG), and template learning for vulnerabilities (AI4VA) indicates a shift toward hybrid pipelines that combine AI with code analysis (Liu et al., 7 Aug 2024, Liu et al., 12 Jun 2024, Suneja et al., 2020).
  • Interoperability: Protobuf and JSON-LD (scg-cli, GraphGen4Code) facilitate toolchain integration and platform-independent analytics (Borowski et al., 2023, Abdelaziz et al., 2020).
  • Graph-based Representation Learning: Observed performance advantages of graph-driven ML models highlight graph extraction as a critical research primitive in software engineering and code understanding (Cheng et al., 2021, Zeng et al., 2021, Suneja et al., 2020).

This suggests that configurable, scalable, and semantically expressive code graph extractors are essential to modern program analysis, ML4SE, and repository-level cognition. The continued evolution of multi-view, multi-level, and ML-integrated extraction pipelines is likely to further enhance code analytics and intelligent development workflows.
