Structural-Semantic Code Graph (SSCG)

Updated 3 December 2025

Structural-Semantic Code Graph (SSCG) is a heterogeneous, directed, and typed graph that represents both structural and semantic code relationships.
It integrates various dependency types such as containment, inheritance, invocation, and semantic similarity to facilitate multi-hop retrieval and comprehensive analysis.
Variants like CSBASG and multi-level SSCG demonstrate practical gains, including an average reduction of about 27% in node count and improved retrieval-augmented code generation.

A Structural-Semantic Code Graph (SSCG) is a heterogeneous, directed, and typed graph representation that encodes both structural and semantic dependencies among code elements within a codebase. SSCGs unify diverse types of code relationships—such as containment, inheritance, invocation, and semantic similarity—enabling fine-grained analysis, repository-aware code retrieval, and software comprehension. SSCGs are central in advanced retrieval-augmented code generation, repository-level code completion, semantic code analysis, and graph-based code comparison, providing a comprehensive abstraction that bridges symbolic program structure and vector-based machine learning systems (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025, Borowski et al., 2023, Borowski et al., 2023, Wu et al., 29 Feb 2024).

1. Formal Models and Key Abstractions

An SSCG is formally defined as a heterogeneous directed graph, with nodes and edges parameterized by types and attributes inherent to the targeted programming language or repository. In a general formulation (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025):

$G = (C, D, M, P, A)$, where
- $C$ is the set of code-element nodes (e.g., files, classes, functions, methods).
- $D\subseteq C\times C$ is the set of directed edges.
- $M$ is the set of node types.
- $P$ is the set of edge types.
- $A$ holds the per-node and per-edge attributes (e.g., names, signatures, file paths, embeddings, weights).

Type assignment functions:
- $\sigma:C\rightarrow M$ assigns node types.
- $\zeta:D\rightarrow P$ assigns edge types.

Each node may encode identity, summary code/text, a type tag, positional or path encodings, and pre-trained code embeddings. Each edge records relation type, source location, and optionally numeric weights (for similarity or dependency strength).
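The tuple definition above can be made concrete with a small container. The following is a minimal sketch, not an implementation from the cited papers: the class name `SSCG`, its method names, and the dictionary layout are assumptions chosen for illustration. Because $D\subseteq C\times C$, the sketch stores at most one typed edge per ordered node pair.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SSCG:
    """Illustrative container for G = (C, D, M, P, A); all names are hypothetical."""
    node_types: set[str]                                             # M
    edge_types: set[str]                                             # P
    sigma: dict[str, str] = field(default_factory=dict)              # sigma: C -> M
    zeta: dict[tuple[str, str], str] = field(default_factory=dict)   # zeta: D -> P
    attrs: dict[Any, dict[str, Any]] = field(default_factory=dict)   # A

    def add_node(self, node_id: str, node_type: str, **attributes: Any) -> None:
        if node_type not in self.node_types:
            raise ValueError(f"unknown node type: {node_type}")
        self.sigma[node_id] = node_type
        self.attrs[node_id] = dict(attributes)

    def add_edge(self, src: str, dst: str, edge_type: str, **attributes: Any) -> None:
        if edge_type not in self.edge_types:
            raise ValueError(f"unknown edge type: {edge_type}")
        if src not in self.sigma or dst not in self.sigma:
            raise KeyError("both endpoints must already be nodes")
        # One edge per ordered pair, matching D as a subset of C x C.
        self.zeta[(src, dst)] = edge_type
        self.attrs[(src, dst)] = dict(attributes)

    @property
    def nodes(self) -> set[str]:                                     # C
        return set(self.sigma)

    @property
    def edges(self) -> set[tuple[str, str]]:                         # D
        return set(self.zeta)
```

A graph that needs several differently typed edges between the same two nodes (for example both `contain` and `invoke`) would have to key edges by `(src, dst, edge_type)` instead; that extension goes beyond the formal definition given here.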

A representative (non-exhaustive) set of SSCG node and edge types is provided in the following table:

Category	Representative values
Node types	file, class, function, method, package, field, interface, variable
Edge types	import, contain, inherit, invoke, semantically_similar, call, data_flow, declaration, extend, override, return_type
Attributes	name, signature, source snippet, path, embedding, node type

Semantic similarity edges are computed via code/text embeddings and cosine similarity. Structural edges encode explicit static dependencies (e.g., syntactic containment, inheritance, invocation, file imports) (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025, Borowski et al., 2023, Borowski et al., 2023).
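As a rough illustration of how semantic similarity edges might be derived, the sketch below computes pairwise cosine similarity over precomputed embeddings and keeps pairs whose score exceeds a threshold. The function names, the default threshold of 0.8, and the choice to emit the edge in both directions are assumptions for this example; the embeddings are taken as given rather than produced by any particular model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(embeddings, threshold=0.8):
    """Return (src, dst, score) triples for node pairs scoring above the threshold.

    `embeddings` maps a node id to its vector. The threshold value is an
    assumption; real systems tune it per model and corpus.
    """
    edges = []
    for (a, vec_a), (b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine(vec_a, vec_b)
        if score > threshold:
            # Similarity is symmetric, so a directed graph gets both directions.
            edges.append((a, b, score))
            edges.append((b, a, score))
    return edges
```

The all-pairs loop is quadratic in the number of nodes, so a large repository would more plausibly use an approximate nearest-neighbour index to propose candidate pairs before scoring them.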

SSCGs generalize to complex settings, capturing repository-level structure (file organization, ASTs, call graphs, inheritance hierarchies, and data flow) and can be extended by language-specific edge or node types (Wang et al., 7 Sep 2025, Wu et al., 29 Feb 2024).

2. Extraction Pipelines and Representation Construction

The construction of an SSCG typically involves the following steps (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025, Borowski et al., 2023, Borowski et al., 2023):

Parsing & Symbol Resolution: Parse all source files (e.g., via tree-sitter, JavaParser), build ASTs, resolve bindings, types, modifiers, and references.
Node Extraction: For each significant code element (file, class, function, method, etc.), instantiate a node with relevant attributes.
Edge Extraction:
- Containment: Parent→child (e.g., file to class, class to method).
- Dependency Edges: Import (file), inherit (class), invoke/call (function/method), type_reference (field/type).
- Control/Data Flow: Definition-use relations, formal/actual parameter mapping, return type.
- Semantic Similarity: Compute embeddings using a pretrained model (e.g., CodeT5), insert an edge for pairs exceeding a threshold similarity score (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025).
Attribute Annotation: Populate node and edge attributes—identifiers, code signatures, file location, embeddings, weights.
Graph Assembly: Merge subgraphs from different levels (file, function, AST, call graph, inheritance graph, data-flow graph) into a unified SSCG, linking entities via cross-level edges (e.g., file→function, function→AST root) (Wang et al., 7 Sep 2025).
Persistence: Serialize to portable formats (e.g., Protocol Buffers), write to disk or graph database (e.g., Neo4j) (Borowski et al., 2023, Borowski et al., 2023).
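The node and edge extraction steps above can be sketched for a single file. This example swaps in Python's standard `ast` module in place of the parsers named above (tree-sitter, JavaParser) and targets Python source only; the function name `extract_edges` and the node-id scheme are assumptions. It performs no symbol resolution, so `invoke` and `inherit` targets are bare names that may not correspond to any extracted node.

```python
import ast


def extract_edges(source, file_name):
    """Extract contain / inherit / invoke / import edges from one Python file.

    Returns (nodes, edges), where nodes maps an id to a node type and edges is
    a list of (src, dst, edge_type). Targets are unresolved names (assumption:
    a real pipeline would resolve bindings before emitting edges).
    """
    tree = ast.parse(source)
    nodes = {file_name: "file"}
    edges = []

    def visit(parent_id, body):
        for stmt in body:
            if isinstance(stmt, ast.ClassDef):
                nodes[stmt.name] = "class"
                edges.append((parent_id, stmt.name, "contain"))
                for base in stmt.bases:
                    if isinstance(base, ast.Name):
                        edges.append((stmt.name, base.id, "inherit"))
                visit(stmt.name, stmt.body)
            elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
                in_class = nodes.get(parent_id) == "class"
                func_id = f"{parent_id}.{stmt.name}" if in_class else stmt.name
                nodes[func_id] = "method" if in_class else "function"
                edges.append((parent_id, func_id, "contain"))
                # Calls in nested functions are attributed to the outer function.
                for node in ast.walk(stmt):
                    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                        edges.append((func_id, node.func.id, "invoke"))
            elif isinstance(stmt, ast.Import):
                for alias in stmt.names:
                    edges.append((file_name, alias.name, "import"))
            elif isinstance(stmt, ast.ImportFrom) and stmt.module:
                edges.append((file_name, stmt.module, "import"))

    visit(file_name, tree.body)
    return nodes, edges
```

Attribute calls such as `obj.method()` are deliberately skipped here, because mapping them to a definition requires the type and binding information that the Parsing & Symbol Resolution step is responsible for.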

Some implementations, such as CSBASG for Alloy predicates, use more specialized encoding schemes: nodes collapse semantically equivalent AST labels, and edges carry complex weights that enforce a structural balance condition. The resulting graph represents the AST uniquely, guarantees lossless recovery, and reduces node count by 27.25% on average relative to the AST (Wu et al., 29 Feb 2024).

3. Retrieval, Reasoning, and Downstream Utilization

SSCGs are leveraged by various systems for repository-level code retrieval, code generation, and comprehension. Typical retrieval and reasoning mechanisms include (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025):

Multi-hop Graph Traversal: Given “seed” nodes (e.g., mapped from requirements), agents traverse the SSCG using allowed edge types (structural, semantic) and a hop limit, iteratively expanding the frontier based on relevance scores. Agent actions include expanding traversal or stopping to emit final results.
Hybrid Graph Retrieval: Both semantic (embedding similarity) and structural (GNN-based) similarity are used to retrieve candidate subgraphs; scores are combined and reranked using mechanisms such as Maximal Marginal Relevance to enhance diversity (Wang et al., 7 Sep 2025).
Structural Fusion: Retrieved subgraphs are fused with the query graph through node-feature aggregation and cross-graph alignment edges; the resulting graph can be serialized for LLM input (e.g., as a sequence of structured triples) (Wang et al., 7 Sep 2025).
Metric Analysis and Querying: Graph analytics—centrality, clustering, partitioning, and ranking—enable identification of crucial entities, code clones, or structural hotspots (Borowski et al., 2023, Borowski et al., 2023).
Reasoning Loop: ReAct-style loops interleave “Thought → Action → Observation” using SSCG-traversals, supporting iterated reasoning and contextual retrieval (Li et al., 14 Apr 2025).
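The multi-hop traversal described above reduces, in its simplest form, to a bounded breadth-first expansion restricted to permitted edge types. The sketch below shows only that core; it omits the relevance scoring and the agent's expand-or-stop decisions, and the function name and edge-triple layout are assumptions for illustration.

```python
from collections import deque


def multi_hop_retrieve(edges, seeds, allowed_types, max_hops):
    """Breadth-first expansion from seed nodes over allowed edge types.

    `edges` is an iterable of (src, dst, edge_type). Returns a dict mapping each
    reached node to its hop distance from the nearest seed. Relevance-based
    pruning is intentionally left out of this sketch.
    """
    adjacency = {}
    for src, dst, edge_type in edges:
        if edge_type in allowed_types:
            adjacency.setdefault(src, []).append(dst)

    hops = {seed: 0 for seed in seeds}
    frontier = deque(seeds)
    while frontier:
        node = frontier.popleft()
        if hops[node] == max_hops:
            continue  # hop limit reached; do not expand further from here
        for neighbour in adjacency.get(node, []):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                frontier.append(neighbour)
    return hops
```

A relevance-aware variant could replace the queue with a priority queue keyed by a score and drop candidates below a cutoff, which is closer in spirit to the frontier expansion described in the list.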

Key metrics and analyses enabled by SSCG-based pipelines include centrality measures (degree, eigenvector, PageRank, betweenness), density, clustering coefficients, partitioning (e.g., via gpmetis, for modularity), and project structure summaries (Borowski et al., 2023).
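Two of the centrality measures listed above are easy to sketch without external libraries. The following is a plain power-iteration PageRank together with an out-degree count, assuming an untyped edge list of `(src, dst)` pairs; the damping factor of 0.85 and the fixed iteration count are conventional defaults chosen for this example, not values taken from the cited work.

```python
def out_degree(nodes, edges):
    """Count outgoing edges per node (e.g., outgoing calls per method)."""
    degree = {v: 0 for v in nodes}
    for src, _ in edges:
        degree[src] += 1
    return degree


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """PageRank by power iteration; every edge endpoint must appear in `nodes`."""
    nodes = list(nodes)
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for dst in out_links[v]:
                new_rank[dst] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank
```

On a call graph, a high out-degree flags entities that depend on many others, while a high PageRank flags entities that many others depend on, directly or transitively.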

4. Variants and Specializations: CSBASG and Multi-level SSCGs

Distinct instantiations of SSCG appear across research contexts:

CSBASG in AlloyASG (Wu et al., 29 Feb 2024): Instantiates SSCG as a compact, lossless, structurally balanced graph encoding Alloy predicates. Nodes collapse identical AST labels; edge magnitudes are integers that encode link multiplicity and position, and the edge weights are complex numbers parameterized by signature differences. The adjacency matrix $A=[a_{ij}]$ satisfies the balance constraint $a_{ij}=|a_{ij}|e^{\mathrm{i}(\theta_i-\theta_j)}$, where $\mathrm{i}$ is the imaginary unit and $\theta_i$, $\theta_j$ are the phases associated with nodes $i$ and $j$; this permits spectral methods and precise AST reconstruction. Empirical results show a 27% reduction in node count versus ASTs.
Multi-level SSCG in GRACE (Wang et al., 7 Sep 2025): Unifies hierarchical code representations—file-system structure, ASTs, call graphs, inheritance graphs, and data-flow graphs—into a single SSCG. Node features concatenate text/code embeddings with Laplacian positional encodings. Cross-layer edges tie together function definitions, type usages, and control/data flow, enabling repository-wide semantic and structural retrieval for LLM-augmented code completion.
SCG/SSCG in scg-cli and Borowski et al. (Borowski et al., 2023, Borowski et al., 2023): Implements an SSCG called the Semantic Code Graph for Java/Scala comprehension and analysis. The graph captures static dependencies, supports flexible node and edge kinds (package, class, method, field, etc.; declaration, call, inheritance, data flow, etc.), and can be extracted using open-source tools for downstream analysis and partitioning.
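The balance constraint quoted for CSBASG, $a_{ij}=|a_{ij}|e^{\mathrm{i}(\theta_i-\theta_j)}$, can be checked numerically for any complex-weighted edge set once per-node phases are given. The sketch below does only that. It does not reproduce how CSBASG derives phases from signature differences; the function names and the sparse dictionary representation are assumptions for illustration.

```python
import cmath


def build_balanced(magnitudes, theta):
    """Build complex edge weights from magnitudes and per-node phases.

    `magnitudes` maps (i, j) to a non-negative magnitude and `theta` maps a node
    to its phase in radians. The phase values are arbitrary inputs here.
    """
    return {
        (i, j): m * cmath.exp(1j * (theta[i] - theta[j]))
        for (i, j), m in magnitudes.items()
    }


def is_structurally_balanced(adjacency, theta, tol=1e-9):
    """Check a_ij == |a_ij| * exp(i * (theta_i - theta_j)) for every stored edge."""
    for (i, j), a_ij in adjacency.items():
        expected = abs(a_ij) * cmath.exp(1j * (theta[i] - theta[j]))
        if abs(a_ij - expected) > tol:
            return False
    return True
```

Under this constraint the phase of every edge is fixed by its endpoints, so flipping the sign of a single weight (a phase shift of $\pi$) is enough to violate balance.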

5. Empirical Results and Comparative Analysis

Recent work employing SSCGs has demonstrated measurable benefits in multiple domains:

Retrieval-augmented code generation: GraphCodeAgent, using a dual SSCG/requirement graph pipeline, outperforms baselines on repo-level generation tasks by enabling systematic multi-hop retrieval of relevant code, supporting both explicit and implicit dependencies (Li et al., 14 Apr 2025).
Repository-aware code completion: GRACE, which centers on a unified SSCG graph and hybrid graph retrieval, achieves relative gains of +8.19% Exact Match and +7.51% Edit Similarity over the best prior graph-RAG baselines in multi-language code completion tasks (Wang et al., 7 Sep 2025).
Code comprehension and refactoring: Empirical validation on public Java/Scala projects has shown that SCGs enable identification of critical entities (e.g., methods with abnormally high outgoing call degrees), code clones, and structural hotspots. In a user survey, respondents judged the SCG-based rankings to be the most maintenance-critical 64% of the time, compared with 25% for class collaboration networks and 10% for call graphs (Borowski et al., 2023).
Compactness and comparability: On Alloy models, CSBASG reduced node count by 27.25% on average compared to ASTs, while supporting fine-grained graph comparison and mutation-based repair workflows (Wu et al., 29 Feb 2024).

6. Applications, Limitations, and Research Directions

SSCGs have established roles in:

Retrieval-augmented generation and code synthesis: Enabling multi-hop graph-guided retrieval for LLM-driven code agents (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025).
Repository-level code search and structural reasoning: Supporting context-dependent completion and robust code similarity detection (Wang et al., 7 Sep 2025, Borowski et al., 2023).
Software comprehension and refactoring: Facilitating project overviews, maintenance prioritization, visualization, and modularity assessment (Borowski et al., 2023, Borowski et al., 2023).
Automated model repair and grading for declarative languages: Leveraging edge-level graph mutation and spectral properties (Wu et al., 29 Feb 2024).

Known limitations include partial coverage of dynamic dependencies (reflection, runtime-loaded modules), language coverage (most implementations target a subset of typed/static languages), and scaling to very large codebases. Prospective directions include integrating richer control/data flow edges, version history change-impact analysis, expanded language support, and deeper IDE or LLM integration.

7. Significance and Outlook

The SSCG framework provides a unifying, extensible representation that subsumes classical dependency and collaboration graphs, incorporates semantic similarity, and supports compositional reasoning at multiple levels of granularity, from files down to AST nodes. By bridging symbolic program structure and learned vector representations, SSCGs serve as core infrastructure for the code retrieval, comprehension, and synthesis systems surveyed above. Ongoing research is expected to further strengthen the theoretical underpinnings, computational scalability, and practical adoption of SSCGs across programming languages and software engineering pipelines (Li et al., 14 Apr 2025, Wang et al., 7 Sep 2025, Wu et al., 29 Feb 2024, Borowski et al., 2023, Borowski et al., 2023).