
Semantic Risk Code Graph (SRCG)

Updated 30 April 2026
  • Semantic Risk Code Graph (SRCG) is a typed, annotated graph that captures semantically meaningful risk relationships between code elements.
  • It integrates static analysis, control/data-flow extraction, and rule-based semantic labeling to enable explainable threat detection.
  • SRCG-based methods have demonstrated high accuracy in smart contract risk detection and efficient malicious code localization in large software projects.

A Semantic Risk Code Graph (SRCG) is a representation of code as a typed, annotated graph designed to explicitly capture semantically meaningful risk or threat relationships between code elements. SRCG-based approaches integrate static code structure, data/control-flow, and rule-based or learned semantic labels, serving as a substrate for advanced program analysis, explainable threat detection, and graph-based representation learning. Prominent applications include smart contract risk detection—most notably for rug-pull schemes in crypto tokens—and broader malicious code localization in large software distributions (Wu et al., 23 Jun 2025, Gao et al., 19 Jan 2026).

1. Formal Definition

The SRCG is formally defined as a directed, labeled graph whose nodes correspond to program elements and whose edges model semantically significant relations (such as control-flow, data-flow, definition, call, inheritance, or decoration, depending on the language and context):

  • For Ethereum smart contracts, the SRCG is defined as

$$\mathrm{SRCG} = (V, E, \Phi),$$

where $V$ is the set of risk-related code elements (typically one-to-one with IR basic blocks), $E \subseteq V \times V$ is the union of control-flow and data-flow edges, and $\Phi: R \to 2^{\{\text{labels on } V \cup E\}}$ maps risk-detection rules $R$ to semantic attributes stamped on nodes or edges when a rule fires (Wu et al., 23 Jun 2025).

  • For general-purpose code (e.g., Python projects), SRCG is defined as

$$G = (\mathcal{V}, \mathcal{E}),$$

with nodes for modules, classes, functions, and expressions, and edges for definition, inheritance, decoration, and calls. Node features $x_v$ combine learned embeddings of identifier, syntactic type, and a multi-hot vector of “sensitive behaviors” corresponding to rules gathered via LLM prompting and data-driven mining (Gao et al., 19 Jan 2026).

This abstraction permits encoding both explicit program structure and high-level semantic risks.
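
As a minimal sketch of the smart-contract definition above, the triple $(V, E, \Phi)$ can be held in a small container. The class and method names (`SRCG`, `Edge`, `stamp`, `labels_of`) and the rule name `hidden_mint` are hypothetical and chosen only for illustration; neither cited paper prescribes this layout.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str  # "control" or "data", the two edge families in E


@dataclass
class SRCG:
    nodes: set[str] = field(default_factory=set)    # V: basic-block ids
    edges: set[Edge] = field(default_factory=set)   # E: subset of V x V
    # Phi: rule name -> set of (node-or-edge, label) pairs stamped when it fires
    phi: dict[str, set[tuple[object, str]]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str, kind: str) -> Edge:
        self.nodes.update((src, dst))
        edge = Edge(src, dst, kind)
        self.edges.add(edge)
        return edge

    def stamp(self, rule: str, element: object, label: str) -> None:
        """Record that `rule` fired on `element` (a node id or an Edge)."""
        self.phi.setdefault(rule, set()).add((element, label))

    def labels_of(self, element: object) -> set[str]:
        """Collect every label any rule has stamped on `element`."""
        return {lab for stamped in self.phi.values()
                for el, lab in stamped if el == element}


# Illustrative usage with made-up block ids and a made-up rule name.
g = SRCG()
e = g.add_edge("b0", "b1", "control")
g.stamp("hidden_mint", "b1", "critical")
g.stamp("hidden_mint", e, "critical")
```

Keeping $\Phi$ keyed by rule, rather than storing labels directly on nodes, preserves which rule produced each label, which is the provenance that later explanation steps rely on.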

2. Construction Methodology

SRCG construction integrates structural program analysis with declarative or learned risk semantics. Construction workflows differ based on target environment:

  • Smart Contracts (RPHunter):
    1. Bytecode Lifting: Bytecode is decompiled (using Gigahorse) into IR basic blocks.
    2. Control and Data Flow Extraction: Control-flow (CFG) and data-flow edges are extracted via analysis of PUSH/POP/STORE/LOAD instructions.
    3. Rule Application: Declarative Datalog-style rules, expressing observed backdoor or risk patterns, are executed on the code graph. Each rule is a conjunction over flow and value-comparison predicates; nodes and edges involved become “critical.”
    4. Flow Analysis: Data-flow reachability searches locate the propagation of tainted or privileged values, iteratively expanding the set of critical blocks and flows until a fixpoint.
    5. Label Generation: Nodes and edges are augmented with “critical,” “invocation” (internal call), “dependent,” or “normal” labels according to rule matches and flow analysis (Wu et al., 23 Jun 2025).
  • General-purpose Code:
    1. Parse source files into ASTs.
    2. Construct nodes for file/module, class, function, and, optionally, expression elements.
    3. Instantiate four edge types: definition, inheritance, decoration, and call relations.
    4. Extract behavioral features for each node via LLM-driven rule mining and static analysis.
    5. Annotate nodes with multi-hot vectors reflecting matched sensitive rules (Gao et al., 19 Jan 2026).
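
Steps 4 and 5 of the smart-contract workflow can be sketched as a worklist computation. This is an illustration under stated assumptions, not RPHunter's implementation: propagation is assumed to run forward along data-flow edges, "critical" is assumed to take priority over "invocation" when both apply, and the function names and block ids are hypothetical.

```python
from collections import deque


def expand_critical(data_flow, seeds):
    """Grow the critical set along data-flow edges until nothing changes.

    data_flow: dict mapping a block id to the blocks its values flow into.
    seeds: blocks already marked critical by a rule match (step 3).
    """
    critical = set(seeds)
    critical_edges = set()
    worklist = deque(seeds)
    while worklist:  # terminates: each block is enqueued at most once
        block = worklist.popleft()
        for succ in data_flow.get(block, ()):
            critical_edges.add((block, succ))
            if succ not in critical:
                critical.add(succ)
                worklist.append(succ)
    return critical, critical_edges


def label_nodes(all_blocks, critical, internal_calls):
    """Assign one node label per block (assumed priority: critical first)."""
    labels = {}
    for block in all_blocks:
        if block in critical:
            labels[block] = "critical"
        elif block in internal_calls:
            labels[block] = "invocation"
        else:
            labels[block] = "normal"
    return labels


# Toy graph: a rule fired on b0; b3 is an internal call unrelated to it.
flow = {"b0": ["b1"], "b1": ["b2"], "b3": ["b4"]}
crit, crit_edges = expand_critical(flow, ["b0"])
node_labels = label_nodes(["b0", "b1", "b2", "b3", "b4"], crit, {"b3"})
```

The fixpoint is reached when the worklist empties, that is, when no further block can be added to the critical set.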

This methodology yields SRCGs that are semantically rich and adaptable to both domain-specific and general tasks.
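
For the general-purpose workflow, steps 1 to 3 can be approximated with Python's standard `ast` module. The sketch below is an assumption-laden illustration: it records base classes, decorators, and callees by their source text without resolving names, omits expression-level nodes, and uses a hypothetical `build_graph` helper. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


def build_graph(source, module_name):
    """Return (nodes, edges) with the four edge types named in the text."""
    nodes = {module_name: "module"}
    edges = set()

    def name_of(expr):
        # Textual name of a base class, decorator, or callee (unresolved).
        if isinstance(expr, ast.Call):
            expr = expr.func
        return ast.unparse(expr)

    def visit(node, owner):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            qual = f"{owner}.{node.name}"
            nodes[qual] = "class" if isinstance(node, ast.ClassDef) else "function"
            edges.add((owner, qual, "definition"))
            for base in getattr(node, "bases", []):  # only classes have bases
                edges.add((qual, name_of(base), "inheritance"))
            for deco in node.decorator_list:
                edges.add((name_of(deco), qual, "decoration"))
            for stmt in node.body:  # calls in the body belong to this node
                visit(stmt, qual)
            return
        if isinstance(node, ast.Call):
            edges.add((owner, name_of(node), "call"))
        for child in ast.iter_child_nodes(node):
            visit(child, owner)

    visit(ast.parse(source), module_name)
    return nodes, edges


SAMPLE = '''
import os

class Base:
    pass

class Loader(Base):
    @staticmethod
    def run(cmd):
        os.system(cmd)
'''

nodes, edges = build_graph(SAMPLE, "demo")
```

A production pipeline would also need import and scope resolution so that `Base` maps to `demo.Base`; that step is left out here to keep the sketch short.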

3. Node, Edge, and Label Schema

SRCGs employ a discrete node/edge labeling scheme to reflect semantic context and risk provenance:

| Context | Node Types | Edge Types | Label Semantics |
|---|---|---|---|
| Smart contracts (Wu et al., 23 Jun 2025) | critical, invocation, normal | critical, dependent, normal | Backdoor check, call, or neutral control/data flow |
| General code (Gao et al., 19 Jan 2026) | module, class, function, expression | def/inherit/decorate/call | Sensitive behaviors, structure, decorations |

  • Smart contract nodes labeled “critical” participate in dangerous logic (sale/balance restriction, hidden mint), “invocation” nodes denote internal calls, and “normal” nodes are generic blocks. “Critical” edges reflect rule-matched flows; “dependent” and “normal” edges further differentiate context.
  • General software SRCGs annotate nodes with behavior vectors denoting semantic or security-relevant properties mined by LLMs and data-driven methods.

Semantic labeling enables later explainability and targeted analysis, distinguishing SRCGs from raw CFGs or ASTs.
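
To illustrate the multi-hot behavior vectors used for general code, the sketch below maps the calls made by a node to a fixed rule list. The rule names and the call-to-rule table are hypothetical placeholders; in the cited work the rules come from LLM prompting and data-driven mining rather than a hand-written table.

```python
# Hypothetical rulebase; order fixes the position of each bit in the vector.
RULES = ["network_access", "process_spawn", "file_write", "dynamic_eval"]

# Hypothetical mapping from callee names to the rule they trigger.
CALL_TO_RULE = {
    "urllib.request.urlopen": "network_access",
    "os.system": "process_spawn",
    "subprocess.Popen": "process_spawn",
    "eval": "dynamic_eval",
    "exec": "dynamic_eval",
}


def multi_hot(matched, rules=RULES):
    """Encode a set of matched rule names as a 0/1 vector over `rules`."""
    return [1 if rule in matched else 0 for rule in rules]


def behavior_vector(callees):
    """Build a node's sensitive-behavior vector from the calls it makes."""
    matched = {CALL_TO_RULE[c] for c in callees if c in CALL_TO_RULE}
    return multi_hot(matched)


vec = behavior_vector(["os.system", "eval", "print"])
```

Here `vec` is `[0, 1, 0, 1]`: the node spawns a process and evaluates dynamic code, while `print` matches no rule.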

4. SRCG for Representation Learning and Threat Detection

SRCGs serve as input to specialized GNNs for risk detection and localization:

  • Node Embedding: In smart contracts, each basic block’s opcodes are embedded via a pre-trained 12-layer BERT ($\mathbf{h}_0(b)\in\mathbb{R}^{36}$) (Wu et al., 23 Jun 2025). In general code, node features combine embeddings of names and types with multi-hot sensitive-behavior vectors (Gao et al., 19 Jan 2026).
  • Graph Neural Networks:

    • Smart contract SRCGs adopt a relational GCN treating labeled edge types (critical/dependent/normal) as distinct relations, executing two rounds of message passing to produce graph-level embeddings $H_{\mathrm{code}}$ for classification (Wu et al., 23 Jun 2025).
    • In broader contexts, a two-layer GCN performs package-level classification, with hidden states updated as

    $$h_v^{(l+1)} = \sigma \left( \sum_{u\in\mathcal{N}(v)} W^{(l)} h_u^{(l)} + b^{(l)} \right),$$

    and binary cross-entropy loss for package-level maliciousness (Gao et al., 19 Jan 2026).

  • Explainability: SRCGs enable GNN-based explainers through mask optimization (GNNExplainer framework). Attention scores identify high-risk nodes and edges, yielding high-attention subgraphs that pinpoint suspicious logic and inform downstream LLM explanations (Gao et al., 19 Jan 2026).

This architecture supports large-scale, explainable risk detection with improved accuracy and interpretability over purely pattern- or transaction-based approaches.
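
The GCN update rule quoted above can be traced by hand with a dependency-free sketch. It assumes ReLU for $\sigma$ and adds no self-loops or degree normalization, since the formula as stated includes neither; the weights, features, and node ids are made-up values for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, v) for v in x]


def gcn_layer(h, neighbors, W, b):
    """h_v' = relu( sum_{u in N(v)} W h_u + b ) for every node v."""
    out = {}
    for v in h:
        agg = [0.0] * len(b)
        for u in neighbors.get(v, ()):
            msg = matvec(W, h[u])
            agg = [a + m for a, m in zip(agg, msg)]
        out[v] = relu([a + bi for a, bi in zip(agg, b)])
    return out


# Three-node toy graph with 2-dimensional features.
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nbrs = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
W = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, 0.5]

h1 = gcn_layer(h0, nbrs, W, b)  # first round of message passing
h2 = gcn_layer(h1, nbrs, W, b)  # second round, reusing weights for brevity
```

For node `a`, the neighbor messages are `[-1.0, 0.5]` and `[0.0, 1.0]`, so the pre-activation is `[-1.0, 2.0]` and `h1["a"]` is `[0.0, 2.0]`. A real two-layer model would use separate weights per layer and a readout over all nodes before the package-level classifier.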

5. Applications and Empirical Performance

The SRCG paradigm underpins state-of-the-art systems across domains:

  • Smart Contract Rug-Pull Detection (RPHunter): SRCG-based analysis detects fine-grained backdoor code paths (e.g., sale restriction, hidden mint, balance tampering) with 95.3% precision and 93.8% recall on a ground-truth set of 645 rug-pull tokens (F1 = 94.5%, FPR = 1.8%) (Wu et al., 23 Jun 2025). Because the analysis operates on contract code, detection is possible before any suspicious transaction activity occurs.
  • General Malicious Code Localization: SRCG + GNN pipelines reduce the context needed for LLM-based code scrutiny by more than two orders of magnitude (640 tokens vs. 250k tokens for large Python packages), with reported detection accuracy of 95.6% on MalCP and 94.3% on Backstabbers and an LLM explanation score of ≈3.2 for factual explanation quality (Gao et al., 19 Jan 2026).

Key advantages include proactive vulnerability detection, fine-grained semantic localization, and lower false-positive rates than static pattern matching.
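
As a quick arithmetic check on the figures reported above, F1 is the harmonic mean of precision and recall, and the context reduction is the ratio of the two token counts. The helper name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# RPHunter figures quoted in the text: precision 95.3%, recall 93.8%.
f1 = f1_score(0.953, 0.938)

# Context sizes quoted in the text: 250k tokens reduced to 640 tokens.
reduction = 250_000 / 640
```

`f1` evaluates to roughly 0.945, matching the reported F1 of 94.5%, and `reduction` is 390.625, which is consistent with the "two orders of magnitude" description.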

6. Limitations and Future Directions

A core limitation of SRCG-based methods is their reliance on the correctness and granularity of underlying code representations and flow analyses:

  • Decompilation and IR analysis tools (e.g., Gigahorse) may struggle with highly obfuscated bytecode or adversarial EVM code, potentially allowing sophisticated attacks to evade critical block/edge marking (Wu et al., 23 Jun 2025).
  • For general code, coverage of sensitive behaviors depends on rule mining and LLM prompt engineering; uncommon or novel malicious patterns may be missed if not represented in the rulebase or training data (Gao et al., 19 Jan 2026).

Future work may include more robust decompilation, richer rule languages, integration with dynamic/tracing data, and transfer learning for unseen attack variants.

7. Relation to Other Graph-based Code Representations

The SRCG extends standard static code representations (CFGs, data-flow graphs, ASTs) by integrating context-sensitive, risk-focused annotation and providing a bridge to explainable AI analyses:

  • Unlike pattern-only or transaction-only checkers, SRCG architectures contextualize operations within the actual program flow, improving specificity and lowering false-positive rates (e.g., a reported 1.8% FPR vs. 10–50% for naïve static checkers) (Wu et al., 23 Jun 2025).
  • SRCGs have been shown to amplify LLM-based code audits, focusing expensive token-limited attention windows on the most suspicious subgraphs, thus bolstering both interpretability and computational efficiency (Gao et al., 19 Jan 2026).

This hybrid semantic-structural approach is foundational for explainable software risk analysis in both specialized and general-purpose security domains.
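
The idea of focusing a token-limited LLM window on the most suspicious subgraph can be sketched as a budgeted selection over per-node scores. The greedy strategy, the function name `select_context`, and the scores and token counts below are assumptions for illustration; the cited work does not specify this exact procedure.

```python
def select_context(scores, token_counts, budget=640):
    """Greedily keep the highest-scoring nodes whose code fits the budget.

    scores: node id -> suspicion score from the explainer (higher is riskier).
    token_counts: node id -> number of tokens in that node's source code.
    """
    chosen, used = [], 0
    for node in sorted(scores, key=scores.get, reverse=True):
        cost = token_counts[node]
        if used + cost <= budget:  # skip nodes that would overflow the window
            chosen.append(node)
            used += cost
    return chosen, used


# Made-up scores and sizes for four functions in a package.
scores = {"setup.run": 0.91, "utils.fmt": 0.05, "net.fetch": 0.77, "cli.main": 0.30}
tokens = {"setup.run": 300, "utils.fmt": 120, "net.fetch": 280, "cli.main": 200}

picked, used_tokens = select_context(scores, tokens)
```

With these values the two highest-scoring functions are kept (`setup.run` and `net.fetch`, 580 tokens in total) and the remaining two are dropped because either would exceed the 640-token budget.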
