VulRTex: Automated Vulnerability Detection

Updated 19 November 2025
  • VulRTex is a reasoning-guided framework that automates vulnerability detection using chain-of-thought LLM reasoning and retrieval-augmented case guidance.
  • It employs a four-stage pipeline including preprocessing, reasoning graph construction, random-walk subgraph retrieval, and LLM-based classification.
  • Experimental results demonstrate superior performance and early detection capabilities, accelerating vulnerability disclosure in major open-source projects.

VulRTex is a reasoning-guided framework designed to automate the identification of vulnerability-related issue reports (IRs), particularly leveraging rich-text elements such as screenshots and code snippets. By integrating chain-of-thought (CoT) LLM reasoning with retrieval-augmented guidance based on a database of historical vulnerability cases, VulRTex achieves state-of-the-art performance for vulnerability detection and CWE-ID classification in open-source software issue repositories, notably excelling under severe data imbalance (Jiang et al., 4 Sep 2025).

1. System Architecture and Component Workflow

VulRTex orchestrates a four-stage pipeline encompassing preprocessing, reasoning database construction, case retrieval and guidance prompt generation, and LLM-based vulnerability identification.

Preprocessing Module

  • Crawls and normalizes raw IRs, tagging rich-text modalities: [SCR] for screenshots, [CODE] for code snippets.
  • Emits unified JSON records for subsequent modules, as sketched below.
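A minimal sketch of the tagging step; the regexes, field names, and `preprocess_ir` helper are illustrative assumptions, not the paper's implementation.

```python
import json
import re

# Markdown image links become screenshots; fenced blocks become code snippets.
IMG_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]+\)")
CODE_PATTERN = re.compile(r"```.*?```", re.DOTALL)

def preprocess_ir(ir_id, title, body):
    """Replace screenshots/code with [SCR]/[CODE] tags and emit a JSON record."""
    screenshots = IMG_PATTERN.findall(body)
    snippets = CODE_PATTERN.findall(body)
    text = CODE_PATTERN.sub("[CODE]", IMG_PATTERN.sub("[SCR]", body))
    return json.dumps({
        "id": ir_id,
        "title": title,
        "text": text.strip(),
        "screenshots": screenshots,   # kept for later screenshot analysis
        "code_snippets": snippets,    # kept for later code analysis
    })

fence = "`" * 3  # avoids literal fences inside this listing
demo = f"Steps:\n![crash](img.png)\n{fence}js\nalert(1)\n{fence}"
print(preprocess_ir("147", "XSS in terminal", demo))
```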

Vulnerability Reasoning Database Construction

  • For each historical IR, an LLM CoT agent constructs a reasoning graph $\mathcal{G}$ detailing stepwise how the vulnerability is substantiated by rich-text.
  • Each step is fact-checked and augmented using external Vulnerability Awareness (VA) datasets (e.g., VDISC, OWASP), employing TF–IDF-based retrieval to inject golden facts and correct hallucinated outputs.
  • Final graphs are persisted in the Reasoning Base; a sketch of the construction loop follows this list.
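A hedged sketch of the construction loop, assuming a generic `llm_step` callable that performs one CoT step and returns an (observation, action) pair; this interface is an assumption, not published code.

```python
from dataclasses import dataclass, field

# Edge labels follow the paper's action set.
ACTIONS = {"ScrAnalyzer", "CodeAnalyzer", "AgentTerminator"}

@dataclass
class ReasoningGraph:
    observations: list = field(default_factory=list)  # vertices: stepwise observations
    actions: list = field(default_factory=list)       # edges: tool choices

def generate_graph(ir_record, llm_step, max_steps=10):
    """Run the CoT agent until it emits AgentTerminator, collecting the graph."""
    graph = ReasoningGraph()
    context = []  # C_t = (O_1, A_1, ..., O_t), grown one step at a time
    for _ in range(max_steps):
        observation, action = llm_step(ir_record, context)
        assert action in ACTIONS, f"unexpected action: {action}"
        graph.observations.append(observation)
        graph.actions.append(action)
        context += [observation, action]
        if action == "AgentTerminator":  # vulnerability judgment reached
            break
    return graph
```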

Case Retrieval Engine

  • For a new IR, each historical reasoning graph is pruned via a weighted random walk (see Section 3 for the algorithmic details) to extract the most relevant subgraph.
  • Subgraphs are ranked against the target IR using TF–IDF similarity; top-$k$ cases (typically $k=3$) are selected as guides, as sketched after this list.
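As an illustration of the ranking step, a minimal scikit-learn sketch; cosine similarity over TF–IDF vectors and the `k=3` default are assumptions consistent with the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_cases(target_text, subgraph_texts, k=3):
    """Return indices of the k historical subgraphs most similar to the target IR."""
    matrix = TfidfVectorizer().fit_transform([target_text] + subgraph_texts)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sims.argsort()[::-1][:k].tolist()
```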

LLM Reasoning Module

  • Combines retrieved subgraph narratives into a guidance prompt $P_{\mathrm{guide}}$, concatenated with a classification prompt $P_{\mathrm{identify}}$.
  • The fused prompt is input to the LLM for binary vulnerability classification and CWE-ID prediction.

VulRTex High-Level Pipeline Pseudocode

\begin{algorithm}[H]
\caption{VulRTex Pipeline}
\KwIn{Raw historical IRs $\mathcal{H}$, target IRs $\mathcal{T}$}
\KwOut{For each $IR^{\text{tgt}} \in \mathcal{T}$: (vulnerability? label, predicted CWE-ID)}
// 1. Preprocessing
\ForEach{$IR$ in $\mathcal{H}$}{
   Clean, tag rich-text, output JSON.
}
\ForEach{$IR^{\text{tgt}}$ in $\mathcal{T}$}{
   Same preprocessing.
}
// 2. Reasoning DB Construction
\ForEach{$IR$ in $\mathcal{H}$}{
   $\mathcal{G} \leftarrow$ GenerateGraph($IR$)
   Reason\_Base.add($\mathcal{G}$)
}
// 3. Retrieval and Pruning
\ForEach{$IR^{\text{tgt}}$ in $\mathcal{T}$}{
   Retrieve top-$k$ subgraphs from Reason\_Base
}
// 4. LLM Reasoning
\ForEach{$IR^{\text{tgt}}$ in $\mathcal{T}$}{
   $P_{\mathrm{guide}} \leftarrow$ MakeGuidancePrompt($IR^{\text{tgt}}$, cases)
   $P_{\mathrm{identify}} \leftarrow$ classification prompt
   (label, CWE-ID) $\leftarrow$ LLM($P_{\mathrm{guide}} \oplus P_{\mathrm{identify}}$)
}
\Return results
\end{algorithm}

2. Vulnerability Reasoning Database: CoT Graph Construction and Fact Correction

Reasoning graphs $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ model the interpretive sequence for each historical IR:

  • Vertices $\mathcal{V}$ correspond to observations (summaries of comments, screenshots, or code).
  • Edges $\mathcal{E}$ represent actions taken by the LLM agent, selected from $\{\texttt{ScrAnalyzer}, \texttt{CodeAnalyzer}, \texttt{AgentTerminator}\}$.

At each step $t$, the agent conditions on the accumulated context

$$C_t = (O_1, A_1, O_2, A_2, \dots, O_t)$$

to select its next observation and action. To mitigate hallucination, each step's context is expanded with authoritative knowledge:

$$\mathrm{Know}_t = \{ k \in \mathcal{D} : \mathrm{TFIDF}(k, \mathrm{Desc}(C_t)) > \theta_{\mathrm{sim}} \}$$

where $\mathcal{D}$ aggregates data from KB, BigVul, OWASP, Debian, and VDISC.
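This thresholded retrieval can be sketched in a few lines; the scikit-learn vectorizer and the default `theta_sim` value are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_knowledge(context_desc, knowledge_base, theta_sim=0.3):
    """Return entries k in D with TFIDF(k, Desc(C_t)) > theta_sim."""
    matrix = TfidfVectorizer().fit_transform([context_desc] + knowledge_base)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return [k for k, s in zip(knowledge_base, sims) if s > theta_sim]
```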

A CoT prompt template is used:

Please think step by step. For each step, select related rich-text elements (comment, [SCR], [CODE]), output an "Observation", and choose a "Tool" (Action) to proceed. Terminate when vulnerability is identified.

Paths through these graphs encode distinct reasoning chains for each IR.

3. Case Retrieval: Random-Walk Subgraph Pruning and Guided Prompt Formulation

Pruning each historical reasoning graph for retrieval is defined by a weighted random walk:

  • Edge weights are computed as increments in TF–IDF similarity between the endpoints and the target IR: $$\tilde{M}_{i,j} = \mathrm{TFIDF}(O_j, IR^{\mathrm{tgt}}) - \mathrm{TFIDF}(O_i, IR^{\mathrm{tgt}})$$
  • Edge sampling probability: $$p(O_i, O_j) = \frac{\frac{1}{\deg(O_i)} + \frac{1}{\deg(O_j)}}{\sum_{(u,v) \in \mathcal{E}} \left( \frac{1}{\deg(u)} + \frac{1}{\deg(v)} \right)}$$
  • The process halts upon reaching an AgentTerminator edge; the visited nodes constitute the retrieved subgraph $\mathcal{G}_r$ (a pruning sketch follows this list).
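A minimal Python reading of the pruning procedure. Note that the stated distribution samples edges globally rather than stepping between adjacent nodes; `max_steps` is an assumed safety bound for graphs whose terminator edge is drawn late.

```python
import random
from collections import defaultdict

def prune_subgraph(edges, seed=None, max_steps=100):
    """Sample (u, v, action) edges by the degree-based distribution until an
    AgentTerminator edge is drawn; return the visited observation nodes."""
    rng = random.Random(seed)
    deg = defaultdict(int)
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    # Unnormalized p(O_i, O_j); rng.choices normalizes internally.
    weights = [1 / deg[u] + 1 / deg[v] for u, v, _ in edges]
    visited = set()
    for _ in range(max_steps):
        u, v, action = rng.choices(edges, weights=weights, k=1)[0]
        visited.update((u, v))
        if action == "AgentTerminator":
            break
    return visited

edges = [("O1", "O2", "ScrAnalyzer"), ("O2", "O3", "CodeAnalyzer"),
         ("O3", "O4", "AgentTerminator")]
print(prune_subgraph(edges, seed=0))
```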

Candidate subgraphs are ranked by TF–IDF similarity to the target IR’s full text and rich-text components; the top-$k$ are chosen.

Guidance prompts $P_{\mathrm{guide}}$ are constructed by concatenating path summaries from these top cases with procedural instructions for new IR analysis.

4. LLM-Guided Vulnerability Identification and Output Decoding

The LLM receives the prompt $P = P_{\mathrm{guide}} \oplus P_{\mathrm{identify}}$, where the identification portion is standardized:

Please classify whether the following IR contains a vulnerability (Yes/No). If Yes, output the CWE-ID. Format: {Yes/No}, {CWE-xxx}.

The model produces both a binary label and a CWE-ID. In an illustrative case (AntSword #147), the system robustly identifies XSS (CWE-79) by analogical reasoning along retrieved case paths. A decoding sketch follows.
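A sketch of prompt fusion and output decoding; `call_llm` stands in for whichever chat-completion API backs the deployment, and the parsing regex is an assumption about well-formed model output.

```python
import re

P_IDENTIFY = ("Please classify whether the following IR contains a vulnerability "
              "(Yes/No). If Yes, output the CWE-ID. Format: {Yes/No}, {CWE-xxx}.")

def identify(ir_text, p_guide, call_llm):
    """Fuse P_guide with P_identify, query the LLM, and decode its answer."""
    prompt = f"{p_guide}\n\n{P_IDENTIFY}\n\n{ir_text}"  # P = P_guide (+) P_identify
    answer = call_llm(prompt)
    is_vul = answer.strip().lower().startswith("yes")
    match = re.search(r"CWE-\d+", answer)
    return is_vul, match.group(0) if (is_vul and match) else None
```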

5. Experimental Design, Evaluation, and Results

Dataset: 973,572 GitHub IRs (2015–2023); 4,002 vulnerability-related IRs and 1,191,318 negative-class IRs. Exploiting rich-text is crucial: 39.1% of the vulnerability-related IRs contain [SCR] or [CODE] elements.

Split: 60% of the vulnerability-related IRs (2,401) are used for reasoning database construction and 40% (1,601) for target evaluation; the negative class remains highly imbalanced across both splits.

| Split | Total IRs | Rich-Text | [SCR] | [CODE] |
|---|---|---|---|---|
| Vul (hist) | 2,401 | 720 | 527 | 630 |
| Vul (tgt) | 1,601 | 966 | 670 | 767 |
| Non-Vul (hist) | 714,790 | 466,179 | 352,675 | 330,753 |
| Non-Vul (tgt) | 476,528 | 201,391 | 139,578 | 276,639 |

Metrics: Standard precision, recall, and F1; AUPRC and AUROC (addressing extreme class imbalance); Macro-F1 for multi-class CWE-ID prediction.
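For reference, all of these metrics are standard and computable with scikit-learn; the helper below is illustrative, not the authors' evaluation script.

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def report(y_true, y_pred, y_score, cwe_true, cwe_pred):
    """y_score holds positive-class probabilities; cwe_* hold CWE-ID labels."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),  # robust under imbalance
        "macro_f1_cwe": f1_score(cwe_true, cwe_pred, average="macro"),
    }
```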

Baselines: Logistic Regression, MLP, MemVul; zero-shot LLMs (LLaMA-2-13B, GPT-3, ChatGPT) with and without reasoning extensions (CoT-SC, ReAct, DS-Agent).

| Model Category | Method | Time (s/IR) | Precision | Recall | F1 | AUROC | AUPRC | Macro-F1 |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2 | +VulRTex | 7.9 | 74.5 | 75.3 | 74.9 | 97.9 | 65.9 | 70.4 |
| ChatGPT | +VulRTex | 11.7 | 90.2 | 87.7 | 88.9 | 99.2 | 80.4 | 83.7 |

VulRTex-augmented ChatGPT achieved F1 = 88.9%, AUPRC = 80.4%, and Macro-F1 = 83.7%, outperforming the best baseline by +11.0%, +20.2%, and +10.5% respectively. Reasoning latency was approximately half that of ReAct or DS-Agent.

Ablations revealed critical performance drops without [SCR] analysis (−17.6% F1), random-walk pruning (−19.2% F1), or fact-correction (−12.3% F1).

6. Real-World Deployment and Observed Effectiveness

Tracking 2,098 new IRs in 2024 across ten major OSS projects (e.g., React, Kubernetes, Hugo), VulRTex identified 30 vulnerability-related IRs; 11 (36.7%) of these were assigned CVE-IDs prior to public disclosure (mean lead time, 35 days). In several cases (e.g., Carla #7025 → CWE-119 → CVE-2024-33903), VulRTex surfaced vulnerabilities over 100 days before CVE assignment, illustrating its operational viability.

| Project | #2024 IRs | #Identified | #CVE-Assigned (%) |
|---|---|---|---|
| Rimedo-ts | 1 | 1 | 1/1 (100%) |
| Carla | 300 | 6 | 1/6 (16.7%) |
| React | 197 | 4 | 0/4 (0%) |
| Kubernetes | 597 | 3 | 1/3 (33.3%) |
| Total (all ten projects) | 2,098 | 30 | 11/30 (36.7%) |

A plausible implication is that integrating VulRTex into large-scale issue triage pipelines could materially accelerate vulnerability disclosure and mitigation timelines across diverse OSS ecosystems.

7. Limitations and Prospective Enhancements

Current limitations of VulRTex include the inability to process arbitrary video or audio streams, and dependence on the continued availability of IR page content (private/deleted IRs are out-of-scope). Directions for future improvement include integration with commit/patch analysis as additional reasoning evidence, extension to alternative collaborative platforms (e.g., GitLab, Gitee), and exploration of multimodal (e.g., CLIP-based) embeddings for richer screenshot processing.

VulRTex is, to date, the first framework to combine chain-of-thought LLM reasoning with retrieval-augmented, case-driven guidance over vulnerability reasoning graphs for the practical and precise identification of vulnerabilities in rich-text OSS issue reports (Jiang et al., 4 Sep 2025).
