VulRTex: Automated Vulnerability Detection
- VulRTex is a reasoning-guided framework that automates vulnerability detection using chain-of-thought LLM reasoning and retrieval-augmented case guidance.
- It employs a four-stage pipeline including preprocessing, reasoning graph construction, random-walk subgraph retrieval, and LLM-based classification.
- Experimental results demonstrate superior performance and early detection capabilities, accelerating vulnerability disclosure in major open-source projects.
VulRTex is a reasoning-guided framework designed to automate the identification of vulnerability-related issue reports (IRs), particularly leveraging rich-text elements such as screenshots and code snippets. By integrating chain-of-thought (CoT) LLM reasoning with retrieval-augmented guidance based on a database of historical vulnerability cases, VulRTex achieves state-of-the-art performance for vulnerability detection and CWE-ID classification in open-source software issue repositories, notably excelling under severe data imbalance (Jiang et al., 4 Sep 2025).
1. System Architecture and Component Workflow
VulRTex orchestrates a four-stage pipeline encompassing preprocessing, reasoning database construction, case retrieval and guidance prompt generation, and LLM-based vulnerability identification.
Preprocessing Module
- Crawls and normalizes raw IRs, tagging rich-text modalities:
[SCR] for screenshots, [CODE] for code snippets.
- Emits unified JSON records for subsequent modules.
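The preprocessing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes IRs arrive as Markdown text, tags image links as [SCR] and fenced blocks as [CODE], and emits the unified JSON record; the function name and record fields are hypothetical.

```python
import json
import re

def preprocess_ir(ir_id: str, title: str, body: str) -> str:
    """Normalize a raw issue report and tag its rich-text modalities.

    Hypothetical sketch: VulRTex crawls live IR pages; here we simply
    tag Markdown image links as [SCR] and fenced blocks as [CODE].
    """
    # Replace Markdown images (screenshots) with an [SCR] tag.
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", "[SCR]", body)
    # Replace fenced code blocks with a [CODE] tag.
    body = re.sub(r"```.*?```", "[CODE]", body, flags=re.DOTALL)
    # Emit the unified JSON record consumed by later modules.
    record = {"id": ir_id, "title": title, "text": body.strip()}
    return json.dumps(record)

print(preprocess_ir("IR-1", "XSS in login",
                    "See ![shot](a.png) and ```<script>alert(1)</script>```"))
# → {"id": "IR-1", "title": "XSS in login", "text": "See [SCR] and [CODE]"}
```

A production version would also keep the tagged screenshot and snippet contents alongside the tags so the downstream analyzers can inspect them.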
Vulnerability Reasoning Database Construction
- For each historical IR, an LLM CoT agent constructs a reasoning graph detailing stepwise how the vulnerability is substantiated by rich-text.
- Each step is fact-checked and augmented using external Vulnerability Awareness (VA) datasets (e.g., VDISC, OWASP), employing TF–IDF-based retrieval to inject golden facts and correct hallucinated outputs.
- Final graphs are persisted in the Reasoning Base.
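The fact-correction step described above retrieves the closest "golden fact" for each reasoning step. A minimal sketch, using a bag-of-words cosine as a stand-in for the paper's TF–IDF retrieval; the `VA_FACTS` entries and function names are illustrative, not taken from the actual VA datasets:

```python
import math
import re
from collections import Counter

# Toy stand-in for the external Vulnerability Awareness knowledge base
# (the paper draws on KB, BigVul, OWASP, Debian, and VDISC).
VA_FACTS = [
    "CWE-79: Cross-site scripting occurs when untrusted input reaches HTML output.",
    "CWE-119: Buffer overflow from improper restriction of memory operations.",
    "CWE-89: SQL injection via unsanitized construction of database queries.",
]

def _tokens(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for TF-IDF vectors."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_golden_fact(observation: str) -> str:
    """Return the VA fact most similar to an LLM observation."""
    obs = _tokens(observation)
    return max(VA_FACTS, key=lambda f: _cosine(obs, _tokens(f)))
```

The retrieved fact is injected back into the agent's context so hallucinated observations can be checked against authoritative knowledge before the graph is persisted.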
Case Retrieval Engine
- For a new IR, each historical reasoning graph is pruned via a weighted random walk (see §3.1 for the algorithmic details) to extract the most relevant subgraph.
- Subgraphs are ranked against the target IR using TF–IDF similarity; the top-k cases are selected as guides.
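Once each candidate subgraph has a similarity score against the target IR, selecting the guide cases is a straightforward top-k cut. A small sketch, assuming scores have already been computed (the tuple layout and default k=3 are illustrative, not the paper's values):

```python
import heapq

def select_guide_cases(scored_subgraphs, k=3):
    """Pick the k historical subgraphs most similar to the target IR.

    `scored_subgraphs` is an iterable of (similarity, case_id, summary)
    tuples; the highest-scoring k are returned in descending order.
    """
    return heapq.nlargest(k, scored_subgraphs, key=lambda s: s[0])
```

`heapq.nlargest` avoids fully sorting the (potentially large) candidate list when only a handful of guides are needed.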
LLM Reasoning Module
- Combines retrieved subgraph narratives into a guidance prompt, which is concatenated with a standardized classification prompt.
- The fused prompt is input to the LLM for binary vulnerability classification and CWE-ID prediction.
VulRTex High-Level Pipeline Pseudocode
\begin{algorithm}[H]
\caption{VulRTex Pipeline}
\KwIn{Raw historical IRs $\mathcal{H}$, target IRs $\mathcal{T}$}
\KwOut{For each $t \in \mathcal{T}$: (vulnerability? label, predicted CWE-ID)}
// 1. Preprocessing
\ForEach{$r \in \mathcal{H}$}{
  Clean, tag rich-text, output JSON.
}
\ForEach{$t \in \mathcal{T}$}{
  Same preprocessing.
}
// 2. Reasoning DB Construction
\ForEach{$r \in \mathcal{H}$}{
  $G_r \gets$ GenerateGraph($r$)\;
  Reason\_Base.add($G_r$)\;
}
// 3. Retrieval and Pruning
\ForEach{$t \in \mathcal{T}$}{
  Retrieve top-$k$ pruned subgraphs from Reason\_Base\;
}
// 4. LLM Reasoning
\ForEach{$t \in \mathcal{T}$}{
  $P_{\mathrm{guide}} \gets$ MakeGuidancePrompt($t$, retrieved cases)\;
  $P_{\mathrm{cls}} \gets$ classification prompt\;
  $(\text{label}, \text{CWE}) \gets$ LLM($P_{\mathrm{guide}} \,\|\, P_{\mathrm{cls}}$)\;
}
\Return{results}
\end{algorithm}
2. Vulnerability Reasoning Database: CoT Graph Construction and Fact Correction
Reasoning graphs model the interpretive sequence for each historical IR:
- Vertices correspond to observations (summaries of comments, screenshots, or code).
- Edges represent actions taken by the LLM agent, selected from {\texttt{ScrAnalyzer}, \texttt{CodeAnalyzer}, \texttt{AgentTerminator}}.
At each step $i$, the agent operates on the current context $c_i$ and produces an observation–action pair $(o_i, a_i) = \mathrm{LLM}(c_i)$.
To mitigate hallucination, each step's context is expanded with authoritative knowledge, $c_i' = c_i \cup K$, where $K$ aggregates data from KB, BigVul, OWASP, Debian, and VDISC.
A CoT prompt template is used:
Please think step by step. For each step, select related rich-text elements (comment, [SCR], [CODE]), output an "Observation", and choose a "Tool" (Action) to proceed. Terminate when vulnerability is identified.
Paths through these graphs encode distinct reasoning chains for each IR.
3. Case Retrieval: Random-Walk Subgraph Pruning and Guided Prompt Formulation
Pruning each historical reasoning graph for retrieval is defined by a weighted random-walk:
- Edge weights are computed as the increment in TF–IDF similarity to the target IR $t$ gained by traversing the edge: $w(u,v) = \mathrm{sim}(v, t) - \mathrm{sim}(u, t)$.
- The probability of sampling edge $(u,v)$ at node $u$ is proportional to its weight: $P(u \to v) = w(u,v) / \sum_{v'} w(u,v')$.
- The walk halts upon reaching an AgentTerminator edge; the visited nodes constitute the retrieved subgraph.
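The weighted walk above can be sketched in a few lines. This is a toy illustration under simplifying assumptions: the graph, edge weights, and function name are hypothetical, with weights standing in for the TF–IDF similarity gains toward the target IR.

```python
import random

# Toy reasoning graph: node -> list of (next_node, action, weight).
GRAPH = {
    "start": [("obs1", "ScrAnalyzer", 0.6), ("obs2", "CodeAnalyzer", 0.4)],
    "obs1": [("end", "AgentTerminator", 1.0)],
    "obs2": [("end", "AgentTerminator", 1.0)],
}

def random_walk_prune(graph, start="start", seed=None):
    """Sample one reasoning path, halting on an AgentTerminator edge."""
    rng = random.Random(seed)
    path, node = [start], start
    while True:
        edges = graph[node]
        total = sum(w for _, _, w in edges)
        # Sample the next edge with probability proportional to its weight.
        r, acc = rng.uniform(0, total), 0.0
        for nxt, action, w in edges:
            acc += w
            if r <= acc:
                break
        path.append(nxt)
        if action == "AgentTerminator":
            return path
        node = nxt
```

Because higher-weight edges are those that most increase similarity to the target IR, repeated walks concentrate on the reasoning paths most relevant to the new report.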
Candidate subgraphs are ranked by TF–IDF similarity to the target IR's full text and rich-text components; the top-k are chosen.
Guidance prompts are constructed by concatenating path summaries from these top cases with procedural instructions for new IR analysis.
4. LLM-Guided Vulnerability Identification and Output Decoding
The LLM receives the concatenated guidance and classification prompts, where the identification portion is standardized:
Please classify whether the following IR contains a vulnerability (Yes/No). If Yes, output the CWE-ID. Format: {Yes/No}, {CWE-xxx}.
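Decoding the model's reply into a structured result is a simple parsing step. A minimal sketch, assuming replies follow the `{Yes/No}, {CWE-xxx}` format above; the function name is illustrative:

```python
import re

def parse_llm_output(raw: str):
    """Decode a standardized reply like 'Yes, CWE-79' or 'No'.

    Returns (is_vulnerability, cwe_id_or_None).
    """
    m = re.match(r"\s*(Yes|No)\b", raw, re.IGNORECASE)
    if not m:
        return False, None  # treat unparseable replies as negatives
    label = m.group(1).lower() == "yes"
    cwe = None
    if label:
        c = re.search(r"CWE-\d+", raw, re.IGNORECASE)
        cwe = c.group(0).upper() if c else None
    return label, cwe

print(parse_llm_output("Yes, CWE-79"))  # (True, 'CWE-79')
print(parse_llm_output("No"))           # (False, None)
```

Tolerant matching (case-insensitive, regex-based) matters in practice, since LLMs do not always honor the requested output format exactly.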
The model produces both a binary label and CWE-ID. In an illustrative case (AntSword #147), the system robustly identifies XSS (CWE-79) by analogical reasoning along retrieved case paths.
5. Experimental Design, Evaluation, and Results
Dataset: 973,572 GitHub IRs (2015–2023); 4,002 vulnerability-related versus 1,191,318 negative-class IRs. Exploiting rich-text is crucial: 39.1% of the vulnerability-related IRs contain [SCR] or [CODE] elements.
Split: 60% of vul-positives for reasoning database construction (2,401 IRs), 40% for target evaluation (1,601 IRs). The negative class remains highly imbalanced.
| Split | #IRs Total | Rich-Text | [SCR] | [CODE] |
|---|---|---|---|---|
| Vul (hist) | 2,401 | 720 | 527 | 630 |
| Vul (target) | 1,601 | 966 | 670 | 767 |
| Non-Vul (hist) | 714,790 | 466,179 | 352,675 | 330,753 |
| Non-Vul (target) | 476,528 | 201,391 | 139,578 | 276,639 |
Metrics: Standard precision, recall, and F1; AUPRC and AUROC (addressing extreme class imbalance); Macro-F1 for multi-class CWE-ID prediction.
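Under this degree of class imbalance, AUPRC is far more informative than accuracy, and precision/recall/F1 must be computed on the positive class. A self-contained sketch of those core metrics (toy labels, no external library assumed):

```python
def precision_recall_f1(y_true, y_pred):
    """Positive-class precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# 3 positives among 8 samples: the classifier finds 2, with 1 false alarm.
p, r, f = precision_recall_f1([1, 1, 1, 0, 0, 0, 0, 0],
                              [1, 1, 0, 1, 0, 0, 0, 0])
```

AUPRC and AUROC additionally require ranking scores rather than hard labels, which is why the paper reports them separately from the thresholded F1.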
Baselines: Logistic Regression, MLP, MemVul; zero-shot LLMs (LLaMA-2-13B, GPT-3, ChatGPT) with and without reasoning extensions (CoT-SC, ReAct, DS-Agent).
| Model Category | Method | Time(s/IR) | Precision | Recall | F1 | AUROC | AUPRC | Macro-F1 |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2 | +VulRTex | 7.9 | 74.5 | 75.3 | 74.9 | 97.9 | 65.9 | 70.4 |
| ChatGPT | +VulRTex | 11.7 | 90.2 | 87.7 | 88.9 | 99.2 | 80.4 | 83.7 |
VulRTex-augmented ChatGPT achieved F1 = 88.9%, AUPRC = 80.4%, and Macro-F1 = 83.7%, outperforming the best baseline by +11.0%, +20.2%, and +10.5% respectively. Reasoning latency was approximately half that of ReAct or DS-Agent.
Ablations revealed critical performance drops without [SCR] analysis (−17.6% F1), random-walk pruning (−19.2% F1), or fact-correction (−12.3% F1).
6. Real-World Deployment and Observed Effectiveness
Tracking 2,098 new IRs in 2024 across ten major OSS projects (e.g., React, Kubernetes, Hugo), VulRTex identified 30 vulnerability-related IRs; 11 (36.7%) of these were assigned CVE-IDs prior to public disclosure (mean lead time, 35 days). In several cases (e.g., Carla #7025 → CWE-119 → CVE-2024-33903), VulRTex surfaced vulnerabilities over 100 days before CVE assignment, illustrating its operational viability.
| Project | #2024 IRs | #Identified | #CVE-Assigned (%) |
|---|---|---|---|
| Rimedo-ts | 1 | 1 | 1/1 (100%) |
| Carla | 300 | 6 | 1/6 (16.7%) |
| React | 197 | 4 | 0/4 (0%) |
| Kubernetes | 597 | 3 | 1/3 (33.3%) |
| Total | 2,098 | 30 | 11/30 (36.7%) |
A plausible implication is that integrating VulRTex into large-scale issue triage pipelines could materially accelerate vulnerability disclosure and mitigation timelines across diverse OSS ecosystems.
7. Limitations and Prospective Enhancements
Current limitations of VulRTex include the inability to process arbitrary video or audio streams, and dependence on the continued availability of IR page content (private/deleted IRs are out-of-scope). Directions for future improvement include integration with commit/patch analysis as additional reasoning evidence, extension to alternative collaborative platforms (e.g., GitLab, Gitee), and exploration of multimodal (e.g., CLIP-based) embeddings for richer screenshot processing.
VulRTex is, to date, the first framework to combine chain-of-thought LLM reasoning with retrieval-augmented, case-driven vulnerability graph retrieval for the practical and precise identification of vulnerabilities in rich-text OSS issue reports (Jiang et al., 4 Sep 2025).