VulRTex: Automated Vulnerability Detection
- VulRTex is a reasoning-guided framework that automates vulnerability detection using chain-of-thought LLM reasoning and retrieval-augmented case guidance.
- It employs a four-stage pipeline including preprocessing, reasoning graph construction, random-walk subgraph retrieval, and LLM-based classification.
- Experimental results demonstrate superior performance and early detection capabilities, accelerating vulnerability disclosure in major open-source projects.
VulRTex is a reasoning-guided framework designed to automate the identification of vulnerability-related issue reports (IRs), particularly leveraging rich-text elements such as screenshots and code snippets. By integrating chain-of-thought (CoT) LLM reasoning with retrieval-augmented guidance based on a database of historical vulnerability cases, VulRTex achieves state-of-the-art performance for vulnerability detection and CWE-ID classification in open-source software issue repositories, notably excelling under severe data imbalance (Jiang et al., 4 Sep 2025).
1. System Architecture and Component Workflow
VulRTex orchestrates a four-stage pipeline encompassing preprocessing, reasoning database construction, case retrieval and guidance prompt generation, and LLM-based vulnerability identification.
Preprocessing Module
- Crawls and normalizes raw IRs, tagging rich-text modalities:
[SCR] for screenshots, [CODE] for code snippets.
- Emits unified JSON records for subsequent modules.
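The preprocessing step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes IRs arrive as Markdown text, tags image links as [SCR] and fenced blocks as [CODE], and emits the unified JSON record; the function name and record fields are hypothetical.

```python
import json
import re

def preprocess_ir(ir_id: str, title: str, body: str) -> str:
    """Normalize a raw issue report and tag its rich-text modalities.

    Hypothetical sketch: VulRTex crawls live IR pages; here we simply
    tag Markdown image links as [SCR] and fenced blocks as [CODE].
    """
    # Replace Markdown images (screenshots) with an [SCR] tag.
    body = re.sub(r"!\[[^\]]*\]\([^)]+\)", "[SCR]", body)
    # Replace fenced code blocks with a [CODE] tag.
    body = re.sub(r"```.*?```", "[CODE]", body, flags=re.DOTALL)
    # Emit the unified JSON record consumed by later modules.
    record = {"id": ir_id, "title": title, "text": body.strip()}
    return json.dumps(record)

print(preprocess_ir("IR-1", "XSS in login",
                    "See ![shot](a.png) and ```<script>alert(1)</script>```"))
# → {"id": "IR-1", "title": "XSS in login", "text": "See [SCR] and [CODE]"}
```

A production version would also keep the tagged screenshot and snippet contents alongside the tags so the downstream analyzers can inspect them.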
Vulnerability Reasoning Database Construction
- For each historical IR, an LLM CoT agent constructs a reasoning graph detailing stepwise how the vulnerability is substantiated by rich-text.
- Each step is fact-checked and augmented using external Vulnerability Awareness (VA) datasets (e.g., VDISC, OWASP), employing TF–IDF-based retrieval to inject golden facts and correct hallucinated outputs.
- Final graphs are persisted in the Reasoning Base.
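The fact-correction step described above retrieves the closest "golden fact" for each reasoning step. A minimal sketch, using a bag-of-words cosine as a stand-in for the paper's TF–IDF retrieval; the `VA_FACTS` entries and function names are illustrative, not taken from the actual VA datasets:

```python
import math
import re
from collections import Counter

# Toy stand-in for the external Vulnerability Awareness knowledge base
# (the paper draws on KB, BigVul, OWASP, Debian, and VDISC).
VA_FACTS = [
    "CWE-79: Cross-site scripting occurs when untrusted input reaches HTML output.",
    "CWE-119: Buffer overflow from improper restriction of memory operations.",
    "CWE-89: SQL injection via unsanitized construction of database queries.",
]

def _tokens(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for TF-IDF vectors."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(c * b[t] for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_golden_fact(observation: str) -> str:
    """Return the VA fact most similar to an LLM observation."""
    obs = _tokens(observation)
    return max(VA_FACTS, key=lambda f: _cosine(obs, _tokens(f)))
```

The retrieved fact is injected back into the agent's context so hallucinated observations can be checked against authoritative knowledge before the graph is persisted.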
Case Retrieval Engine
- For a new IR, each historical reasoning graph is pruned via a weighted random walk (see §3.1 for the algorithmic details) to extract the most relevant subgraph.
- Subgraphs are ranked against the target IR using TF–IDF similarity; the top-k cases are selected as guides.
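Once each candidate subgraph has a similarity score against the target IR, selecting the guide cases is a straightforward top-k cut. A small sketch, assuming scores have already been computed (the tuple layout and default k=3 are illustrative, not the paper's values):

```python
import heapq

def select_guide_cases(scored_subgraphs, k=3):
    """Pick the k historical subgraphs most similar to the target IR.

    `scored_subgraphs` is an iterable of (similarity, case_id, summary)
    tuples; the highest-scoring k are returned in descending order.
    """
    return heapq.nlargest(k, scored_subgraphs, key=lambda s: s[0])
```

`heapq.nlargest` avoids fully sorting the (potentially large) candidate list when only a handful of guides are needed.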
LLM Reasoning Module
- Combines retrieved subgraph narratives into a guidance prompt, which is concatenated with a standardized classification prompt.
- The fused prompt is input to the LLM for binary vulnerability classification and CWE-ID prediction.
VulRTex High-Level Pipeline Pseudocode
\begin{algorithm}[H]
\caption{VulRTex Pipeline}
\KwIn{Raw historical IRs $\mathcal{H}$, target IRs $\mathcal{T}$}
\KwOut{For each $t \in \mathcal{T}$: (vulnerability? label, predicted CWE-ID)}
// 1. Preprocessing
\ForEach{$r \in \mathcal{H}$}{
  Clean, tag rich-text, output JSON.
}
\ForEach{$t \in \mathcal{T}$}{
  Same preprocessing.
}
// 2. Reasoning DB Construction
\ForEach{$r \in \mathcal{H}$}{
  $G_r \gets$ GenerateGraph($r$)\;
  Reason\_Base.add($G_r$)\;
}
// 3. Retrieval and Pruning
\ForEach{$t \in \mathcal{T}$}{
  Retrieve top-$k$ pruned subgraphs from Reason\_Base\;
}
// 4. LLM Reasoning
\ForEach{$t \in \mathcal{T}$}{
  $P_{\mathrm{guide}} \gets$ MakeGuidancePrompt($t$, retrieved cases)\;
  $P_{\mathrm{cls}} \gets$ classification prompt\;
  $(\text{label}, \text{CWE}) \gets$ LLM($P_{\mathrm{guide}} \,\|\, P_{\mathrm{cls}}$)\;
}
\Return{results}
\end{algorithm}
2. Vulnerability Reasoning Database: CoT Graph Construction and Fact Correction
Reasoning graphs model the interpretive sequence for each historical IR:
- Vertices correspond to observations (summaries of comments, screenshots, or code).
- Edges represent actions taken by the LLM agent, selected from {\texttt{ScrAnalyzer}, \texttt{CodeAnalyzer}, \texttt{AgentTerminator}}.
At each step $i$, the agent operates on the current context $c_i$ and produces an observation–action pair $(o_i, a_i) = \mathrm{LLM}(c_i)$.
To mitigate hallucination, each step's context is expanded with authoritative knowledge, $c_i' = c_i \cup K$, where $K$ aggregates data from KB, BigVul, OWASP, Debian, and VDISC.
A CoT prompt template is used:
Please think step by step. For each step, select related rich-text elements (comment, [SCR], [CODE]), output an "Observation", and choose a "Tool" (Action) to proceed. Terminate when vulnerability is identified.
Paths through these graphs encode distinct reasoning chains for each IR.
3. Case Retrieval: Random-Walk Subgraph Pruning and Guided Prompt Formulation
Pruning each historical reasoning graph for retrieval is defined by a weighted random-walk:
- Edge weights are computed as the increment in TF–IDF similarity to the target IR $t$ gained by traversing the edge: $w(u,v) = \mathrm{sim}(v, t) - \mathrm{sim}(u, t)$.
- The probability of sampling edge $(u,v)$ at node $u$ is proportional to its weight: $P(u \to v) = w(u,v) / \sum_{v'} w(u,v')$.
- The walk halts upon reaching an AgentTerminator edge; the visited nodes constitute the retrieved subgraph.
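The weighted walk above can be sketched in a few lines. This is a toy illustration under simplifying assumptions: the graph, edge weights, and function name are hypothetical, with weights standing in for the TF–IDF similarity gains toward the target IR.

```python
import random

# Toy reasoning graph: node -> list of (next_node, action, weight).
GRAPH = {
    "start": [("obs1", "ScrAnalyzer", 0.6), ("obs2", "CodeAnalyzer", 0.4)],
    "obs1": [("end", "AgentTerminator", 1.0)],
    "obs2": [("end", "AgentTerminator", 1.0)],
}

def random_walk_prune(graph, start="start", seed=None):
    """Sample one reasoning path, halting on an AgentTerminator edge."""
    rng = random.Random(seed)
    path, node = [start], start
    while True:
        edges = graph[node]
        total = sum(w for _, _, w in edges)
        # Sample the next edge with probability proportional to its weight.
        r, acc = rng.uniform(0, total), 0.0
        for nxt, action, w in edges:
            acc += w
            if r <= acc:
                break
        path.append(nxt)
        if action == "AgentTerminator":
            return path
        node = nxt
```

Because higher-weight edges are those that most increase similarity to the target IR, repeated walks concentrate on the reasoning paths most relevant to the new report.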
Candidate subgraphs are ranked by TF–IDF similarity to the target IR's full text and rich-text components; the top-k are chosen.
Guidance prompts are constructed by concatenating path summaries from these top cases with procedural instructions for new IR analysis.
4. LLM-Guided Vulnerability Identification and Output Decoding
The LLM receives the concatenated guidance and classification prompts, where the identification portion is standardized:
Please classify whether the following IR contains a vulnerability (Yes/No). If Yes, output the CWE-ID. Format: {Yes/No}, {CWE-xxx}.
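Decoding the model's reply into a structured result is a simple parsing step. A minimal sketch, assuming replies follow the `{Yes/No}, {CWE-xxx}` format above; the function name is illustrative:

```python
import re

def parse_llm_output(raw: str):
    """Decode a standardized reply like 'Yes, CWE-79' or 'No'.

    Returns (is_vulnerability, cwe_id_or_None).
    """
    m = re.match(r"\s*(Yes|No)\b", raw, re.IGNORECASE)
    if not m:
        return False, None  # treat unparseable replies as negatives
    label = m.group(1).lower() == "yes"
    cwe = None
    if label:
        c = re.search(r"CWE-\d+", raw, re.IGNORECASE)
        cwe = c.group(0).upper() if c else None
    return label, cwe

print(parse_llm_output("Yes, CWE-79"))  # (True, 'CWE-79')
print(parse_llm_output("No"))           # (False, None)
```

Tolerant matching (case-insensitive, regex-based) matters in practice, since LLMs do not always honor the requested output format exactly.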
The model produces both a binary label and CWE-ID. In an illustrative case (AntSword #147), the system robustly identifies XSS (CWE-79) by analogical reasoning along retrieved case paths.
5. Experimental Design, Evaluation, and Results
Dataset: 973,572 GitHub IRs (2015–2023); 4,002 vulnerability-related versus 1,191,318 negative-class IRs. Exploiting rich-text is crucial: 39.1% of the vulnerability-related IRs contain [SCR] or [CODE] elements.
Split: 60% of vul-positives for reasoning database construction (2,401 IRs), 40% for target evaluation (1,601 IRs). The negative class remains highly imbalanced.
| Split | #IRs Total | Rich-Text | [SCR] | [CODE] |
|---|---|---|---|---|
| Vul (hist) | 2,401 | 720 | 527 | 630 |
| Vul (target) | 1,601 | 966 | 670 | 767 |
| Non-Vul (hist) | 714,790 | 466,179 | 352,675 | 330,753 |
| Non-Vul (target) | 476,528 | 201,391 | 139,578 | 276,639 |
Metrics: Standard precision, recall, and F1; AUPRC and AUROC (addressing extreme class imbalance); Macro-F1 for multi-class CWE-ID prediction.
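Under this degree of class imbalance, AUPRC is far more informative than accuracy, and precision/recall/F1 must be computed on the positive class. A self-contained sketch of those core metrics (toy labels, no external library assumed):

```python
def precision_recall_f1(y_true, y_pred):
    """Positive-class precision, recall, and F1 for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# 3 positives among 8 samples: the classifier finds 2, with 1 false alarm.
p, r, f = precision_recall_f1([1, 1, 1, 0, 0, 0, 0, 0],
                              [1, 1, 0, 1, 0, 0, 0, 0])
```

AUPRC and AUROC additionally require ranking scores rather than hard labels, which is why the paper reports them separately from the thresholded F1.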
Baselines: Logistic Regression, MLP, MemVul; zero-shot LLMs (LLaMA-2-13B, GPT-3, ChatGPT) with and without reasoning extensions (CoT-SC, ReAct, DS-Agent).
| Model Category | Method | Time(s/IR) | Precision | Recall | F1 | AUROC | AUPRC | Macro-F1 |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2 | +VulRTex | 7.9 | 74.5 | 75.3 | 74.9 | 97.9 | 65.9 | 70.4 |
| ChatGPT | +VulRTex | 11.7 | 90.2 | 87.7 | 88.9 | 99.2 | 80.4 | 83.7 |
VulRTex-augmented ChatGPT achieved F1 = 88.9%, AUPRC = 80.4%, and Macro-F1 = 83.7%, outperforming the best baseline by +11.0%, +20.2%, and +10.5% respectively. Reasoning latency was approximately half that of ReAct or DS-Agent.
Ablations revealed critical performance drops without [SCR] analysis (−17.6% F1), random-walk pruning (−19.2% F1), or fact-correction (−12.3% F1).
6. Real-World Deployment and Observed Effectiveness
Tracking 2,098 new IRs in 2024 across ten major OSS projects (e.g., React, Kubernetes, Hugo), VulRTex identified 30 vulnerability-related IRs; 11 (36.7%) of these were assigned CVE-IDs prior to public disclosure (mean lead time, 35 days). In several cases (e.g., Carla #7025 → CWE-119 → CVE-2024-33903), VulRTex surfaced vulnerabilities over 100 days before CVE assignment, illustrating its operational viability.
| Project | #2024 IRs | #Identified | #CVE-Assigned (%) |
|---|---|---|---|
| Rimedo-ts | 1 | 1 | 1/1 (100%) |
| Carla | 300 | 6 | 1/6 (16.7%) |
| React | 197 | 4 | 0/4 (0%) |
| Kubernetes | 597 | 3 | 1/3 (33.3%) |
| Total | 2,098 | 30 | 11/30 (36.7%) |
A plausible implication is that integrating VulRTex into large-scale issue triage pipelines could materially accelerate vulnerability disclosure and mitigation timelines across diverse OSS ecosystems.
7. Limitations and Prospective Enhancements
Current limitations of VulRTex include the inability to process arbitrary video or audio streams, and dependence on the continued availability of IR page content (private/deleted IRs are out-of-scope). Directions for future improvement include integration with commit/patch analysis as additional reasoning evidence, extension to alternative collaborative platforms (e.g., GitLab, Gitee), and exploration of multimodal (e.g., CLIP-based) embeddings for richer screenshot processing.
VulRTex is, to date, the first framework to combine chain-of-thought LLM reasoning with retrieval-augmented, case-driven vulnerability graph retrieval for the practical and precise identification of vulnerabilities in rich-text OSS issue reports (Jiang et al., 4 Sep 2025).