CVSS Corpus: Vulnerability & Speech Data

Updated 26 November 2025
  • CVSS Corpus is a multifaceted collection of large-scale datasets covering both vulnerability narratives and multilingual speech-to-speech translation.
  • It integrates the OSINT corpus for detailed text mining, the base-metric corpus for statistical vulnerability analysis, and a speech corpus for robust translation model development.
  • Data is meticulously curated using web-scraping, NLP, and deep learning techniques to support dynamic risk assessment and advanced ML research.

The acronym CVSS denotes multiple prominent research corpora relating to software vulnerabilities and speech-to-speech translation, each targeting rigorous computational modeling and annotation standards. The most cited corpora are the CVSS‐OSINT corpus for vulnerability text mining and prediction (Kuehn et al., 2022), the CVSS base-metric corpus for statistical vulnerability landscape analysis (Gueye et al., 2021), and the CVSS speech-to-speech corpus for multilingual translation model development (Jia et al., 2022). Each corpus embodies structured, large-scale datasets curated from authoritative sources and underpins state-of-the-art machine learning and statistical methodologies.

1. CVSS‐OSINT Corpus: Vulnerability Description Aggregation

The CVSS‐OSINT corpus is a collection of vulnerability narratives built to advance machine-learning-based prediction of CVSS vectors and scores. It combines 88,979 CVE (Common Vulnerabilities and Exposures) descriptions harvested from the National Vulnerability Database (NVD) with 12,802 vendor or third-party advisories obtained by focused crawling of nine domains cited in NVD references (IBM, Cisco, Intel, ZDI, Talos, F5, Qualcomm, WPScan, Snyk). Domain selection followed three explicit criteria: unique CVE-to-document mapping, uniform DOM structure, and presence of a natural-language abstract.

Structured web-scraping employed Selenium for JavaScript-dependent pages and Beautiful Soup for static content, with robots.txt compliance and a 20 s timeout per page. A producer–consumer threading model enabled parallel crawling, and aggressive boilerplate cleaning isolated the core narrative sections. Each document was indexed by cve_id, source_domain, url, and cleaned natural-language text, yielding a flat CSV/JSONL schema suitable for downstream tasks. Manual validation established over 95% extraction fidelity with respect to relevant vulnerability content.
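The producer–consumer pattern mentioned above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' crawler: `crawl`, `fake_fetch`, and the example URLs are hypothetical, and the real download step (Selenium or Beautiful Soup, robots.txt checks, the per-page timeout) is replaced by a stub.

```python
import queue
import threading

SENTINEL = None  # marks the end of the work queue


def crawl(urls, fetch, n_workers=4):
    """Fetch every URL with a pool of consumer threads.

    `fetch` is a caller-supplied function url -> text (hypothetical here);
    a real crawler would wrap its page download and cleaning logic in it.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def consumer():
        while True:
            url = tasks.get()
            if url is SENTINEL:
                break
            try:
                text = fetch(url)
            except Exception:
                text = None  # record the failure instead of killing the thread
            with lock:
                results[url] = text

    workers = [threading.Thread(target=consumer) for _ in range(n_workers)]
    for worker in workers:
        worker.start()

    # Producer: enqueue the pre-selected URLs, then one sentinel per worker.
    for url in urls:
        tasks.put(url)
    for _ in workers:
        tasks.put(SENTINEL)

    for worker in workers:
        worker.join()
    return results


def fake_fetch(url):
    # Stand-in for a real page download; fails for one made-up URL.
    if url.endswith("/broken"):
        raise TimeoutError("simulated timeout")
    return f"advisory text for {url}"


pages = crawl(["https://example.org/a", "https://example.org/broken"], fake_fetch)
print(pages)
```

Recording a failed fetch as `None` rather than dropping it keeps the fetched-versus-pre-selected ratio per domain easy to compute afterwards.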

| Domain | URLs Pre-selected | URLs Fetched (%) | Text Length (chars) |
|---|---|---|---|
| IBM | 3,447 | 2,868 (83%) | Median = 532 |
| Cisco | 3,019 | 3,004 (99%) | Median = 532 |
| ZDI | 2,899 | 2,899 (100%) | Median = 532 |
| Talos | 1,335 | 1,201 (89%) | Median = 532 |
| Qualcomm | 1,048 | 697 (66%) | Median = 532 |
| F5 | 932 | 740 (79%) | Median = 532 |
| WPScan | 803 | 35 (4%) | Median = 532 |
| Intel | 771 | 731 (94%) | Median = 532 |
| Snyk | 671 | 627 (93%) | Median = 532 |

The combined corpus, 101,781 records in total, serves as a benchmark dataset for CVSS automation and modeling (Kuehn et al., 2022).
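The totals quoted above follow directly from the per-domain counts. The short check below copies the pre-selected and fetched URL counts from the table and the NVD description count from the text; the variable names are illustrative.

```python
# (pre-selected URLs, fetched URLs) per domain, copied from the table above
domains = {
    "IBM": (3447, 2868),
    "Cisco": (3019, 3004),
    "ZDI": (2899, 2899),
    "Talos": (1335, 1201),
    "Qualcomm": (1048, 697),
    "F5": (932, 740),
    "WPScan": (803, 35),
    "Intel": (771, 731),
    "Snyk": (671, 627),
}
NVD_DESCRIPTIONS = 88_979  # CVE descriptions harvested from NVD

selected_total = sum(selected for selected, _ in domains.values())
fetched_total = sum(fetched for _, fetched in domains.values())
fetch_rates = {name: fetched / selected for name, (selected, fetched) in domains.items()}
overall_rate = fetched_total / selected_total

print(fetched_total)                     # 12802 advisories
print(fetched_total + NVD_DESCRIPTIONS)  # 101781 records in total
print(f"{overall_rate:.1%}")             # overall retrieval rate across the nine domains
```

The fetched counts sum to the 12,802 advisories reported for the crawl, and adding the NVD descriptions reproduces the 101,781-record total.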

2. CVSS Base-Metric Corpus: Landscape Statistics and Metrics

The CVSS base-metric corpus, assembled from the NVD for statistical vulnerability landscape analysis, encompasses all CVEs with complete CVSS vectors and scores over two periods: CVSS v2 (2005–2019, 118,173 entries) and v3 (2015–2019, 55,441 entries) (Gueye et al., 2021). Each v3 record includes the eight base metrics: Attack Vector (AV), Attack Complexity (AC), Privileges Required (PR), User Interaction (UI), Scope (S), Confidentiality (C), Integrity (I), Availability (A).

For the CVSS v3 corpus (2015–2019), the empirical distribution is heavily concentrated on a few metric values: AV=Network (90%), AC=Low (85%), PR=None (70%), UI=None (85%), S=Unchanged (95%), and High impact ratings (C/I/A=High: 60%). The empirical base-score distribution diverges from the theoretical bell-shaped expectation because CVEs cluster at a few metric-vector combinations (~10 patterns describe most CVEs). Statistical fits show that the real distribution is better modeled by a scaled Beta law (α=2.2, β=1.4) than by a normal law, as confirmed by χ² and Kolmogorov–Smirnov tests.

| Metric | Most Frequent Value | Percentage (v3) |
|---|---|---|
| AV | Network | 90.0% |
| AC | Low | 85.0% |
| PR | None | 70.0% |
| UI | None | 85.0% |
| S | Unchanged | 95.0% |
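To give a feel for the scaled Beta law quoted above, the sketch below evaluates its density and two summary statistics. It assumes the fit is rescaled onto the 0–10 score range (the `SCALE` constant is an assumption; the text does not state the scaling), and `scaled_beta_pdf` is an illustrative helper.

```python
import math

ALPHA, BETA = 2.2, 1.4  # shape parameters quoted in the text
SCALE = 10.0            # assumption: the law is rescaled onto the 0-10 score range


def scaled_beta_pdf(score, a=ALPHA, b=BETA, scale=SCALE):
    """Density of scale * X, where X follows a Beta(a, b) law."""
    x = score / scale
    if not 0.0 < x < 1.0:
        return 0.0
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1) / scale


# Closed-form mean and mode of a Beta law, rescaled to the score range.
mean = SCALE * ALPHA / (ALPHA + BETA)
mode = SCALE * (ALPHA - 1) / (ALPHA + BETA - 2)  # valid because both shapes exceed 1

print(round(mean, 2), round(mode, 2))
```

With α greater than β, the density peaks in the upper part of the score range, which is the qualitative shape the fit is meant to capture.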

Temporal trend analysis reveals near-complete stability in metric and base-score distributions over 15 years, with the median CVSS base-score consistently at 7.4 ± 0.2 (Gueye et al., 2021).

3. Automated CVSS Vector Prediction: NLP and Deep Learning Datasets

The CVSS datasets underpin a diverse spectrum of ML-based CVSS prediction research. The CVSS-BERT approach employs the NVD corpus (45,926 CVE descriptions, 2018–2020) to train BERT-small classifiers for each CVSS v3.1 base metric—Attack Vector, Attack Complexity, Privileges Required, User Interaction, Scope, and Confidentiality/Integrity/Availability Impact (Shahid et al., 2021).

Preprocessing consists of WordPiece tokenization, sequence truncation/padding to length 128, and direct labeling from NVD JSON feeds. Model training uses a two-phase regime—classification head training, followed by encoder fine-tuning—with AdamW optimization. Per-metric test accuracy ranges from 0.8379 (PR) to 0.9607 (AC). The severity score is reconstructed as per CVSS v3.1 BaseScore formula:

$$\text{BaseScore} = \text{round}_{\text{up}}\left(\min\{\text{Impact} + \text{Exploitability},\ 10\}\right)$$
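The reconstruction step can be sketched as follows: given one predicted value per base metric, the score is recomputed from fixed weights. The weights, the Impact and Exploitability sub-formulas, the Scope-Changed branch (which multiplies the sum by 1.08 before the cap), and the round-up routine are taken from the public CVSS v3.1 specification rather than from the text above; the function names are illustrative, and this is not the CVSS-BERT code.

```python
import math

# Numeric weights from the public CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.50}


def roundup(x):
    """Round up to one decimal, using the integer arithmetic of the v3.1 spec."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, scope, c, i, a):
    """Recompute the v3.1 base score from single-letter metric values."""
    changed = scope == "C"
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08  # Scope-Changed branch of the specification
    return roundup(min(total, 10))


# Vector AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H
print(base_score("N", "L", "N", "N", "U", "H", "H", "H"))  # 9.8
```

Because the score is a deterministic function of the eight metrics, a single misclassified metric can shift the reconstructed score by more than one point, which is why score-level error is reported separately from per-metric accuracy.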

Mean absolute error between predicted and expert BaseScores is 0.73; 75% of samples are within ±1 of expert annotation. Gradient-based input saliency (“Gradient × Input”) generates interpretable token-level rationales aligning with expert heuristics.
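Both agreement figures are simple to compute once predicted and expert scores are paired. The sketch below uses made-up scores purely for illustration (they are not data from the paper), and `score_agreement` is a hypothetical helper name.

```python
def score_agreement(predicted, expert, tolerance=1.0):
    """Return (mean absolute error, share of samples within +/- tolerance)."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("need two equally sized, non-empty score lists")
    errors = [abs(p - e) for p, e in zip(predicted, expert)]
    mae = sum(errors) / len(errors)
    # Small epsilon so that differences of exactly `tolerance` are not
    # excluded by floating-point rounding.
    within = sum(err <= tolerance + 1e-9 for err in errors) / len(errors)
    return mae, within


# Made-up scores, for illustration only.
predicted = [9.8, 7.5, 5.5, 8.5]
expert = [9.8, 6.0, 5.0, 7.5]
mae, within = score_agreement(predicted, expert)
print(mae, within)  # 0.75 0.75
```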

4. Corpus-Driven Metric Enhancements and Time-Dependent Scoring

Corpus analyses have exposed severe clustering and static biases in CVSS vector distributions, prompting systematic proposals for enhanced metric assignment. Petraityte et al. (1807.10435) dissect base-score pathologies (notably the collapse of the “Partial” level) and propose splitting partial impact into Partial-Application (PA, 0.461) and Partial-System (PS, 0.515) for improved granularity. Exploitability is made time-sensitive by using the Panjer recursion formula (PRF) and Poisson process modeling for rolling aggregation of “critical points”: proof-of-concept, exploit, and patch arrivals.

$$\text{CVSS} = (0.6 \times \text{Impact} + 0.4 \times \text{Exploitability} - 1.5) \times f(\text{Impact})$$

This enhancement enables dynamic risk curves tracking exploitation lifecycle, an advance over the static, clustered native scoring. Simulated time trajectories exhibit smooth decay (single critical event) or stepwise reduction post-patch and exploit disclosure.
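The base equation shown above has the same form as the CVSS v2 base equation, so it can be evaluated with the v2 sub-formulas. In the sketch below, the Impact and Exploitability definitions, the metric weights, and f(Impact) = 1.176 (0 when Impact is 0) come from the public CVSS v2 specification, not from the text; the function names are illustrative, and the time-dependent exploitability model is not reproduced.

```python
# Standard CVSS v2 weights (public specification), used here for illustration.
PARTIAL, COMPLETE = 0.275, 0.660
AV_NETWORK, AC_LOW, AU_NONE = 1.0, 0.71, 0.704


def v2_impact(c, i, a):
    """Impact sub-score from numeric confidentiality/integrity/availability weights."""
    return 10.41 * (1 - (1 - c) * (1 - i) * (1 - a))


def v2_exploitability(av, ac, au):
    """Exploitability sub-score from numeric access and authentication weights."""
    return 20 * av * ac * au


def v2_base_score(impact, exploitability):
    """Base equation: (0.6*Impact + 0.4*Exploitability - 1.5) * f(Impact)."""
    f = 0.0 if impact == 0 else 1.176
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f, 1)


expl = v2_exploitability(AV_NETWORK, AC_LOW, AU_NONE)
print(v2_base_score(v2_impact(COMPLETE, COMPLETE, COMPLETE), expl))  # 10.0
print(v2_base_score(v2_impact(PARTIAL, PARTIAL, PARTIAL), expl))     # 7.5
```

If the proposed PA (0.461) and PS (0.515) levels are treated as drop-in replacements for the single Partial weight, which is an assumption about how they enter the equation, they would be passed to `v2_impact` in place of `PARTIAL`.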

5. CVSS Speech-to-Speech Corpus: Multilingual Parallel Speech Resources

Distinct from vulnerability datasets, the “CVSS Corpus” (Jia et al., 2022) also names a large-scale parallel speech-to-speech translation dataset for 21 source languages into English. Derived from Common Voice and CoVoST 2 corpora, CVSS encompasses ~1,153 hours of source speech and synthetic English targets.

It is released in two variants: CVSS-C (canonical English speaker, high speech naturalness) and CVSS-T (zero-shot voice cloning to preserve source speaker identity). The corpus comprises over 800 hours of synthetic English speech and 650,000+ train pairs, with rigorous normalization (WFST for non-standard tokens) and advanced TTS pipelines (PnG NAT, WaveRNN). Baseline modeling with Translatotron/Translatotron2 and cascade ASR→MT→TTS configurations delivers BLEU scores up to 12.7 for the best pre-trained systems. MOS ratings demonstrate tradeoffs: CVSS-C exhibits higher naturalness (≈ 4.6), while CVSS-T achieves superior speaker similarity at the cost of reduced clarity.

6. Access, Practical Utility, and Recommendations

Both the vulnerability and the speech CVSS corpora are designed for public research access, typically via GitHub under open-source licensing or on request from the authors. The vulnerability text corpus (CVSS-OSINT) is available as JSONL/CSV with cve_id, source_domain, url, and text fields (Kuehn et al., 2022). The speech corpus (CVSS) comprises parallel audio–text datasets with explicit splits for training, development, and testing (Jia et al., 2022).
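A JSONL release with the four fields named above could be read as sketched below. The field names come from the text; the sample records, the `read_records` and `texts_by_cve` helpers, and the placeholder CVE identifiers are hypothetical, and the actual file layout of the released corpus may differ.

```python
import io
import json
from collections import defaultdict

REQUIRED_FIELDS = ("cve_id", "source_domain", "url", "text")


def read_records(lines):
    """Yield one dict per non-empty JSONL line, checking the expected fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def texts_by_cve(records):
    """Group (source_domain, text) pairs under their CVE identifier."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["cve_id"]].append((record["source_domain"], record["text"]))
    return dict(grouped)


# Placeholder records; an in-memory file stands in for the corpus on disk.
sample = io.StringIO(
    '{"cve_id": "CVE-0000-0001", "source_domain": "nvd", '
    '"url": "https://example.org/1", "text": "Placeholder NVD description."}\n'
    '{"cve_id": "CVE-0000-0001", "source_domain": "vendor", '
    '"url": "https://example.org/2", "text": "Placeholder vendor advisory."}\n'
    "\n"
    '{"cve_id": "CVE-0000-0002", "source_domain": "nvd", '
    '"url": "https://example.org/3", "text": "Another placeholder."}\n'
)
grouped = texts_by_cve(read_records(sample))
print({cve: len(docs) for cve, docs in grouped.items()})
```

Grouping by CVE identifier makes it straightforward to concatenate or compare the NVD description with any advisories that describe the same vulnerability.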

Corpus-driven modeling recommendations include integration of advanced metric assignment schemes, time-dependent scoring, and adoption of ML/NLP pipelines for rapid, high-fidelity vector prediction. The speech corpus supports S2ST system benchmarking with both canonical and personalized translation scenarios, providing rich testbeds for transfer learning and end-to-end architectures.

7. Limitations and Future Corpus Directions

Observed limitations include corpus bias towards a restricted set of vulnerability patterns (≥60% described by ≲10 vectors), static base-score pathologies not remediable without fundamental framework modifications, rate-limiting in the crawling of certain domains (e.g., WPScan at 4% retrieval), and lag in metric updates reflecting real-world exploit and patch events. Speech corpus limitations pertain to naturalness and speaker-similarity tradeoffs, as well as capacity and transfer challenges for low-resource directions.
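The concentration on a few vector patterns can be quantified as the share of records covered by the k most frequent vectors. The sketch below uses a toy list of vector strings, not corpus data, and `top_k_coverage` is an illustrative helper.

```python
from collections import Counter


def top_k_coverage(vectors, k=10):
    """Share of records whose vector is among the k most frequent vectors."""
    vectors = list(vectors)
    if not vectors:
        return 0.0
    counts = Counter(vectors)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(vectors)


# Toy data: three distinct vectors with frequencies 6, 3, and 1.
toy = (
    ["AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"] * 6
    + ["AV:N/AC:L/PR:N/UI:R/S:C/C:L/I:L/A:N"] * 3
    + ["AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H"] * 1
)
print(top_k_coverage(toy, k=1))  # 0.6
print(top_k_coverage(toy, k=2))  # 0.9
```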

A plausible implication is that future corpus construction and annotation standards must target expanded metric diversity, dynamic score recalibration, context-sensitive impact modeling, and deeper inclusion of OSINT sources. For speech translation, further technical advances in TTS/VC and multimodal pre-training may narrow remaining naturalness and fidelity gaps.
