Vulnerability Score: Metrics and Methodologies

Updated 24 May 2026

Vulnerability scores are quantitative measures that assess the severity and exploitability of software flaws using structured frameworks like CVSS and adaptive ML methods.
They integrate metrics such as attack vector, privileges, and impact factors to compute scores through both traditional formulas and data-driven ensemble models.
Recent advancements combine exploit prediction, supply chain risk, and contextual analysis to enhance the accuracy of vulnerability management and triage.

A vulnerability score is a quantitative indicator representing the severity, exploitability, or risk profile of a software flaw or system weakness, usually for the purpose of triage, risk management, or automated remediation workflows. The term encompasses both established frameworks—such as the Common Vulnerability Scoring System (CVSS)—and advanced, data-driven, context-adaptive scoring schemes. Recent research explores not only the computational machinery underpinning these scores, but also operational metrics for automation, exploit prediction, risk-weighting, and empirical assessment in high-volume vulnerability management scenarios.

1. Formal Structure and Calculation of CVSS Vulnerability Score

The canonical vulnerability score is the CVSS Base Score, defined by a parameterized aggregation of eight atomic metrics:

Exploitability metrics: Attack Vector (AV), Attack Complexity (AC), Privileges Required (PR), User Interaction (UI)
Scope: Unchanged (U) or Changed (C)
Impact metrics: Confidentiality (C), Integrity (I), Availability (A)

These categorical metrics map to numeric weights as prescribed in the FIRST CVSS v3.1/v3.0 specification:

Metric	Value	Numeric weight
AV	Network (N) / Adjacent (A) / Local (L) / Physical (P)	0.85 / 0.62 / 0.55 / 0.20
AC	Low (L) / High (H)	0.77 / 0.44
PR (S=U)	None (N) / Low (L) / High (H)	0.85 / 0.62 / 0.27
PR (S=C)	None (N) / Low (L) / High (H)	0.85 / 0.68 / 0.50
UI	None (N) / Required (R)	0.85 / 0.62
S	Unchanged (U) / Changed (C)	-
C, I, A	High (H) / Low (L) / None (N)	0.56 / 0.22 / 0.00

The base score is computed as:

$\mathrm{Exploitability} = 8.22 \times AV \times AC \times PR \times UI$
$\mathrm{ISS} = 1 - (1-C)\cdot(1-I)\cdot(1-A)$
$\mathrm{Impact} = \begin{cases} 6.42 \times \mathrm{ISS}, & S=U \\ 7.52\times(\mathrm{ISS}-0.029) - 3.25\times(\mathrm{ISS}-0.02)^{15}, & S=C \end{cases}$
$\mathrm{BaseScore} = \begin{cases} 0, & \mathrm{Impact} \leq 0 \\ \mathrm{RoundUp}_{0.1}(\min(\mathrm{Impact} + \mathrm{Exploitability}, 10)), & S=U \\ \mathrm{RoundUp}_{0.1}(\min(1.08 \cdot (\mathrm{Impact} + \mathrm{Exploitability}), 10)), & S=C \end{cases}$

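This yields a score in [0, 10] at one-decimal resolution, partitioned into qualitative bands: None (0.0), Low (0.1 to 3.9), Medium (4.0 to 6.9), High (7.0 to 8.9), and Critical (9.0 to 10.0) (Gueye et al., 2021, Jafarikhah et al., 7 Dec 2025). The Changed-scope impact expression above is the v3.0 form; v3.1 replaces its last term with $3.25\times(0.9731\cdot\mathrm{ISS}-0.02)^{13}$ and specifies a floating-point-safe RoundUp.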
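The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`base_score`, `severity`, `round_up_1dp`) are hypothetical, the Changed-scope impact term is taken exactly as written above, and the rounding helper uses integer arithmetic as an assumption to avoid floating-point artifacts. The band thresholds follow the FIRST qualitative rating scale.

```python
# Numeric weights for the CVSS v3.x base metrics (see the table above).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.50}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def round_up_1dp(x: float) -> float:
    """Round up to one decimal place using integer arithmetic."""
    scaled = round(x * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (scaled // 10000 + 1) / 10.0


def base_score(vector: str) -> float:
    """Score a vector such as 'AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H'."""
    m = dict(part.split(":") for part in vector.split("/"))
    changed = m["S"] == "C"
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return round_up_1dp(min(total, 10.0))


def severity(score: float) -> str:
    """Map a base score to its qualitative band."""
    if score == 0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


print(base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

For the vector in the last line, Exploitability is about 3.89 and Impact about 5.87, so the sum of roughly 9.76 rounds up to 9.8 (Critical).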

2. Automated Scoring and Natural Language Processing

The increasing scale of vulnerability disclosures necessitates automation. State-of-the-art methodologies employ ML and LLMs:

CVSS-BERT: An ensemble of eight BERT-based classifiers, one per base metric. Given a vulnerability description, each model predicts the categorical value of its target metric, and the predicted vector is then scored with the official formula. CVSS-BERT attains per-metric accuracies ranging from 0.83 to 0.96 and a mean absolute error (MAE) of 0.73 on base scores; exact matches occur in over 55% of cases (Shahid et al., 2021).
LLM approaches: Recent work shows that models such as GPT-5 and Gemini can be prompt-engineered (zero-temperature, few-shot) to extract base metrics directly from English CVE descriptions. Ensemble classifiers built over LLM outputs yield additional gains, particularly in metrics with high class imbalance (e.g., Attack Vector, User Interaction), though limitations persist for impact-centric metrics due to lack of contextual detail in descriptions (Jafarikhah et al., 7 Dec 2025).
Error sources: Systematic misclassification is linked to missing context (e.g., required privileges), ambiguous text, and model overgeneralization. Structured tags (CWE, CPE) provide negligible accuracy gains unless fit for adversarial matching.
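The evaluation quantities quoted above (per-metric accuracy, MAE on base scores, exact-match rate) can be computed as in the following sketch. It assumes predictions and reference labels are already available as dictionaries keyed by metric abbreviation; the function names and the toy data are hypothetical and do not come from the cited papers.

```python
from typing import Dict, List, Tuple

METRICS = ["AV", "AC", "PR", "UI", "S", "C", "I", "A"]


def per_metric_accuracy(pred: List[Dict[str, str]],
                        gold: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of CVEs whose predicted value matches the reference, per metric."""
    n = len(gold)
    return {m: sum(p[m] == g[m] for p, g in zip(pred, gold)) / n for m in METRICS}


def score_agreement(pred_scores: List[float],
                    gold_scores: List[float]) -> Tuple[float, float]:
    """Return (mean absolute error, exact-match rate) over base scores."""
    n = len(gold_scores)
    mae = sum(abs(p - g) for p, g in zip(pred_scores, gold_scores)) / n
    # Base scores carry one decimal, so compare after rounding to 0.1.
    exact = sum(round(p, 1) == round(g, 1)
                for p, g in zip(pred_scores, gold_scores)) / n
    return mae, exact


# Toy data: two CVEs; the second has a wrong Attack Vector prediction.
gold = [dict(AV="N", AC="L", PR="N", UI="N", S="U", C="H", I="H", A="H"),
        dict(AV="L", AC="L", PR="L", UI="N", S="U", C="H", I="N", A="N")]
pred = [dict(gold[0]), dict(gold[1], AV="N")]

accuracy = per_metric_accuracy(pred, gold)
# Base scores for the predicted and reference vectors, in the same order.
mae, exact = score_agreement([9.8, 6.5], [9.8, 5.5])
```

The toy case also shows why a single misclassified metric matters: predicting Network instead of Local for Attack Vector moves the second score from 5.5 to 6.5.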

3. Vulnerability Scores Beyond Severity: Predictive and Contextual Augmentations

While the CVSS base score reflects intrinsic technical severity, recent scoring systems introduce dimensions of exploit likelihood, systemic context, and risk relevance:

Exploit Prediction Scoring System (EPSS): Assigns each CVE a probability $s_i = \Pr(\text{exploited in 30 days} \mid X_i)$, where $X_i$ is a feature vector (e.g., CVSS, exploit code, social data, reference counts, product/vendor indicators). EPSS leverages XGBoost with ~1,500 features; EPSS v3 achieves $\mathrm{AUC}_{\mathrm{PR}} = 0.7795$, sharply outperforming CVSS base scores for predicting real-world exploits (Jacobs et al., 2023).
Key Risk Indicator (KRI): Integrates EPSS (threat), CVSS-weighted severity (impact), and CWE class prevalence (exposure) into an expected-loss product:

$\mathrm{KRI}(v) = \mathrm{EPSS}(v) \times \mathrm{CVSS}_{\text{wt}}(v) \times \mathrm{CWE}_{\text{wt}}(v)$

This reorders vulnerabilities not only by exploit likelihood, but also by prospective loss magnitude and systemic frequency; it captures 92% of impact-weighted remediation value at budget-limited triage ($k=500$) versus 82% for EPSS alone. KRI outperforms EPSS when the decision-maker’s severity premium exceeds $2\times$ (Sherif et al., 12 Mar 2026).
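A minimal sketch of budget-limited triage with the KRI product is shown below. The weight scales, the identifiers, and the numeric values are hypothetical placeholders chosen for illustration; the cited work defines how $\mathrm{CVSS}_{\text{wt}}$ and $\mathrm{CWE}_{\text{wt}}$ are actually derived.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Vuln:
    cve_id: str     # placeholder identifier, not a real CVE
    epss: float     # exploitation probability in [0, 1]
    cvss_wt: float  # severity weight (hypothetical scale)
    cwe_wt: float   # weakness-class prevalence weight (hypothetical scale)


def kri(v: Vuln) -> float:
    """Expected-loss style product: threat x impact x exposure."""
    return v.epss * v.cvss_wt * v.cwe_wt


def top_k(vulns: List[Vuln], k: int, key: Callable[[Vuln], float]) -> List[str]:
    """Return the IDs of the k highest-ranked vulnerabilities under `key`."""
    return [v.cve_id for v in sorted(vulns, key=key, reverse=True)[:k]]


backlog = [
    Vuln("CVE-A", epss=0.9, cvss_wt=0.3, cwe_wt=0.5),  # likely but low impact
    Vuln("CVE-B", epss=0.5, cvss_wt=1.0, cwe_wt=0.8),  # less likely, severe
    Vuln("CVE-C", epss=0.1, cvss_wt=1.0, cwe_wt=1.0),  # severe but unlikely
]

by_epss = top_k(backlog, 2, key=lambda v: v.epss)
by_kri = top_k(backlog, 2, key=kri)
```

With these toy numbers, EPSS alone ranks CVE-A first, while the KRI product promotes CVE-B because its higher severity and prevalence weights outweigh its lower exploit probability.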

SecScore: Enhances the CVSS Threat group by replacing static exploit maturity with a time-varying, empirically fit probability $p(t)$ that an exploit has been published $t$ weeks post-disclosure. This extends both Temporal and Environmental subscores in a statistically calibrated fashion, sharply improving timeliness and rank correlation with empirical exploit emergence (Santana et al., 2024).

4. Supply Chain and Systemic Impact Scoring

Scoring single vulnerabilities inadequately captures propagation risk in software supply chains and complex deployment topologies:

Vulnerability Propagation Scoring System (VPSS): Measures the scope (breadth and depth) of vulnerability impact via a dynamic analysis of dependency graphs spanning entire ecosystems (e.g., Maven Central). VPSS aggregates direct/transitive project and project-version ratios, normalizes path lengths, and scales the result onto a [0,10] risk tier. The breadth component is derived from normalized counts of affected projects and project versions, and the depth component from the longest propagation chains. VPSS tracks time evolution to reflect remediation and patch adoption (Ruan et al., 2 Jun 2025).

Contextual Scoring (NCVS, DVCC): Integrates system topology, dependency graphs, and asset criticality. NCVS forms a contextual dependency graph, weighting vulnerabilities by PageRank-like measures across hardware, software, and network layers; vulnerabilities are ranked by aggregated importance to service availability (Zhuang et al., 2016). DVCC modifies CVSS exploitability fields in light of asset-specific reachability and deployed defenses, integrates multi-CVE aggregation, and ultimately computes time-evolving asset vulnerability via a fuzzy cognitive map propagation (Cheimonidis et al., 2024).
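To make the breadth and depth notions behind propagation scoring concrete, the sketch below walks a reverse dependency graph from a vulnerable package and counts direct and transitive dependents. It is an illustration under stated assumptions, not the VPSS algorithm: the graph, the package names, and the function `propagation` are hypothetical, and breadth-first hop count is used as a simple stand-in for propagation chain length.

```python
from collections import deque
from typing import Dict, List


def propagation(dependents: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Count direct/transitive dependents of `root` and the deepest hop reached.

    `dependents[p]` lists the packages that depend on p (reverse edges).
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in depth:  # first visit gives the shortest hop count
                depth[child] = depth[node] + 1
                queue.append(child)
    affected = [n for n in depth if n != root]
    direct = sum(1 for n in affected if depth[n] == 1)
    return {
        "direct": direct,
        "transitive": len(affected) - direct,
        "max_depth": max(depth.values()),
    }


# Hypothetical ecosystem of six packages; "e" is unrelated to "lib".
graph = {"lib": ["a", "b"], "a": ["c"], "c": ["d"], "e": []}
all_packages = {"lib", "a", "b", "c", "d", "e"}

stats = propagation(graph, "lib")
affected_ratio = (stats["direct"] + stats["transitive"]) / (len(all_packages) - 1)
```

Here four of the five other packages are reachable from `lib`, two of them directly, and the deepest one sits three hops away. Recomputing these figures on successive graph snapshots would give the kind of time evolution described above.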

5. Empirical Performance and Limitations of Vulnerability Scores

Multiple empirical and statistical studies interrogate the fidelity, discriminative utility, and operational risk-reduction delivered by vulnerability scores:

Discrimination and real-world risk alignment: Case-control studies show that the CVSS base score alone has high sensitivity but negligible specificity: patching all vulnerabilities with CVSS $\geq 6$ reduces empirical exploit risk by only 3.5%, compared to over 60% when incorporating exploit kit presence (Allodi et al., 2013). EPSS and KRI deliver much higher AUC against exploitation ground truth; the operational value of CVSS is primarily as an impact-weighting factor, not a stand-alone remediation filter (Sherif et al., 12 Mar 2026).
Inter-score agreement: Comparative analyses of CVSS, EPSS, SSVC, and Exploitability Index reveal only moderate correlation (including between CVSS and EPSS) and strikingly low top-N overlap; each system preferentially surfaces different subsets of vulnerabilities (Koscinski et al., 19 Aug 2025).
Information cues and human scoring: The accuracy of manual CVSS scoring is primarily improved by explicit attack vector/type details and degraded by the inclusion of known-threat information in base metric assessment—strongly supporting the minimum-field principle for automating severity extraction (Allodi et al., 2018).
Limitations for nontraditional domains: For adversarial attacks on LLMs, classical vulnerability scores (including CVSS) exhibit low variation and entropy across attack classes, failing to distinguish threat surface or impact—a consequence of rigid value sets and context-insensitive dimensions. New LLM-tailored scoring criteria have been proposed to address trust, privacy, and misinformation risk (Bahar et al., 2024).
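The two agreement measures mentioned under inter-score agreement, rank correlation and top-N overlap, can be computed as in this sketch. The scores are invented toy values, the helper names are hypothetical, and the Spearman formula used here assumes no tied scores.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 is the highest score. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {cve: i + 1 for i, cve in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


def top_n_overlap(a: Dict[str, float], b: Dict[str, float], n: int) -> float:
    """Share of the top-n items under score a that are also top-n under b."""
    top_a = set(sorted(a, key=a.get, reverse=True)[:n])
    top_b = set(sorted(b, key=b.get, reverse=True)[:n])
    return len(top_a & top_b) / n


# Toy scores for four placeholder vulnerabilities.
cvss = {"v1": 9.8, "v2": 7.5, "v3": 5.0, "v4": 3.1}
epss = {"v1": 0.02, "v2": 0.90, "v3": 0.40, "v4": 0.01}

rho = spearman(cvss, epss)
overlap = top_n_overlap(cvss, epss, 2)
```

In this toy case the rank correlation is 0.4 while only half of the top-2 lists coincide, which mirrors the pattern described above: two scores can correlate moderately overall and still disagree about which items to fix first.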

6. Domain-Specific and Benchmark-Oriented Vulnerability Metrics

Vulnerability scores also appear in specialized hardware and software testing contexts:

Cache Timing Vulnerability Score (CTVS): Evaluates processor designs against a canonical set of 88 “strong” cache timing side channel patterns representable by automatically generated microbenchmarks. Lower CTVS indicates greater resilience; partitioned sub-scores enable fine-grained diagnosis of architectural weaknesses (Deng et al., 2019).
Vulnerability coverage ratio: In software testing, adequacy is measured by how many known NVD vulnerabilities are “covered” by a test suite—operationalized by generating CVSS pattern vectors and mapping exact metric matches (Dass et al., 2020, Dass et al., 2020).

7. Recommendations and Future Directions

Recent literature converges on several recommendations to maximize the reliability and operational value of vulnerability scores at scale:

Information enhancement: CVE descriptions should be systematically enriched with explicit details about required privileges, user interaction, technical flaw type, configuration prerequisites, and exploit references to support both manual and automated scoring (Jafarikhah et al., 7 Dec 2025, Allodi et al., 2018).
Contextualization and data-driven adaptation: Integration of adversary behavior models, system topology, exploit-likelihood metrics (e.g., EPSS), and risk-weighted composite scores (e.g., KRI) is essential to align prioritization with tangible risk reduction (Sherif et al., 12 Mar 2026, McCoy et al., 2024).
Automation: Combining high-performing ML/LLM models, instruction/prompt tuning, and pipeline integration into ticketing and SIEM platforms enables near-human-speed triage with documented accuracy, especially on well-structured exploitability metrics.
Foundation for LLM- and supply-chain–centric metrics: As attack surfaces diversify (e.g., LLM adversarial input, software supply chain propagation), vulnerability scoring must evolve to support greater dimensionality, dynamic factors, empirical calibration, and context-awareness.

Continued empirical validation, open-science benchmarking, and transparent metric documentation are fundamental to the defensibility and progress of vulnerability scoring methodologies.