DefenSee: Risk Analysis & Defense Metrics

Updated 8 December 2025

DefenSee is a framework that combines a defensibility index for critical systems with a multi-view, black-box pipeline for safeguarding multi-modal LLMs.
It employs rigorous mathematical formulations to quantify defense investments, measuring cost-effectiveness and guiding resource allocation.
The multi-modal pipeline leverages image analysis, OCR transcription, and cross-modal consistency checks to reduce jailbreak attack success rates to below 2%.

DefenSee refers to a family of modern methodologies and systems for risk analysis and defense, spanning domains from critical infrastructure to multi-modal machine learning models. It is most prominently associated with two contemporary interpretations: (1) as an explicit index for quantifying the defensibility of critical asset systems (Bier et al., 2019), and (2) as a black-box multi-view pipeline for defending multi-modal LLMs (MLLMs) against jailbreak attacks (Wang et al., 1 Dec 2025). The following sections provide a comprehensive treatment of both the classical and the recent multi-modal DefenSee formulations.

1. Conceptual Foundations and Terminology

DefenSee, as an index for critical systems, characterizes the potential efficacy of defensive investments—that is, it measures the degree to which marginal increases in defensive resources mitigate the damage from attacks or disruptions. Where traditional indices such as vulnerability or resilience quantify, respectively, system susceptibility and recovery capacity, the defensibility index answers the question: “To what extent would an incremental defense reduce attack-induced loss per unit of effort?” This measure is defined to be dimensionless, bounded between zero (no benefit from defense) and one (maximum theoretical cost-effectiveness) (Bier et al., 2019).

In contrast, the modern DefenSee system for multi-modal LLM defense is an automated, black-box, multi-stage pipeline. This system targets the “vulnerability gap” that emerges when models are extended beyond text inputs to fuse images and language, which standard text-centric alignment cannot close (Wang et al., 1 Dec 2025).

2. Mathematical Formalism and Analytical Properties

2.1. Defensibility Index for Critical Systems

For a system of initial (pre-attack, pre-defense) value $U = V(0,0)$ , attacker strength $a \geq 0$ , defender investment $b \geq 0$ , and post-attack value $V(a,b)$ , defensibility is

$D(a, b) = \frac{V(a, b) - V(a, 0)}{U}$

This metric quantifies the fractional improvement in residual value due solely to defense $b$ .

For discrete asset systems (assets $i = 1, ..., n$ with values $V_i$ ), and under mutually binary defense/attack (protected assets survive, attackers optimally destroy top undefended assets), the corresponding expressions are:

Post-attack value: $V(a, b) = U - \sum_{i=b+1}^{b+a} V_i$
Defensibility: $D(a, b) = \frac{\sum_{i=1}^b V_i - \sum_{i=b+1}^{b+a} V_i}{U}$

Key properties include:

Monotonicity: $D(a, b)$ increases with both $a$ and $b$ .
Marginal trade-off: $\partial D/\partial b$ exceeds $\partial D/\partial a$ when $b \leq a$ .
Distribution sensitivity: For positively skewed (convex) asset values, small investments yield high defensibility. For concave distributions, defensibility rises only with large $b$ .

DefenSee implements a multi-stage defense (Wang et al., 1 Dec 2025):

Image Content Analysis: For input image $I$ $I$ , create two variants:
- $I_h = T_1(I)$ : ScoreCam pseudo-color heatmap version amplifying salient regions.
- $I_i = T_2(I)$ : U²-Net foreground-only mask.
- $T_{ocr} = T_3(I)$ : BLIP-2 based OCR text extraction.
Variant Transcription: Pass variants through a black-box vision-LLM (VLM, e.g. GPT-4o) to obtain textual descriptions $\mathcal{T} = \{T_h, T_i, T_{ocr}\}$ .
Cross-Modal Consistency Check: Map images and text into a shared $d$ -dimensional embedding space ( $f_{img}$ , $f_{text}$ ; e.g., CLIP). Cosine similarity

$S(I, T) = \frac{\langle f_{img}(I), f_{text}(T) \rangle}{\|f_{img}(I)\|\|f_{text}(T)\|}$

If $S(I, T) < \tau$ (typically $\tau \approx 0.7$ ), the pair is flagged as inconsistent or suspicious.

Reference Set Gating: To avoid over-defense, similarity to precomputed benign ( $\mathcal Q_{\text{ben}}$ ) and malicious ( $\mathcal Q_{\text{mal}}$ ) sets controls pipeline activation.
Decision Aggregation: For each view $j$ , refusal votes $s_j \in \{0,1\}$ (1 if model refuses) are aggregated: $\hat y = \begin{cases} \text{Refuse}, & \sum_j s_j \geq 2 \ \text{Answer}, & \text{otherwise} \end{cases}$

3. Defense Workflow, Implementation, and Complexity

The DefenSee LLM pipeline operates in real time, treating the MLLM as a true black box. Major steps:

Preprocessing: Variant creation ( $\mathcal O(HWC)$ per variant).
Transcription: Black-box VLM invocation ( $\mathcal O(P)$ tokens).
Embedding and Similarity: CLIP embedding ( $\mathcal O(d)$ per pair).
Thresholding and Activation: Compare with $\mathcal Q_{\text{mal}}$ , $\mathcal Q_{\text{ben}}$ .
CoT-style Prompt Aggregation: Multi-view defense prompt fusion and majority-vote refusal aggregation.

Measured end-to-end latency is ∼6.68 s using black-box VLMs and CLIP, notably higher than AdaShield-S/A (∼5.23–5.27 s) but significantly lower than ECSO ( > 15 s).

4. Empirical Evaluation and Comparative Performance

Experiments on MM-SafetyBench (1,680 queries, 5,040 image–text pairs, 13 unsafe categories) show:

Attack Success Rate (ASR): For MiniGPT4,
- No defense: 46–48%
- DefenSee: $\leq$ 1.70%
- AdaShield-S/A: 7–11%
- ECSO: 41–45%
False Rejection Rate (FRR) (MM-Vet benign queries, MiniGPT4): DefenSee achieves 12.2%, outperforming AdaShield-S (19.0%) and AdaShield-A (13.1%) at similar ASR levels.

The trade-off is explicit: DefenSee reduces ASR by over an order of magnitude while maintaining competitive FRR. A grid search over threshold pairs $(\tau_{\text{mal}},\tau_{\text{ben}})$ yields minimal combined risk when $\tau_{\text{mal}}=0.72,\ \tau_{\text{ben}}=0.16$ .

5. Practical Considerations, Limitations, and Future Work

The core DefenSee MLLM methodology is lightweight, black-box, and training-free. However:

It depends on external VLMs (e.g., GPT-4o) for high-quality transcription, which may introduce availability/cost dependencies and susceptibility to model drift.
Current evaluations focus on structure-based attacks; pure adversarial perturbations require additional study.
The aggregation method (majority voting) and threshold parameters are currently hand-tuned; integration of reinforcement learning or meta-optimization is envisioned.
Overhead is moderate (+1.5 s per query) but substantially less than multi-stage symbolic or rule-based defenses.

Suggested future improvements include extending the pipeline for perturbation-based and multi-turn attacks, automating prompt template adaptation, and introducing end-to-end learnable thresholds.

6. Positioning within the Broader Landscape

The original DefenSee index fills a gap neglected by indices such as resilience or vulnerability: namely, the cost-effectiveness of defense itself (Bier et al., 2019). As a metric, it enables meaningful comparison between disparate systems and justifies resource allocation by marginal improvement, not just absolute risk.

In the security of neural systems, the DefenSee pipeline is distinguished from other baselines (e.g., AdaShield-S/A, ECSO) by its multi-view, black-box approach and its emphasis on robust, consensus-based refusal. In comparative context, Sentra-Guard represents an alternative, modular, transparent, and multilingual-adapted defense realizing 0.004% ASR and 99.996% detection over HarmBench-28K (Hasan et al., 26 Oct 2025). Integrating semantics-driven retrieval, transformer classifiers, and human-in-the-loop learning, Sentra-Guard defines current state-of-the-art standard operating procedures, which DefenSee can complement or combine with for complete, production-grade safety pipelines.

7. Summary Table: DefenSee in Two Contexts

Domain	Core Principle	Key Metric / Outcome
Critical Systems (risk index)	Marginal value gain from incremental defense	Dimensionless defensibility index $D$ ≤ 1 (Bier et al., 2019)
Multi-modal LLM Jailbreak Defense	Multi-view, variant-based black-box detection	ASR ≤ 1.70%, FRR ≈ 12% (Wang et al., 1 Dec 2025)

DefenSee thus serves as both a formal analytical index guiding critical infrastructure defense strategy, and as a practical, high-impact AI safety pipeline for advancing the robustness of next-generation multi-modal LLMs.

Markdown Upgrade to Chat

References (3)

Risk analysis beyond vulnerability and resilience - characterizing the defensibility of critical systems (2019)

DefenSee: Dissecting Threat from Sight and Text - A Multi-View Defensive Pipeline for Multi-modal Jailbreaks (2025)

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DefenSee.