DefenSee: Risk Analysis & Defense Metrics
- DefenSee is a framework that combines a defensibility index for critical systems with a multi-view, black-box pipeline for safeguarding multi-modal LLMs.
- It employs rigorous mathematical formulations to quantify defense investments, measuring cost-effectiveness and guiding resource allocation.
- The multi-modal pipeline leverages image analysis, OCR transcription, and cross-modal consistency checks to reduce jailbreak attack success rates to below 2%.
DefenSee refers to a family of modern methodologies and systems for risk analysis and defense, spanning domains from critical infrastructure to multi-modal machine learning models. It is most prominently associated with two contemporary interpretations: (1) as an explicit index for quantifying the defensibility of critical asset systems (Bier et al., 2019), and (2) as a black-box multi-view pipeline for defending multi-modal LLMs (MLLMs) against jailbreak attacks (Wang et al., 1 Dec 2025). The following sections provide a comprehensive treatment of both the classical and the recent multi-modal DefenSee formulations.
1. Conceptual Foundations and Terminology
DefenSee, as an index for critical systems, characterizes the potential efficacy of defensive investments—that is, it measures the degree to which marginal increases in defensive resources mitigate the damage from attacks or disruptions. Where traditional indices such as vulnerability or resilience quantify, respectively, system susceptibility and recovery capacity, the defensibility index answers the question: “To what extent would an incremental defense reduce attack-induced loss per unit of effort?” This measure is defined to be dimensionless, bounded between zero (no benefit from defense) and one (maximum theoretical cost-effectiveness) (Bier et al., 2019).
In contrast, the modern DefenSee system for multi-modal LLM defense is an automated, black-box, multi-stage pipeline. This system targets the “vulnerability gap” that emerges when models are extended beyond text inputs to fuse images and language, which standard text-centric alignment cannot close (Wang et al., 1 Dec 2025).
2. Mathematical Formalism and Analytical Properties
2.1. Defensibility Index for Critical Systems
For a system with initial (pre-attack, pre-defense) value $v$, attacker strength $a$, defender investment $d$, and post-attack value $V(a,d)$, defensibility is

$$D(a,d) = \frac{V(a,d) - V(a,0)}{v - V(a,0)}.$$

This metric quantifies the fractional improvement in residual value due solely to the defense investment $d$: it equals zero when defense confers no benefit and one when defense eliminates the attack-induced loss entirely.
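A quick numerical sketch of this definition, assuming the natural normalization of the residual-value gain by the no-defense loss (illustrative values, not from the paper):

```python
def defensibility(v, v_attack_only, v_attack_defended):
    """Fractional reduction in attack-induced loss attributable to defense.

    v                 -- pre-attack, pre-defense system value
    v_attack_only     -- post-attack value with no defense, V(a, 0)
    v_attack_defended -- post-attack value with defense d, V(a, d)
    """
    return (v_attack_defended - v_attack_only) / (v - v_attack_only)

# Illustrative numbers: an attack that would destroy 60 of 100 units,
# with defense recovering half of that loss.
print(defensibility(100, 40, 70))  # 0.5
```

The index is dimensionless by construction, so systems measured in different units remain comparable.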
For discrete asset systems ($n$ assets with values $v_1 \ge v_2 \ge \dots \ge v_n$), and under mutually binary defense/attack (the defender protects the $d$ most valuable assets, which survive; the attacker optimally destroys the $a$ most valuable undefended assets), the corresponding expressions are:
- Post-attack value: $V(a,d) = \sum_{i=1}^{n} v_i - \sum_{i=d+1}^{d+a} v_i$
- Defensibility: $D(a,d) = 1 - \dfrac{\sum_{i=d+1}^{d+a} v_i}{\sum_{i=1}^{a} v_i}$
Key properties include:
- Monotonicity: $D(a,d)$ increases with both the attacker strength $a$ and the defense investment $d$.
- Marginal trade-off: $D(a, d+1)$ exceeds $D(a, d)$ when $v_{d+1} > v_{d+a+1}$, i.e., when the newly defended asset is strictly more valuable than the next asset the attacker would destroy instead.
- Distribution sensitivity: For positively skewed (convex) asset value distributions, small investments $d$ yield high defensibility. For concave distributions, defensibility rises only with large $d$.
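A minimal sketch of the discrete-asset case, assuming protected assets survive and the attacker destroys the $a$ most valuable undefended assets (asset values are illustrative):

```python
def discrete_defensibility(values, a, d):
    """Defensibility for a discrete asset system: the defender protects the
    d most valuable assets (which survive); the attacker then destroys the
    a most valuable undefended assets."""
    v = sorted(values, reverse=True)
    loss_no_defense = sum(v[:a])      # attacker takes the overall top a
    loss_defended = sum(v[d:d + a])   # attacker takes the top a undefended
    return 1 - loss_defended / loss_no_defense

# Distribution sensitivity: a skewed portfolio rewards a single unit of
# defense, while a flat portfolio gains nothing from it.
skewed = [100, 5, 5, 5, 5]
flat = [24, 24, 24, 24, 24]
print(round(discrete_defensibility(skewed, a=2, d=1), 3))  # 0.905
print(discrete_defensibility(flat, a=2, d=1))              # 0.0
```

The two portfolios have similar totals, yet one unit of defense is highly cost-effective only in the skewed case, which is exactly the distribution-sensitivity property above.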
2.2. DefenSee Pipeline for Multi-modal LLMs
DefenSee implements a multi-stage defense (Wang et al., 1 Dec 2025):
- Image Content Analysis: For an input image $I$, create two complementary variants $I_1$ and $I_2$ (corresponding to the image-analysis and OCR views of the pipeline).
- Variant Transcription: Pass the variants through a black-box vision-LLM (VLM, e.g., GPT-4o) to obtain textual descriptions $T_1$ and $T_2$.
- Cross-Modal Consistency Check: Map images and text into a shared $d$-dimensional embedding space via encoders $f_{\text{img}}$ and $f_{\text{txt}}$ (e.g., CLIP) and compute the cosine similarity
  $$s(I, T) = \frac{f_{\text{img}}(I) \cdot f_{\text{txt}}(T)}{\lVert f_{\text{img}}(I) \rVert \, \lVert f_{\text{txt}}(T) \rVert}.$$
  If $s(I, T)$ falls below a tuned threshold $\tau$, the pair is flagged as inconsistent or suspicious.
- Reference Set Gating: To avoid over-defense, the input's similarity to precomputed benign and malicious reference sets controls whether the full pipeline activates.
- Decision Aggregation: For each of the $K$ views $k$, a refusal vote $r_k \in \{0, 1\}$ (1 if the model refuses) is recorded; the final decision is a majority vote, refusing iff $\sum_{k} r_k > K/2$.
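The consistency check and vote aggregation can be sketched with plain NumPy stand-ins for the encoders (the toy embeddings, threshold value, and helper names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_suspicious(image_emb, text_emb, tau):
    """Flag an image/transcription pair whose embeddings disagree."""
    return cosine_similarity(image_emb, text_emb) < tau

def aggregate_refusals(votes):
    """Majority vote over per-view refusal decisions (1 = refuse)."""
    return sum(votes) > len(votes) / 2

# Toy embeddings: a faithful transcription points roughly the same way
# as its image; a mismatched one does not.
img = np.array([1.0, 0.0, 0.0])
good_txt = np.array([0.9, 0.1, 0.0])
bad_txt = np.array([0.0, 0.0, 1.0])
print(is_suspicious(img, good_txt, tau=0.5))  # False
print(is_suspicious(img, bad_txt, tau=0.5))   # True
print(aggregate_refusals([1, 1, 0]))          # True
```

In practice the embeddings would come from a CLIP-style dual encoder rather than hand-written vectors; the decision logic is unchanged.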
3. Defense Workflow, Implementation, and Complexity
The DefenSee LLM pipeline operates in real time, treating the MLLM as a true black box. Major steps:
- Preprocessing: Variant creation (image-processing cost linear in image size, per variant).
- Transcription: Black-box VLM invocation (cost proportional to the number of generated tokens).
- Embedding and Similarity: CLIP encoding and cosine similarity (one encoder forward pass per image–text pair).
- Thresholding and Activation: Comparison of similarity scores against the consistency threshold $\tau$ and the reference-set gates.
- CoT-style Prompt Aggregation: Multi-view defense prompt fusion and majority-vote refusal aggregation.
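The reference-set gating step can be sketched as follows (toy reference embeddings; the max-similarity activation rule is an assumption, not the paper's exact criterion):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def should_activate(query_emb, benign_refs, malicious_refs):
    """Run the full defense only when the query embedding is closer to the
    malicious reference set than to the benign one, limiting over-defense
    on clearly harmless inputs."""
    benign = max(cosine(query_emb, r) for r in benign_refs)
    malicious = max(cosine(query_emb, r) for r in malicious_refs)
    return malicious >= benign

benign_refs = [np.array([1.0, 0.0])]
malicious_refs = [np.array([0.0, 1.0])]
print(should_activate(np.array([0.2, 0.9]), benign_refs, malicious_refs))  # True
print(should_activate(np.array([0.9, 0.1]), benign_refs, malicious_refs))  # False
```

Because the expensive transcription and consistency stages only run when the gate fires, this is where the pipeline recovers latency on benign traffic.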
Measured end-to-end latency is ∼6.68 s using black-box VLMs and CLIP, notably higher than AdaShield-S/A (∼5.23–5.27 s) but significantly lower than ECSO (>15 s).
4. Empirical Evaluation and Comparative Performance
Experiments on MM-SafetyBench (1,680 queries, 5,040 image–text pairs, 13 unsafe categories) show:
- Attack Success Rate (ASR), MiniGPT4:
  - No defense: 46–48%
  - DefenSee: 1.70%
  - AdaShield-S/A: 7–11%
  - ECSO: 41–45%
- False Rejection Rate (FRR) (MM-Vet benign queries, MiniGPT4): DefenSee achieves 12.2%, outperforming AdaShield-S (19.0%) and AdaShield-A (13.1%) at similar ASR levels.
The trade-off is explicit: DefenSee reduces ASR by over an order of magnitude while maintaining a competitive FRR. A grid search over consistency and gating threshold pairs identifies the combination that minimizes the combined ASR–FRR risk.
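The grid search can be sketched for a single consistency threshold over synthetic scores (the score distributions and the ASR + FRR objective are illustrative assumptions; the actual search is over threshold pairs):

```python
import random

random.seed(0)
# Synthetic consistency scores: attacks tend to be cross-modally
# inconsistent (low similarity), benign queries consistent (high).
attack_scores = [random.gauss(0.35, 0.10) for _ in range(500)]
benign_scores = [random.gauss(0.75, 0.10) for _ in range(500)]

def combined_risk(tau):
    asr = sum(s >= tau for s in attack_scores) / len(attack_scores)  # missed attacks
    frr = sum(s < tau for s in benign_scores) / len(benign_scores)   # false rejections
    return asr + frr

grid = [t / 100 for t in range(20, 91)]
best_tau = min(grid, key=combined_risk)
print(best_tau)
```

A weighted objective (e.g., penalizing ASR more than FRR) drops in by changing one line of `combined_risk`.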
5. Practical Considerations, Limitations, and Future Work
The core DefenSee MLLM methodology is lightweight, black-box, and training-free. However:
- It depends on external VLMs (e.g., GPT-4o) for high-quality transcription, which may introduce availability/cost dependencies and susceptibility to model drift.
- Current evaluations focus on structure-based attacks; pure adversarial perturbations require additional study.
- The aggregation method (majority voting) and threshold parameters are currently hand-tuned; integration of reinforcement learning or meta-optimization is envisioned.
- Overhead is moderate (≈1.5 s per query over AdaShield-class defenses) but substantially less than multi-stage symbolic or rule-based defenses.
Suggested future improvements include extending the pipeline for perturbation-based and multi-turn attacks, automating prompt template adaptation, and introducing end-to-end learnable thresholds.
6. Positioning within the Broader Landscape
The original DefenSee index fills a gap neglected by indices such as resilience or vulnerability: namely, the cost-effectiveness of defense itself (Bier et al., 2019). As a metric, it enables meaningful comparison between disparate systems and justifies resource allocation by marginal improvement, not just absolute risk.
Within the security of neural systems, the DefenSee pipeline is distinguished from baselines such as AdaShield-S/A and ECSO by its multi-view, black-box approach and its emphasis on robust, consensus-based refusal. For comparative context, Sentra-Guard represents an alternative modular, transparent, and multilingual-adapted defense, reporting 0.004% ASR and 99.996% detection over HarmBench-28K (Hasan et al., 26 Oct 2025). By integrating semantics-driven retrieval, transformer classifiers, and human-in-the-loop learning, Sentra-Guard sets current state-of-the-art operating standards, which DefenSee can complement or combine with for complete, production-grade safety pipelines.
7. Summary Table: DefenSee in Two Contexts
| Domain | Core Principle | Key Metric / Outcome |
|---|---|---|
| Critical Systems (risk index) | Marginal value gain from incremental defense | Dimensionless defensibility index ≤ 1 (Bier et al., 2019) |
| Multi-modal LLM Jailbreak Defense | Multi-view, variant-based black-box detection | ASR ≤ 1.70%, FRR ≈ 12% (Wang et al., 1 Dec 2025) |
DefenSee thus serves as both a formal analytical index guiding critical infrastructure defense strategy, and as a practical, high-impact AI safety pipeline for advancing the robustness of next-generation multi-modal LLMs.