Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM (2507.20994v1)

Published 28 Jul 2025 in cs.CV and cs.AI

Abstract: Large visual-LLMs (LVLMs) integrate aligned LLMs with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model's parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs' ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module's textual "safety layers" in visual inputs, thereby effectively extending text-based safety to the visual modality.

Collections

Summary

The paper introduces security tensors that transfer the safety mechanisms of text to visual inputs without modifying the model parameters.
It demonstrates significant improvements in rejecting harmful visual content while maintaining performance on benign tasks.
The method achieves robust generalization across seen and unseen malicious categories with minimal trade-offs in model performance.

This paper introduces "security tensors," a novel approach to enhance the safety of Large Vision-LLMs (LVLMs) against harmful visual inputs by extending text-aligned safety mechanisms to the visual modality. The method involves training input vectors, referred to as security tensors, that can be applied during inference to either the textual or visual modality, effectively transferring textual safety alignment to visual processing without modifying the model’s parameters. The paper demonstrates that security tensors significantly improve LVLMs' ability to reject diverse harmful visual inputs while maintaining performance on benign tasks.

Background and Motivation

LVLMs, which integrate LLMs with visual modules, are vulnerable to harmful image inputs because the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities. This discrepancy arises because the visual understanding stage occurs after the language module's training, causing inconsistencies in encoding across modalities that can be exploited by malicious visual inputs. Existing methods to address this issue often involve fine-tuning model parameters using additional visual safety datasets, leading to disjointed security architectures and high computational costs. The paper posits that a more effective approach would leverage the language module’s intrinsic capacity to distinguish malicious content by directly perturbing input representations to align harmful visual patterns with textual safety-aligned semantic space.

Methodology: Security Tensors

The core of the proposed method is the introduction of security tensors, which are trainable input perturbations injected into either the textual or visual modalities. These tensors are designed to activate the language module’s pre-trained textual safety mechanisms in response to visual inputs. The security tensors are optimized using a curated dataset comprising three subsets:

Safety Activation (SA) Set: Malicious image-text pairs paired with rejection outputs train the model to associate harmful visual patterns with the language module’s pre-trained safety mechanisms.
Text Contrast Benign (TCB) Set: Benign image-text queries with syntactic/structural similarity to malicious inputs ensure that the tensors learn visual safety cues rather than exploiting textual artifacts.
General Benign (GB) Set: General benign image-text pairs maintain the model’s performance on harmless tasks, preventing over-restriction.
Figure 1: Examples of image-text query for SA, TCB, and GB ets, highlighting the intentional textual similarity between the TCB and SA sets.

Implementation Details

Textual security tensors, denoted as $\delta_t$ , are learnable vectors in the embedding space of LVLMs and are inserted between the image and text token embeddings. The perturbed embedding sequence $\tilde{\mathbf{E}$ is formulated as:

$\tilde{\mathbf{E} = [\mathbf{E}_{\text{img}; \delta_t; \mathbf{E}_{\text{text}],$

Visual security tensors, denoted as $\delta_v$ , are applied to the preprocessed image space. By operating on the preprocessed image rather than the raw input image, $\delta_v$ can adapt to arbitrary input resolutions.

Experimental Evaluation

The authors conducted experiments to evaluate the effectiveness and generalization ability of the security tensors. The models were tested on both "seen-category" samples (harmful image categories present in the training set) and "unseen-category" samples (novel harmful categories absent in training). The Harmless Rate (HR), defined as the proportion of queries that the LVLM successfully refuses to answer, was used as the primary metric for security performance. Benignness was assessed using the False Rejection Rate (FRR) and the MM-Vet score.

The experimental results indicate that both textual and visual security tensors significantly enhance the visual safety of LVLMs. The effectiveness of the security tensors correlates positively with the inherent safety of the language module in each LVLM. Both $\delta_v$ and $\delta_t$ improve the model's safety performance on harmful image categories in the training dataset and generalize well to unseen malicious categories. In terms of benignness, the introduction of $\delta_v$ and $\delta_t$ causes minimal performance degradation.

Figure 2: Training loss curves for $\delta_v$ and $\delta_t$ across LVLMs, showing visual and textual tensor training loss values across epochs.

Internal Analysis: Safety Layer Activation

The paper includes an internal analysis of the LVLM's hidden-layer representations to understand how security tensors achieve cross-modal safety activation. The analysis identifies "safety layers" within the LVLM’s language module that play a crucial role in distinguishing malicious textual content from benign content. The introduction of either visual or textual security tensors reactivates these safety layers during harmful image-text processing, aligning their activation patterns with those seen in text-based safety scenarios. This suggests that security tensors successfully extend the language module’s pre-trained textual safety mechanisms to handle visual content.

Figure 3: (n=10)

Figure 4: (n=10)

Figure 5: Examples of adversarial image-text query examples for SA and new TCB test set, highlighting the intentionally designed textual similarity between the two sets.

Ablation Studies

Ablation studies were conducted to assess the importance of the Text Contrast Benign (TCB) set. The results indicated that training security tensors without the TCB set led to a significant drop in the harmless rate and increased false rejection rate, suggesting that the TCB set is crucial for guiding the security tensors to attend to visual information.

Conclusion

The paper demonstrates that text-aligned safety mechanisms in LVLMs can be effectively extended to the visual modality via input-level security tensors. The security tensors act as a bridge between modalities, enabling LVLMs to generalize safety behavior from text to vision while preserving performance on benign inputs. This approach not only improves robustness against visual threats but also provides insights into cross-modal safety alignment, offering a practical pathway for improving safety in multimodal models.