GuardRank: Automated Safety Assessment

Updated 19 November 2025

GuardRank is a lightweight, plug-and-play evaluator designed to emulate expert human judgment by assessing key safety dimensions in MLLM image-text responses.
It employs a dual-stage evaluation pipeline with LLaMA-2 for multi-class tasks and RoBERTa-large for binary tasks, ensuring robust and scalable assessments.
GuardRank outperforms GPT-4 by nearly doubling accuracy on safety dimensions through comprehensive red-teaming on an adversarial dataset.

GuardRank is a lightweight, plug-and-play automated evaluator designed to emulate expert human judgement in the safety assessment of open-ended image-text responses generated by Multimodal LLMs (MLLMs). Developed within the MLLMGuard framework, GuardRank systematically addresses five critical safety dimensions—Privacy, Bias, Toxicity, Truthfulness, and Legality—offering scalable and rigorous safety monitoring for MLLMs. It achieves state-of-the-art evaluation accuracy compared with existing LLM-based evaluators such as GPT-4, driven by fine-tuning on an adversarial image-text dataset assembled via advanced red-teaming techniques (Gu et al., 2024).

1. Safety Dimensions and Definition

GuardRank incorporates five domains in alignment with the MLLMGuard benchmark:

Privacy: Detection of personal, trade-secret, and state-secret information leakage.
Bias: Assessment of stereotype propagation, prejudice, and discrimination.
Toxicity: Identification of hate speech, violent or pornographic content.
Truthfulness: Evaluation for hallucinations and robustness to position swapping or noise.
Legality: Analysis of guidance related to personal safety and public security.

Each dimension comprises multiple subtasks reflecting real-world adversarial threats, as codified in the dataset’s construction and annotation methodology.

2. Architecture and Workflow

The GuardRank evaluation pipeline comprises distinct stages optimized for both multi-class and binary tasks:

Input Preparation: The user prompt, consisting of an image description and text query, is concatenated with the corresponding MLLM response into a standardized text template.
Dimension Routing:
- Four-class tasks (Privacy, Bias, Toxicity, Legality): Input is analyzed by a fine-tuned LLaMA-2 classifier.
- Binary task (Truthfulness—hallucination): Input is processed by a fine-tuned RoBERTa-large classifier.
Scoring Head: Model hidden states are mapped via a feed-forward layer to discrete integer score labels (0–3 for four-class tasks; 0/1 for binary hallucination).
Output Aggregation: Predicted scores undergo linear smoothing onto a 0–1 scale, aggregated to derive evaluation metrics per safety dimension.

3. Evaluation Metrics

GuardRank automates the human scoring procedure defined for MLLMGuard, providing standardized quantitative metrics for safety assessment:

Attack Success Degree (ASD): Lower values denote enhanced safety.

$ASD_{i} = \frac{1}{|R_{i}|}\sum_{(p,r)\in R_{i}} \mathrm{Smooth}\bigl(\mathrm{Score}(p,r)\bigr)\;\times 100$

$R_{i}$ : Set of prompt–response pairs for dimension $i$ .
$\mathrm{Score}(\cdot)\in \{0,1,2,3\}$ : Integer label assigned.
$\mathrm{Smooth}(\cdot)$ $Smooth (\cdot)$ : Linear mapping from $\{0,1,2,3\}$ ${0, 1, 2, 3}$ to $[0,1]$ $[0, 1]$ .
- Perfect Answer Rate (PAR): Higher values denote improved safety.

$PAR_{i} = \frac{1}{|R_{i}|}\sum_{(p,r)\in R_{i}}\mathbf{1}\{\mathrm{Score}(p,r)=0\}\;\times 100\%$

4. Training Data and Red-Teaming Methodology

GuardRank is trained on the human-annotated MLLMGuard dataset, encompassing 2,282 adversarial image-text examples. Sample construction employs both image-based and text-based attacks:

Image-based attacks: Facial stereotype, malicious labeling, non-existent queries, position swapping, noise injection, and harmful scenario generation.
Text-based attacks: Disguise, reverse induction, unsafe inquiry, and indirect task misalignment.

Out-of-distribution validation utilizes outputs from XComposer2-VL for validation and LLaVA-v1.5 / Qwen-VL-Chat for test, enhancing generalization assessment.

5. Comparative Performance Analysis

GuardRank surpasses GPT-4, both zero-shot and with in-context learning, in accuracy across all safety dimensions. Results on held-out test data are summarized below:

Evaluator	Privacy	Bias	Toxicity	Halluc.	Legality	Avg.
GPT-4 (zero-shot)	27.9	30.6	12.1	38.9	37.5	29.4
GPT-4 (1-shot)	31.4	30.5	35.9	61.9	54.1	42.8
GuardRank (ours)	68.3	70.3	79.8	97.2	69.8	77.1

GuardRank nearly doubles the best in-context learning accuracy of GPT-4 (77.1% vs. 42.8%), with dimension-wise improvements ranging from 20 to 80 percentage points (Gu et al., 2024).

6. Component Analysis and Ablation Insights

Empirical findings indicate:

Backbone Selection: LLaMA-2 with LoRA fine-tuning achieves approximately 67–68% test accuracy on multi-class tasks, outperforming single backbone alternatives such as RoBERTa-large (~64%).
Hallucination Head: RoBERTa-large demonstrably exceeds the accuracy of ViT-LLM hybrids and BERT-base variants for binary hallucination detection (97.2% vs. 62–70%).

This suggests that a robust open LLM backbone is essential for reliable multi-class safety assessment, while a targeted binary classifier optimally detects hallucinations.

7. Limitations and Prospective Developments

GuardRank’s design, while effective, manifests certain constraints:

The token cap (128 tokens) can limit processing of extensive multi-turn dialogues.
Its single-turn evaluation paradigm precludes nuanced inter-turn reasoning.
Lightweight LLaMA-2 and RoBERTa-large backbones may be upgraded to multimodal or retrieval-augmented architectures.
The curated human adversarial dataset, while comprehensive, poses scaling challenges; semi-automated red-teaming may extend coverage.
Ongoing evolution of safety domains (e.g., privacy regulation compliance, dynamic adversarial manipulation) remains imperative as threat landscapes shift.

A plausible implication is that future developments may integrate broader contextual analysis and more scalable annotation mechanisms to enhance coverage and robustness.

PDF Markdown Chat (Pro)

References (1)

MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models (2024)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to GuardRank.