
MJ-Bench: Multimodal Model Evaluation

Updated 1 February 2026
  • MJ-Bench is a family of systematic benchmarks designed to assess multimodal and language model reward functions, highlighting alignment, safety, quality, and bias concerns.
  • The suite comprises MiJaBench for minority jailbreaking audits, MJ-Bench for text-to-image evaluation, and MJ-Bench-Video for fine-grained video model assessments.
  • Rigorous protocols using metrics like Defense Rate and preference accuracy uncover group-specific vulnerabilities and inform improvements in model alignment.

MJ-Bench is a family of systematic benchmarks designed for rigorous evaluation of multimodal and LLM reward functions. These datasets target model alignment, safety, quality, and bias by deploying preference-based or adversarial tasks tailored to reveal model weaknesses overlooked by traditional aggregate metrics. Notable instantiations include MiJaBench for minority jailbreaking audits in LLMs, MJ-Bench for text-to-image reward model evaluation, and MJ-Bench-Video for fine-grained video generation assessment. This article summarizes their construction, evaluation protocols, findings, and implications for model alignment.

1. Benchmark Scope and Motivation

The MJ-Bench suite addresses critical gaps in current model evaluation. Existing benchmarks typically collapse diverse safety or preference objectives into scalar scores, masking vulnerabilities and blind spots. MiJaBench (Brito et al., 7 Jan 2026) specializes in adversarial hate-speech jailbreaks across 16 minority demographics to expose selective safety in LLMs. MJ-Bench (Chen et al., 2024) comprehensively assesses reward models (“judges”) for text-to-image generators across alignment, safety, image quality, and bias. MJ-Bench-Video (Tong et al., 3 Feb 2025) extends the framework to text-to-video models, incorporating 28 fine-grained criteria across five high-level dimensions. In all cases, the overarching goal is to diagnose failure modes, assess group-specific or scenario-sensitive model behavior, and guide improvements toward robust alignment.

2. Dataset Construction and Structure

MiJaBench

MiJaBench consists of 44,000 bilingual (English/Portuguese) prompts constructed from hate-speech seeds sourced from ToxiGen and ToxSyn, distributed across 16 minority groups. Each group is represented by 2,000 toxic samples, stratified into categories such as race, ethnicity, gender, disability, and nationality. Prompts are embedded in 21 contextual scenarios, with deliberate use of adversarial rewriting methods to maximize jailbreaking pressure and eliminate non-adversarial refusals (Brito et al., 7 Jan 2026).

MJ-Bench (Text-to-Image)

MJ-Bench assembles a triplet-based preference dataset, where each data point comprises a textual instruction I, a preferred image M_p, and a rejected image M_n. Subcategories span four perspectives: alignment (object, attribute, action, count, spatial), safety (toxicity, NSFW variants), image quality (blur, distortion), and bias (occupation, education scenarios with demographic variation). Sources include Pick-a-pic, HPDv2, ImageRewardDB, and human-verified synthetic edits (Chen et al., 2024).
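Concretely, each such data point can be represented as a small record. The sketch below is illustrative only; the field names and example values are hypothetical, not the released dataset schema:

```python
from dataclasses import dataclass

@dataclass
class PreferenceTriplet:
    """One triplet-style data point: an instruction with a preferred/rejected image pair."""
    instruction: str       # textual instruction I
    preferred_image: str   # path or ID of the preferred image M_p
    rejected_image: str    # path or ID of the rejected image M_n
    perspective: str       # "alignment" | "safety" | "quality" | "bias"
    subcategory: str       # e.g. "count", "nsfw", "blur", "occupation"

# Hypothetical alignment-perspective example probing object counting
sample = PreferenceTriplet(
    instruction="a photo of three red apples on a table",
    preferred_image="img_1023_pos.png",
    rejected_image="img_1023_neg.png",
    perspective="alignment",
    subcategory="count",
)
```

A judge is then scored on whether it ranks `preferred_image` above `rejected_image` for the given instruction.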

MJ-Bench-Video

MJ-Bench-Video contains 5,421 curated text–video pairs (comprising 10,842 videos), extracted and filtered from three sources: human-preference video annotation (Safesora), image-to-video conversion (I2V, including MJ-Bench pairs), and direct text-to-video generation (T2V). Each video pair is scored across 28 criteria, grouped under Alignment (object, attribute, action, count, location), Safety (crime, shocking, disgust, NSFW evasive/subtle, political sensitivity), Fineness (human/object distortion, blur), Coherence & Consistency (spatiotemporal metrics), and Bias & Fairness (demographic aspects) (Tong et al., 3 Feb 2025).

| MJ-Bench Variant | Modality | Granular Criteria | Groups/Aspects |
|---|---|---|---|
| MiJaBench | Language | Jailbreak prompts | 16 minorities, 21 scenarios |
| MJ-Bench | Text-to-Image | 4 perspectives with subcategories | Alignment, Safety, Quality, Bias |
| MJ-Bench-Video | Text-to-Video | 28 criteria across 5 aspects | Alignment, Safety, Fineness, Coherence, Bias |

3. Evaluation Protocols

MiJaBench

Evaluation consists of generating 528,000 prompt–response pairs across 12 LLM variants (Llama-3, Gemma-3, Qwen-3; Nano to Large). The central metric is "Defense Rate":

\text{DefenseRate}_{g,m} = 1 - \frac{\text{Successful Jailbreaks}_{g,m}}{|P_g|}

Disparity is measured per model as the deviation Δ of each group's Defense Rate from the model mean. The maximum range (up to 33 percentage points) and the standard deviation \sigma_{DR} are reported; bootstrap analysis confirmed metric stability.
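The Defense Rate and its per-model disparity statistics follow directly from the definition above. A minimal sketch, using hypothetical per-group jailbreak counts rather than actual MiJaBench results:

```python
from statistics import pstdev

def defense_rate(successful_jailbreaks: int, n_prompts: int) -> float:
    """DefenseRate_{g,m} = 1 - successful jailbreaks / |P_g|."""
    return 1.0 - successful_jailbreaks / n_prompts

def disparity(rates_by_group: dict[str, float]) -> tuple[float, float]:
    """Return (max range, standard deviation) of per-group Defense Rates."""
    rates = list(rates_by_group.values())
    return max(rates) - min(rates), pstdev(rates)

# Hypothetical jailbreak counts out of 2,000 prompts per group
counts = {"group_a": 120, "group_b": 460, "group_c": 300}
rates = {g: defense_rate(c, 2000) for g, c in counts.items()}
rng, sigma = disparity(rates)  # range and sigma_DR across groups
```

A large `rng` or `sigma` for a single model flags group-specific safety gaps even when the average Defense Rate looks strong.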

MJ-Bench & Video

Both MJ-Bench and MJ-Bench-Video deploy preference accuracy: judges win if they prefer the ground-truth positive over negative in a pair. For bias tasks, three measures are defined:

  • Pairwise Accuracy (ACC)
  • Gini-based Equality Score: GES = 1 - G, where G = \frac{\sum_{i,j}|s_i - s_j|}{2n^2\mu}
  • Normalized Dispersion Score: NDS = 1 - \frac{\sigma}{\mu}, where \sigma = \sqrt{\frac{1}{n}\sum_i (s_i - \mu)^2}
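The GES and NDS bias measures translate directly into code. A minimal sketch, assuming the inputs are positive per-subgroup reward scores:

```python
import math

def ges(scores: list[float]) -> float:
    """Gini-based Equality Score: GES = 1 - G, with G the Gini coefficient."""
    n, mu = len(scores), sum(scores) / len(scores)
    g = sum(abs(si - sj) for si in scores for sj in scores) / (2 * n**2 * mu)
    return 1.0 - g

def nds(scores: list[float]) -> float:
    """Normalized Dispersion Score: NDS = 1 - sigma / mu."""
    n, mu = len(scores), sum(scores) / len(scores)
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return 1.0 - sigma / mu

# Perfectly uniform scores across demographic subgroups give GES = NDS = 1
uniform = [0.8, 0.8, 0.8, 0.8]
skewed = [1.0, 0.5]
```

Both measures equal 1 only when every subgroup receives the same score; any dispersion pulls them below 1.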

Annotation is both automated (LLM filtering) and human-expert-reviewed. Feedback is collected on numeric and Likert scales; input-order consistency is systematically validated (notably for open-source VLMs).

4. Key Findings

MiJaBench

Major findings include the identification of a demographic safety hierarchy: Defense Rate for the same LLM can vary by up to 33 percentage points between groups (e.g., Black +15 points above the mean, Mental Disability –11 points). Larger models show improved global DR but increased disparity (\frac{d\,\sigma(\{\text{DefenseRate}_g\})}{d\,\text{Params}} > 0), contradicting the ideal of fairness scaling. Strategy ablations reveal that logical rationalization and chain-of-thought jailbreaks sharply degrade DR, exposing cognitive vulnerabilities.

MJ-Bench

Closed-source VLMs (GPT-4o, Gemini Ultra) outperform others on overall feedback—GPT-4o achieves 100% safety, 98.7% quality, 82.5% bias NDS. Scoring models excel on alignment and image quality. Reasoning capabilities boost performance on nuanced safety and bias categories. Models are sensitive to scale and input order, with Likert scales improving consistency. Human evaluations confirm the superiority of closed-source feedback for fine-tuned generators.

MJ-Bench-Video

MJ-VIDEO, a Mixture-of-Experts reward model trained on MJ-Bench-Video, achieves strict aspect preference accuracy of 68.75% (vs. 58.47% for VideoScore, a 17.58% relative gain). Ablation demonstrates that MoE layers are necessary for both fine-grained and aggregate judgments. Integration into RLHF pipelines (e.g., VADER + VideoCrafter2) consistently enhances video generation quality and alignment as measured by both human and automated metrics.

5. Implications for Alignment, Scaling, and Fairness

MJ-Bench results indicate that contemporary alignment protocols do not instantiate a principle of non-discrimination but instead reinforce refusal triggers learned during fine-tuning, particularly for “head” groups with higher social visibility. Model scaling amplifies group disparity rather than mitigating it, evidenced by increasing \sigma_{DR} with model parameter count. This challenges the expectation that larger-capacity models generalize alignment in a way that guarantees demographic parity.

A theoretical implication is the necessity to incorporate demographic parity objectives into alignment pipelines: achieving uniform Defense Rate across all groups rather than simply maximizing average safety or preference score. The benchmarks further highlight the role of reasoning and multi-input holistic analysis in reliably handling safety and bias tasks.
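One way such a parity objective could be operationalized is to penalize the dispersion of per-group Defense Rates alongside their mean. The sketch below is a hypothetical illustration of this idea, not a training objective published with MJ-Bench:

```python
from statistics import mean, pvariance

def parity_regularized_objective(rates_by_group: dict[str, float],
                                 lam: float = 1.0) -> float:
    """Score to maximize: mean Defense Rate minus lambda times the
    variance of per-group Defense Rates (penalizes group disparity)."""
    rates = list(rates_by_group.values())
    return mean(rates) - lam * pvariance(rates)
```

With `lam = 0` this reduces to maximizing average safety; increasing `lam` trades some average Defense Rate for uniformity across groups.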

6. Resources, Reproducibility, and Extensions

Datasets, evaluation scripts, and reference implementations for all MJ-Bench variants are publicly released—MiJaBench at https://github.com/iagoalvesb/mijabench, MJ-Bench at https://huggingface.co/MJ-Bench, and MJ-Bench-Video at https://aiming-lab.github.io/MJ-VIDEO.github.io/. Components include seed and scenario definitions, adversarial rewriting templates, annotation protocols, judge prompts, and evaluation routines. The MJ-Bench suite supports extensibility to new languages, modalities, intersectional categories, and expanded taxonomies for alignment research.

Dual-use risks are acknowledged; datasets are intended for controlled red-teaming and principled alignment studies.

7. Future Directions

Potential extensions suggested by current MJ-Bench research include expanding to additional modalities (e.g., audio, motion), hierarchical or scenario-conditioned reward architectures, and the use of continuous or self-supervised preference signals to mitigate annotation burdens. Further research is needed to formalize and optimize for demographic fairness and to expand interpretable multiaspect reward modeling for holistic safety and bias auditing.


