- The paper introduces the CROSS benchmark, comprising 1,284 multilingual, visually grounded queries from 16 countries to assess cultural norm violations in LVLMs.
- The paper presents the μCROSS framework with four dimensions—awareness, education, compliance, and helpfulness—evaluated using tailored GPT-4o prompts validated against human judgments.
- The paper explores Safety-SFT and Safety-DPO alignment strategies, showing that targeted fine-tuning enhances cultural compliance while maintaining general multimodal capabilities.
This paper, "Multimodal Cultural Safety: Evaluation Frameworks and Alignment Strategies" (2505.14972), addresses the critical need for large vision-language models (LVLMs) to operate safely and appropriately across diverse cultural contexts. While existing safety benchmarks focus primarily on physical harm, the authors highlight the importance of evaluating and mitigating symbolic harm, which arises from violating cultural norms, especially when visual information is misinterpreted.
The paper introduces CROSS (Cultural Reasoning Over multimodal Scenes for Safety evaluation), a new benchmark designed to assess LVLMs' cultural safety reasoning. CROSS contains 1,284 multilingual, visually grounded queries from 16 countries across three everyday domains: shopping, meal planning, and outdoor activities. A key characteristic of CROSS instances is that the cultural norm violation only becomes apparent when the image and the query are interpreted together. For example, recommending a clock as a gift for a baby's birthday in China is visually suitable but violates a cultural norm associating clocks with death. The benchmark builds on existing text-only cultural datasets (SafeWorld and CASA), pairing each norm with a relevant image and a carefully rewritten, context-dependent query, and it supports 14 languages.
To evaluate models on CROSS, the authors propose μCROSS, an evaluation framework grounded in intercultural theory, with four key dimensions:
- Awareness: Does the model recognize culturally specific norms from the text and image?
- Education: Does the model explain the meaning or rationale behind the cultural norm?
- Compliance: Does the model respect symbolic meanings and adhere to culturally appropriate norms?
- Helpfulness: Does the model offer respectful, practical, and culturally aware advice?
These dimensions, inspired by the Intercultural Sensitivity Scale, are automatically evaluated using GPT-4o with tailored prompts, and the automatic evaluation is validated against human judgments.
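As a rough sketch of how such a four-dimension evaluation can be aggregated (not the paper's actual judge prompts; in the paper the per-query verdicts come from GPT-4o), the pipeline reduces to collecting a pass/fail judgment per dimension for each query and reporting dimension-level percentages. The field names below are illustrative assumptions:

```python
# Sketch: aggregating per-query judge verdicts into dimension-level scores.
# In the paper the verdicts come from GPT-4o with tailored prompts; here they
# are hypothetical precomputed booleans, and field names are illustrative.
DIMENSIONS = ("awareness", "education", "compliance", "helpfulness")

def aggregate_scores(verdicts):
    """verdicts: list of dicts mapping each dimension to True/False.
    Returns the percentage of queries passing each dimension."""
    totals = {d: 0 for d in DIMENSIONS}
    for v in verdicts:
        for d in DIMENSIONS:
            totals[d] += bool(v[d])
    n = len(verdicts)
    return {d: 100.0 * totals[d] / n for d in DIMENSIONS}

# Hypothetical verdicts for three queries.
example = [
    {"awareness": True, "education": True, "compliance": False, "helpfulness": True},
    {"awareness": True, "education": False, "compliance": False, "helpfulness": True},
    {"awareness": False, "education": False, "compliance": False, "helpfulness": True},
]
scores = aggregate_scores(example)
```

Reporting each dimension separately (rather than a single aggregate) is what lets the paper observe, for instance, that awareness can be much higher than compliance for the same model.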
The paper evaluates 21 leading LVLMs (both open-source and closed-source) on the CROSS benchmark using the μCROSS framework. The results reveal significant cultural safety gaps. Even the best-performing model, Gemini-2.5-Pro, achieves only 61.79% in cultural awareness and 37.73% in compliance on the English dataset. Open-source models generally perform worse than proprietary models, although some open-source models (like the Llama-4 series) can reach GPT-4o's performance level on English data. The evaluation also shows that increasing reasoning capacity offers limited improvement in cultural alignment and that performance often drops significantly when models are evaluated in localized languages compared to English. A country-level analysis highlights varying performance across different regions and suggests a strong correlation between awareness and compliance.
To address the observed performance gaps, the paper explores two alignment strategies:
- Safety-focused Supervised Fine-Tuning (Safety-SFT): This method involves converting culturally grounded multiple-choice questions from the CVQA dataset into safety-relevant, open-ended scenarios and fine-tuning models (specifically GPT-4o via its text-only API) on pairs of these scenarios and culturally safe responses. This approach yielded substantial improvements in cultural safety performance (+37% to +60% awareness and compliance) but resulted in moderate performance drops (around 2-5%) on general multimodal understanding benchmarks like MMMU and MME. The effectiveness of this SFT was found to be contingent on the training data covering the same countries as the evaluation data, highlighting a limitation in generalizability without broad cultural coverage.
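A minimal sketch of the conversion step described above, under assumed data shapes (all field names and templates here are hypothetical, not the paper's exact pipeline): a culturally grounded multiple-choice item becomes an open-ended scenario paired with a culturally safe target response for supervised fine-tuning.

```python
# Sketch (assumed schema, not the paper's exact code): turning a culturally
# grounded multiple-choice item into a (prompt, completion) SFT pair.
def mcq_to_sft_pair(item):
    """item: dict with hypothetical keys 'question', 'options',
    'answer_idx', and 'country'."""
    scenario = (
        f"A visitor in {item['country']} asks: {item['question']} "
        "What would be culturally appropriate advice?"
    )
    safe_answer = item["options"][item["answer_idx"]]
    completion = (
        f"Considering norms in {item['country']}, the culturally informed "
        f"answer is: {safe_answer}. This respects local symbolic meanings "
        "and avoids causing offense."
    )
    # SFT APIs typically consume prompt/completion pairs in this shape.
    return {"prompt": scenario, "completion": completion}

item = {
    "question": "What should you avoid giving as a gift at a baby's birthday?",
    "options": ["a clock", "a toy", "a book"],
    "answer_idx": 0,
    "country": "China",
}
pair = mcq_to_sft_pair(item)
```

The generalizability caveat noted above follows naturally from this construction: the pairs only encode norms for the countries present in the source items.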
- Dimension-Aware Preference Tuning (Safety-DPO): This strategy leverages contrastive response pairs derived from the same converted CVQA scenarios. For each scenario, a culturally safe response is paired with a culturally unsafe response, where the unsafe responses are specifically designed to fail in one or more of the four μCROSS dimensions. DPO is performed on GPT-4o using these pairs via the text-only API. This method also improved cultural safety (gains of 3% to 28%) with minimal impact (less than a 2% drop) on general multimodal understanding benchmarks. An ablation study on the types of negative responses confirmed that mixing negative examples targeting different dimensions was most effective for balanced improvements.
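The contrastive-pair construction can be sketched as follows, under assumed data shapes (the helper and record fields are hypothetical, not the paper's code): each chosen (safe) response is contrasted with negatives that each violate a specific μCROSS dimension, and the ablation mixes these failure types.

```python
# Sketch (assumed shapes, not the paper's exact code): building DPO
# preference records where each rejected response fails one dimension.
DIMENSIONS = ("awareness", "education", "compliance", "helpfulness")

def build_dpo_pairs(scenario, safe_response, unsafe_by_dimension):
    """unsafe_by_dimension: dict mapping a dimension name to a response
    crafted to fail that dimension. Returns one (prompt, chosen, rejected)
    record per negative, so failure types can be mixed across the dataset."""
    pairs = []
    for dim, bad in unsafe_by_dimension.items():
        assert dim in DIMENSIONS, f"unknown dimension: {dim}"
        pairs.append({
            "prompt": scenario,
            "chosen": safe_response,
            "rejected": bad,
            "violated_dimension": dim,  # bookkeeping for ablations
        })
    return pairs

pairs = build_dpo_pairs(
    "A friend suggests giving a clock at a baby's birthday in China. Advice?",
    "Avoid the clock: in Chinese, 'giving a clock' evokes a funeral rite, "
    "so choose a different gift such as a toy or a book.",
    {
        "compliance": "A clock is a fine, practical gift - go ahead.",
        "awareness": "There is nothing culturally specific about this situation.",
    },
)
```

Tagging each record with the dimension its negative violates is what makes the per-dimension ablation described above straightforward to run.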
Applying these alignment methods to open-source models like InternVL2.5 showed only minimal improvements, suggesting that a lack of fundamental cultural grounding during pretraining limits the effectiveness of subsequent alignment efforts.
In conclusion, the paper establishes a definition and evaluation framework for multimodal cultural safety, demonstrates significant weaknesses in current leading LVLMs, and proposes practical alignment strategies (SFT and DPO) using automatically generated, culturally grounded data. While SFT yields larger safety gains, DPO appears more promising for maintaining general capabilities, though both approaches require sufficient cultural knowledge in the base model to be effective. The work emphasizes the urgent need for culturally informed evaluation and alignment to deploy trustworthy LVLMs globally. Data and code for the CROSS benchmark are available at https://github.com/haoyiq114/CROSS.