ArtEmoBenchmark: Emotion in Art Evaluation

Updated 12 June 2026

ArtEmoBenchmark is a comprehensive benchmark suite designed for evaluating AI systems in interpreting and generating emotions in art, offering standardized tasks and metrics.
It leverages large-scale datasets like EmoArt and multimodal test sets to assess both text-to-image generation and audio-visual emotion comprehension with high annotation quality.
The suite supports reproducibility via detailed protocols and extensions to multiple modalities, fostering advancements in emotion analysis and art-driven AI research.

ArtEmoBenchmark is a comprehensive benchmarking suite for evaluating AI systems on the understanding and generation of emotion in artwork across multiple modalities, genres, and input types. It provides standardized tasks, metrics, and high-fidelity annotations, enabling rigorous assessment of models in emotion-driven art generation, perception, and cross-modal affect comprehension. ArtEmoBenchmark draws from large-scale, multidimensional datasets and defines protocols for both text-to-image diffusion models and multimodal LLMs, with particular emphasis on the complexity of emotion as conveyed through visual and auditory elements in artistic contexts.

1. Dataset Composition and Scope

ArtEmoBenchmark is grounded in two core resources: the EmoArt dataset and the ArtEmoBenchmark test set constructed for audio-visual language models (AVLMs).

EmoArt Dataset

Size and Diversity: 132,664 paintings spanning 56 distinct painting styles, including Impressionism, Surrealism, Ukiyo-e, and Modernism, ensuring comprehensive stylistic and cultural coverage.
Annotations:
- Detailed scene descriptions (average 35.6 words per entry).
- Five structured visual attributes: brushwork, composition, color, line, and light.
- Binary Arousal-Valence labeling (High/Low Arousal × Positive/Negative Valence).
- Twelve discrete emotion categories distributed to evenly cover the Russell circumplex: aroused, excited, happy, alarmed, annoyed, frustrated, sad, bored, tired, content, calm, glad.
- Art-therapy potential descriptors (e.g., calming, uplifting).
Annotation Quality: Annotations were generated through a GPT-4o-assisted, multi-stage pipeline with human verification over 5,600 image samples. High inter-annotator agreement is reported: 85.25% for emotion (Gwet’s AC1 = 0.785) and 95.25% for visual attributes.
Distributional Properties: 87.93% of the corpus has positive valence; low-arousal emotions dominate. Style distribution ensures ≥400 images per painting genre.
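
The agreement figures above pair raw percent agreement with Gwet's AC1, a chance-corrected coefficient. The following is a minimal sketch of AC1 for the two-rater case. The source does not state how many raters or which label set entered its computation, so the function name, the optional `categories` argument, and the toy labels are illustrative assumptions only.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters assigning nominal labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Assumption: if no label set is given, use the labels actually observed.
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        return 1.0  # a single category leaves no room for disagreement
    # Observed agreement: fraction of items with identical labels.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # pi_k: mean of the two raters' marginal proportions for category k.
    pi = [(count_a[k] + count_b[k]) / (2 * n) for k in categories]
    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy example (not EmoArt data): 3 of 4 items agree.
a = ["calm", "calm", "sad", "sad"]
b = ["calm", "calm", "sad", "calm"]
print(round(gwet_ac1(a, b), 3))  # 0.529
```

Here the raw agreement is 75%, but AC1 is lower because part of that agreement is attributed to chance.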

ArtEmoBenchmark (Audio-Visual)

Source: 1,200 expertly curated 30 s movie clips from 1,432 films, each with emotionally motivated music.
MCQ Construction: The benchmark comprises 1,200 multiple-choice questions (MCQs), balanced across three input modalities: audio-only, visual-only, and audio-visual joint, with manual post-processing and verification.
Emotion Model: Continuous Valence-Arousal (V–A) labeling, with scores in $[-1, +1]$ per axis, annotated independently by five experts per condition (audio-only, visual-only, audio-visual), then averaged.
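
As a rough illustration of the averaging step described above, the sketch below collapses five expert (valence, arousal) pairs into one label per condition. The function name, the dictionary layout, and all scores are hypothetical; the source specifies only the $[-1, +1]$ range, the five experts, and the averaging.

```python
from statistics import mean


def average_va(expert_scores):
    """Average per-expert (valence, arousal) pairs into a single V-A label."""
    for valence, arousal in expert_scores:
        if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
            raise ValueError("valence and arousal must lie in [-1, +1]")
    return (
        mean(v for v, _ in expert_scores),
        mean(a for _, a in expert_scores),
    )


# Hypothetical scores from five experts for one clip, per condition.
clip_annotations = {
    "audio": [(0.4, 0.6), (0.5, 0.5), (0.3, 0.7), (0.4, 0.6), (0.4, 0.6)],
    "visual": [(-0.2, 0.1), (-0.1, 0.0), (-0.3, 0.2), (-0.2, 0.1), (-0.2, 0.1)],
    "audio_visual": [(0.1, 0.4), (0.2, 0.3), (0.0, 0.5), (0.1, 0.4), (0.1, 0.4)],
}

for condition, scores in clip_annotations.items():
    valence, arousal = average_va(scores)
    print(f"{condition}: valence={valence:.2f}, arousal={arousal:.2f}")
```

Keeping the three conditions separate preserves the cases where audio and visuals pull the perceived emotion in different directions.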

2. Benchmark Tasks and Modalities

ArtEmoBenchmark specifies two main classes of tasks:

A. Image Generation Tasks (Text-to-Image)

Emotion-Conditional Generation: Given a specific painting style $s$ and an emotion $e$ (from the discrete set), generate an image $\hat{I} = G_A(s, e)$ that optimally conveys the target emotion in the designated style.
Full Text-to-Image Generation: Conditioned on $(s, v, a, d)$, where $v$ is valence, $a$ is arousal, and $d$ is a content-rich scene description, the task is to generate $\hat{I} = G_B(s, v, a, d)$. The formal objective is:

$L(G) = \lambda_\text{attr}\,L_\text{attr}(G(x)) + \lambda_\text{img}\,L_\text{img}(G(x))$

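where $L_\text{attr}$ is a CLIP-based embedding cosine distance, $L_\text{img}$ is the Fréchet Inception Distance (FID), $\lambda_\text{attr}$ and $\lambda_\text{img}$ are the corresponding weights, and $x$ denotes the conditioning input, e.g. $(s, v, a, d)$.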
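
The weighted objective can be sketched numerically as follows. This is only an illustration under stated assumptions: the embeddings are plain Python lists standing in for CLIP vectors, FID is passed in as a precomputed set-level scalar, and the weights are placeholders because the source does not give values for $\lambda_\text{attr}$ or $\lambda_\text{img}$.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def combined_objective(attr_embedding, image_embedding, fid,
                       lam_attr=1.0, lam_img=1.0):
    """L(G) = lam_attr * L_attr + lam_img * L_img, with L_img given as FID."""
    l_attr = cosine_distance(attr_embedding, image_embedding)
    return lam_attr * l_attr + lam_img * fid


# Hypothetical 3-d stand-ins for CLIP embeddings and an assumed FID value.
text_vec = [0.2, 0.9, 0.1]
image_vec = [0.3, 0.8, 0.2]
print(combined_objective(text_vec, image_vec, fid=21.29, lam_attr=1.0, lam_img=0.01))
```

Because FID is typically orders of magnitude larger than a cosine distance, the two weights mainly serve to bring the terms onto a comparable scale.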

B. Multimodal Emotion Understanding Tasks

Audio-Only, Visual-Only, Audio-Visual Classification: On the 1,200 MCQs from ArtEmoBenchmark, models must answer questions (overall/specific, content/emotion) restricted to their assigned modality or combination thereof.
Types of Questions:
- OC/OE: Overall content and emotion (per modality).
- SC/SE: Specific content and specific emotion (per modality).
- Sp-A, Sp-V, Sp-AV: Audio-centric, video-centric, and cross-modal questions (for audiovisual tasks).
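
One way to organize the MCQ items so that accuracy can be broken down by modality or question type is sketched below. The record layout and every field name are hypothetical; the source does not publish a schema for the test set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    clip_id: str
    modality: str        # "A", "V", or "AV"
    question_type: str   # e.g. "OC", "OE", "SC", "SE", "Sp-A", "Sp-V", "Sp-AV"
    options: tuple       # answer choices shown to the model
    answer_index: int    # index of the correct option


def accuracy_by(items, predictions, key):
    """Proportion correct, grouped by an MCQItem attribute such as 'modality'."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, predicted_index in zip(items, predictions):
        group = getattr(item, key)
        total[group] += 1
        correct[group] += int(predicted_index == item.answer_index)
    return {group: correct[group] / total[group] for group in total}


# Toy items and predictions, not drawn from the benchmark.
items = [
    MCQItem("clip_001", "A", "OE", ("joyful", "tense", "calm", "sad"), 1),
    MCQItem("clip_001", "V", "OE", ("joyful", "tense", "calm", "sad"), 1),
    MCQItem("clip_002", "AV", "Sp-AV", ("joyful", "tense", "calm", "sad"), 2),
]
predictions = [1, 0, 2]

print(accuracy_by(items, predictions, "modality"))
print(accuracy_by(items, predictions, "question_type"))
```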

3. Evaluation Protocols and Metrics

Quantitative Metrics

Discrete Emotion Classification:
- Accuracy: $\text{Acc} = \frac{1}{N} \sum_{i = 1}^{N} 1\{y_i = \hat{y}_i\}$
- F1-score (per emotion), Precision, Recall.
Continuous V–A Correlation: Pearson’s correlation coefficient $r$ between annotated and predicted arousal/valence vectors.
CLIP Alignment Score: Cosine similarity between image embedding and text prompt embedding.
Fréchet Inception Distance (FID): For distributional similarity between generated and real artwork.
Pixel-Based Metrics: SSIM, PSNR, LPIPS (standard image quality definitions).
Attributes Alignment: Fine-tune a classifier on visual attributes; score as CLIP similarity between predicted attribute text and generated images.
Audio-Visual MCQ Accuracy: Simple proportion correct per subtask and overall mean.
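
The classification and correlation metrics listed above can be sketched in plain Python as follows. The function names and toy labels are illustrative; the benchmark's own evaluation scripts may differ in details such as averaging across classes or handling of undefined values.

```python
import math


def accuracy(y_true, y_pred):
    """Acc = (1/N) * sum over i of 1{y_i == y_hat_i}."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, label):
    """Per-emotion precision, recall, and F1 (0.0 where undefined)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy labels and scores, not benchmark data.
y_true = ["sad", "calm", "happy", "calm"]
y_pred = ["sad", "calm", "calm", "calm"]
print(accuracy(y_true, y_pred))                     # 0.75
print(precision_recall_f1(y_true, y_pred, "calm"))
print(pearson_r([0.1, 0.4, -0.3], [0.2, 0.5, -0.1]))
```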

Annotation Protocol and Agreement

High consistency is ensured by multiple human annotators, particularly for continuous V–A labeling, with all MCQs manually checked for correctness and clarity. In image tasks, attributes and emotions are verified on random samples with Gwet’s AC1 and percent agreement reported.

4. Baseline Models and Comparative Results

Diffusion Models (Art Generation)

Model	Overall Attribute Alignment	FID ↓	LPIPS ↓	SSIM ↑	PSNR ↑
FLUX 1-dev	0.6392	21.29	0.6706	0.2108	9.57
PixArt-Σ	0.6505	36.23	0.6754	0.1658	8.99
SDXL	0.6385	61.93	0.7110	0.1677	9.13
FLUX 1-dev-finetuned	0.6604	31.65	0.6508	0.2102	9.66

Removal of emotion labels during fine-tuning reduces Attributes Alignment by 11% and increases FID by 4.3; exclusion of “Art Therapy” fields does not degrade standard metrics but does lower subjective emotional intensity.

Audio, Visual, and Audio-Visual LLMs

Model	Audio (A)	Visual (V)	Audio-Visual (AV)	Overall (All)
Qwen-Audio	52.5%	–	–	–
Qwen2.5-VL	–	72.3%	–	–
VideoLLaMA2	60.8%	81.3%	36.5%	59.5%
AffectGPT	51.5%	60.8%	42.0%	51.4%
VAEmotionLLM	77.0%	87.8%	53.5%	72.8%
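
For the three models with scores in all modalities, the Overall column is numerically consistent with an unweighted mean of the A, V, and AV accuracies. The short check below reproduces that arithmetic from the table values; it is an observation about the reported numbers, not a statement of how the authors computed the column.

```python
# (audio, visual, audio-visual, reported overall), copied from the table above.
results = {
    "VideoLLaMA2": (60.8, 81.3, 36.5, 59.5),
    "AffectGPT": (51.5, 60.8, 42.0, 51.4),
    "VAEmotionLLM": (77.0, 87.8, 53.5, 72.8),
}


def overall_mean(audio, visual, audio_visual):
    """Unweighted mean of the three per-modality accuracies (in percent)."""
    return (audio + visual + audio_visual) / 3


for model, (a, v, av, reported) in results.items():
    computed = overall_mean(a, v, av)
    print(f"{model}: computed {computed:.1f}, reported {reported}")
```

If the 1,200 MCQs are split evenly across the three modalities, this unweighted mean coincides with the pooled proportion correct.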

Ablations confirm that integrated adapters and emotion modules in VAEmotionLLM provide complementary gains: Audio Adapter alone gives no gain; LoRA or single modules give moderate increases; the full VAEmotionLLM achieves the highest accuracy on every modality (Zhang et al., 15 Nov 2025).

5. Methodological Features and Extensions

Reproducibility

Detailed protocols facilitate reproducibility:

EmoArt is available at https://huggingface.co/datasets/printblue/EmoArt-130k under CC BY-NC 4.0.
Preprocessing: standardize all images to 512×512 with provided JSON annotations.
Training: LoRA fine-tuning on FLUX 1-dev (learning rate 1e-4, batch size 4, 5 epochs on 2,800 images).
Inference: CFG scale 7.5, 50 DDIM steps, fixed seed, template-based prompts.
Evaluation scripts for all metrics are provided in the repository.
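
The stated settings can be collected into a small configuration object, which also makes the implied training length explicit. This is a sketch only: the class and field names are hypothetical, the step count assumes no gradient accumulation, and the seed value is an arbitrary placeholder because the source says only that the seed is fixed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    base_model: str = "FLUX 1-dev"
    learning_rate: float = 1e-4
    batch_size: int = 4
    epochs: int = 5
    train_images: int = 2800
    image_size: int = 512  # images standardized to 512x512

    @property
    def steps_per_epoch(self) -> int:
        # Assumes one optimizer step per batch and no dropped last batch.
        return math.ceil(self.train_images / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


@dataclass(frozen=True)
class InferenceConfig:
    seed: int               # must be supplied; the source does not state a value
    cfg_scale: float = 7.5
    ddim_steps: int = 50


train_cfg = FineTuneConfig()
infer_cfg = InferenceConfig(seed=1234)  # placeholder seed, not from the source

print(train_cfg.steps_per_epoch)  # 700
print(train_cfg.total_steps)      # 3500
print(infer_cfg)
```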

Extensions

Modalities: Benchmark is extensible to video and 3D by mapping arousal–valence dimensions to temporal or volumetric controls.
Annotation Ablations: Researchers can disable individual attribute fields or emotion dimensions to analyze their impact on generation.
Emotion Classification: Zero-shot classifiers trained on EmoArt descriptions can be evaluated on generated samples for assessment of transferability.
New Model Adaptation: Prompt templates and LoRA recipes are compatible with new diffusion architectures (e.g., Imagen, DALL·E).
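
An annotation ablation of the kind described above could be run by switching fields on and off when the prompt is assembled. The template wording, the field names, and the example annotation below are all hypothetical; the benchmark's actual prompt templates are not reproduced in this article.

```python
# Optional fields and the (assumed) phrase each contributes to the prompt.
OPTIONAL_FIELDS = (
    ("emotion", "conveying a {} emotion"),
    ("description", "depicting {}"),
    ("brushwork", "brushwork: {}"),
    ("composition", "composition: {}"),
    ("color", "color: {}"),
    ("line", "line: {}"),
    ("light", "light: {}"),
)


def build_prompt(annotation, disabled_fields=()):
    """Compose a text prompt from an annotation dict, skipping disabled fields."""
    parts = [f"A painting in the style of {annotation['style']}"]
    for field, template in OPTIONAL_FIELDS:
        if field in disabled_fields or field not in annotation:
            continue
        parts.append(template.format(annotation[field]))
    return ", ".join(parts)


# Illustrative annotation, not an EmoArt record.
annotation = {
    "style": "Impressionism",
    "emotion": "calm",
    "description": "a quiet harbor at dawn",
    "color": "soft blues and pale gold",
}

print(build_prompt(annotation))
print(build_prompt(annotation, disabled_fields=("emotion",)))
```

Generating with both prompts under the same seed isolates the contribution of the disabled field to the resulting metrics.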

6. Relation to Prior Benchmarks and Multimodality

ArtEmoBenchmark is situated at the convergence of art-focused sentiment datasets such as ArtEmis and ArtELingo, and general multimodal affect understanding benchmarks:

Multilingual and Cross-cultural Expansion: ArtELingo extends prior collections with multilingual (Arabic, Chinese, Spanish) and culturally divergent annotations, highlighting systematic variations in emotion perception tied to culture (Mohamed et al., 2022).
Expressive Modalities: Other datasets, such as MEMO-Bench, inform the dual-axis structure (affect generation and understanding), progressive testing from coarse labels to fine intensity, and human-in-the-loop annotation methods (Zhou et al., 2024).
3D Emotionality: Related research in 3D facial expression rendering emphasizes axes of expression diversity, fluidity, and content alignment, suggesting plausible future extensions to 3D art-driven avatars (Xu et al., 2024).

7. Limitations and Future Directions

Despite its scale and methodological rigor, ArtEmoBenchmark retains several limitations:

Cultural bias: EmoArt and ArtEmoBenchmark are predominantly Western-centric in art corpus selection; cultural nuance among non-Western genres may be underrepresented.
Annotation model: While continuous V–A annotation is a notable strength, reliance on human verification in a limited language group constrains global emotional diversity.
Modal expansion: Video, 3D, haptic modalities, and dynamic interactive art are not fully addressed.
Metric comprehensiveness: Only accuracy is reported for MCQ tasks; future efforts could incorporate F1, calibration, and perceptual alignment metrics.

A plausible implication is that expanding ArtEmoBenchmark to cover additional art forms, languages, affect taxonomies, and modalities will further advance research in comprehensive machine understanding and generation of emotion in artistic media.

References:

"EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation" (Zhang et al., 4 Jun 2025)
"Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound" (Zhang et al., 15 Nov 2025)
"MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis" (Zhou et al., 2024)
"ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture" (Mohamed et al., 2022)
"Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark" (Xu et al., 2024)
