
ArtEmoBenchmark: Emotion in Art Evaluation

Updated 12 June 2026
  • ArtEmoBenchmark is a comprehensive benchmark suite designed for evaluating AI systems in interpreting and generating emotions in art, offering standardized tasks and metrics.
  • It leverages large-scale datasets like EmoArt and multimodal test sets to assess both text-to-image generation and audio-visual emotion comprehension with high annotation quality.
  • The suite supports reproducibility via detailed protocols and extensions to multiple modalities, fostering advancements in emotion analysis and art-driven AI research.

ArtEmoBenchmark is a comprehensive benchmarking suite for evaluating AI systems on the understanding and generation of emotion in artwork across multiple modalities, genres, and input types. It provides standardized tasks, metrics, and high-fidelity annotations, enabling rigorous assessment of models in emotion-driven art generation, perception, and cross-modal affect comprehension. ArtEmoBenchmark draws from large-scale, multidimensional datasets and defines protocols for both text-to-image diffusion models and multimodal LLMs, with particular emphasis on the complexity of emotion as conveyed through visual and auditory elements in artistic contexts.

1. Dataset Composition and Scope

ArtEmoBenchmark is grounded in two core resources: the EmoArt dataset and the ArtEmoBenchmark test set, constructed for audio-visual language models (AVLMs).

EmoArt Dataset

  • Size and Diversity: 132,664 paintings spanning 56 distinct painting styles, including Impressionism, Surrealism, Ukiyo-e, and Modernism, ensuring comprehensive stylistic and cultural coverage.
  • Annotations:
    • Detailed scene descriptions (average 35.6 words per entry).
    • Five structured visual attributes: brushwork, composition, color, line, and light.
    • Binary Arousal-Valence labeling (High/Low Arousal × Positive/Negative Valence).
    • Twelve discrete emotion categories distributed to evenly cover the Russell circumplex: aroused, excited, happy, alarmed, annoyed, frustrated, sad, bored, tired, content, calm, glad.
    • Art-therapy potential descriptors (e.g., calming, uplifting).
  • Annotation Quality: Annotations generated through a GPT-4o assisted, multi-stage pipeline with human verification over 5,600 image samples. High inter-annotator agreement is reported: 85.25% for emotion (Gwet’s AC1 = 0.785), 95.25% for visual attributes.
  • Distributional Properties: 87.93% of the corpus has positive valence; low-arousal emotions dominate. Style distribution ensures ≥400 images per painting genre.
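The agreement figures above pair raw percent agreement with Gwet's AC1, a chance-corrected coefficient. As a rough illustration, the sketch below computes both for two raters using the standard two-rater AC1 formula. The source does not state how many raters were compared or which AC1 variant was used, so the function names, the two-rater setup, and the toy labels are all assumptions.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters chose the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and equal in length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def gwet_ac1(rater_a, rater_b, categories):
    """Two-rater Gwet's AC1: (p_a - p_e) / (1 - p_e)."""
    if len(categories) < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(rater_a)
    q = len(categories)
    p_a = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # pi_k: average share of category k across the two raters.
    pi = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]
    # Chance agreement under Gwet's model.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Toy example with three of the twelve emotion labels (illustrative data only).
labels = ["happy", "sad", "calm"]
a = ["happy", "sad", "calm", "happy"]
b = ["happy", "sad", "calm", "sad"]
print(percent_agreement(a, b), round(gwet_ac1(a, b, labels), 4))  # 0.75 0.6279
```

AC1 tends to stay stable when one category dominates, which may be why it is reported here given the corpus's strong skew toward positive valence.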

ArtEmoBenchmark (Audio-Visual)

  • Source: 1,200 expertly curated 30 s movie clips from 1,432 films, each with emotionally motivated music.
  • MCQ Construction: The benchmark comprises 1,200 multiple-choice questions (MCQs), balanced across three input modalities: audio-only, visual-only, and audio-visual joint, with manual post-processing and verification.
  • Emotion Model: Continuous Valence-Arousal (V–A) labeling, with scores in [-1, +1] per axis, annotated independently by five experts per condition (audio-only, visual-only, audio-visual), then averaged.
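The averaging step for the continuous V–A labels can be sketched as follows. This is a minimal illustration that assumes each expert supplies one (valence, arousal) pair per clip and condition. The function name and the range check are not part of the benchmark's published tooling.

```python
from statistics import mean

def aggregate_va(expert_scores):
    """Average (valence, arousal) pairs from several experts for one clip and condition."""
    if not expert_scores:
        raise ValueError("at least one expert score is required")
    for valence, arousal in expert_scores:
        # Scores are defined on [-1, +1] per axis.
        if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
            raise ValueError(f"score out of range: {(valence, arousal)}")
    return (mean(v for v, _ in expert_scores),
            mean(a for _, a in expert_scores))

# Five hypothetical expert ratings for one audio-visual clip.
ratings = [(0.2, 0.8), (0.4, 0.6), (0.3, 0.7), (0.1, 0.9), (0.5, 0.5)]
valence, arousal = aggregate_va(ratings)
print(round(valence, 2), round(arousal, 2))  # 0.3 0.7
```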

2. Benchmark Tasks and Modalities

ArtEmoBenchmark specifies two main classes of tasks:

A. Image Generation Tasks (Text-to-Image)

  • Emotion-Conditional Generation: Given a specific painting style $s$ and an emotion $e$ (from the discrete set), generate an image $\hat{I} = G_A(s, e)$ that optimally conveys the target emotion in the designated style.
  • Full Text-to-Image Generation: Conditioned on $(s, v, a, d)$, where $v$ is valence, $a$ is arousal, and $d$ is a content-rich scene description, the task is to generate $\hat{I} = G_B(s, v, a, d)$. The formal objective is:

$$L(G) = \lambda_\text{attr}\,L_\text{attr}(G(x)) + \lambda_\text{img}\,L_\text{img}(G(x))$$

with $L_\text{attr}$ as CLIP-based embedding cosine distance and $L_\text{img}$ as Fréchet Inception Distance (FID).
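To make the weighted objective concrete, the sketch below combines a cosine distance between two precomputed embeddings with a precomputed FID value. The embeddings here are plain lists standing in for CLIP vectors, and the weights are placeholders, since the source does not give values for the two λ terms. In practice FID is computed over sets of images rather than per sample, so this only shows how the two terms combine.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)

def combined_objective(image_emb, attr_text_emb, fid, lam_attr=1.0, lam_img=1.0):
    """L = lam_attr * L_attr + lam_img * L_img, with L_attr a cosine distance and L_img an FID."""
    return lam_attr * cosine_distance(image_emb, attr_text_emb) + lam_img * fid

# Stand-in embeddings (assumption: real ones would come from a CLIP encoder).
image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
# lam_img = 0.01 is an arbitrary choice to bring FID onto a comparable scale.
loss = combined_objective(image_emb, text_emb, fid=21.29, lam_attr=1.0, lam_img=0.01)
print(round(loss, 4))  # 1.2129
```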

B. Multimodal Emotion Understanding Tasks

  • Audio-Only, Visual-Only, Audio-Visual Classification: On the 1,200 MCQs from ArtEmoBenchmark, models must answer questions (overall/specific, content/emotion) restricted to their assigned modality or combination thereof.
  • Types of Questions:
    • OC/OE: Overall content and emotion (per modality).
    • SC/SE: Specific content and specific emotion (per modality).
    • Sp-A, Sp-V, Sp-AV: Audio-centric, video-centric, and cross-modal questions (for audiovisual tasks).

3. Evaluation Protocols and Metrics

Quantitative Metrics

  • Discrete Emotion Classification:
    • Accuracy: $\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{y_i = \hat{y}_i\}$
    • F1-score (per emotion), Precision, Recall.
  • Continuous V–A Correlation: Pearson’s $r$ between annotated and predicted arousal/valence vectors.
  • CLIP Alignment Score: Cosine similarity between image embedding and text prompt embedding.
  • Fréchet Inception Distance (FID): For distributional similarity between generated and real artwork.
  • Pixel-Based Metrics: SSIM, PSNR, LPIPS (standard image quality definitions).
  • Attributes Alignment: Fine-tune a classifier on visual attributes; score as CLIP similarity between predicted attribute text and generated images.
  • Audio-Visual MCQ Accuracy: Simple proportion correct per subtask and overall mean.
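The classification and correlation metrics listed above follow their standard definitions. A minimal pure-Python sketch is given below. The function names are illustrative and do not come from the benchmark's evaluation scripts, which the source does not show.

```python
import math

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, and F1 for one emotion label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Toy data (illustrative only).
gold = ["sad", "calm", "sad", "happy"]
pred = ["sad", "sad", "sad", "happy"]
print(accuracy(gold, pred))                    # 0.75
print(precision_recall_f1(gold, pred, "sad"))  # precision 2/3, recall 1.0, F1 about 0.8
print(pearson_r([-0.5, 0.0, 0.5], [0.5, 0.0, -0.5]))  # -1.0
```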

Annotation Protocol and Agreement

High consistency is ensured by multiple human annotators, particularly for continuous V–A labeling, with all MCQs manually checked for correctness and clarity. In image tasks, attributes and emotions are verified on random samples with Gwet’s AC1 and percent agreement reported.

4. Baseline Models and Comparative Results

Diffusion Models (Art Generation)

| Model | Overall Attribute Alignment ↑ | FID ↓ | LPIPS ↓ | SSIM ↑ | PSNR ↑ |
|---|---|---|---|---|---|
| FLUX 1-dev | 0.6392 | 21.29 | 0.6706 | 0.2108 | 9.57 |
| PixArt-σ | 0.6505 | 36.23 | 0.6754 | 0.1658 | 8.99 |
| SDXL | 0.6385 | 61.93 | 0.7110 | 0.1677 | 9.13 |
| FLUX 1-dev-finetuned | 0.6604 | 31.65 | 0.6508 | 0.2102 | 9.66 |

Removal of emotion labels during fine-tuning reduces Attributes Alignment by 11% and increases FID by 4.3; exclusion of “Art Therapy” fields does not degrade standard metrics but does lower subjective emotional intensity.

Audio, Visual, and Audio-Visual LLMs

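| Model | Audio (A) | Visual (V) | Audio-Visual (AV) | Overall (All) |
|---|---|---|---|---|
| Qwen-Audio | 52.5% | - | - | - |
| Qwen2.5-VL | - | 72.3% | - | - |
| VideoLLaMA2 | 60.8% | 81.3% | 36.5% | 59.5% |
| AffectGPT | 51.5% | 60.8% | 42.0% | 51.4% |
| VAEmotionLLM | 77.0% | 87.8% | 53.5% | 72.8% |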

Ablations confirm that integrated adapters and emotion modules in VAEmotionLLM provide complementary gains: Audio Adapter alone gives no gain; LoRA or single modules give moderate increases; the full VAEmotionLLM achieves the highest accuracy on every modality (Zhang et al., 15 Nov 2025).
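The Overall figures in the table above are consistent with an unweighted mean of the three per-modality accuracies, which matches a question set balanced across modalities. The sketch below checks this and shows one way per-modality accuracy could be computed from raw answers. The record layout of (modality, predicted, gold) is an assumption, not the benchmark's file format.

```python
# Per-modality accuracies (%) copied from the table: (audio, visual, audio-visual), then overall.
reported = {
    "VideoLLaMA2": ((60.8, 81.3, 36.5), 59.5),
    "AffectGPT": ((51.5, 60.8, 42.0), 51.4),
    "VAEmotionLLM": ((77.0, 87.8, 53.5), 72.8),
}

def macro_overall(per_modality):
    """Unweighted mean of per-modality accuracies, rounded to one decimal."""
    return round(sum(per_modality) / len(per_modality), 1)

def mcq_accuracy(records):
    """Proportion correct per modality from (modality, predicted, gold) triples."""
    totals, correct = {}, {}
    for modality, predicted, gold in records:
        totals[modality] = totals.get(modality, 0) + 1
        correct[modality] = correct.get(modality, 0) + (predicted == gold)
    return {m: correct[m] / totals[m] for m in totals}

for model, (scores, overall) in reported.items():
    print(model, macro_overall(scores), overall)

# Hypothetical answers for four questions (illustrative only).
answers = [("A", "b", "b"), ("A", "c", "a"), ("V", "d", "d"), ("AV", "a", "c")]
print(mcq_accuracy(answers))  # {'A': 0.5, 'V': 1.0, 'AV': 0.0}
```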

5. Methodological Features and Extensions

Reproducibility

Detailed protocols facilitate reproducibility:

  • EmoArt is available at https://huggingface.co/datasets/printblue/EmoArt-130k under CC BY-NC 4.0.
  • Preprocessing: standardize all images to 512×512 with provided JSON annotations.
  • Training: LoRA fine-tuning on FLUX 1-dev (learning rate 1e-4, batch size 4, 5 epochs on 2,800 images).
  • Inference: CFG scale 7.5, 50 DDIM steps, fixed seed, template-based prompts.
  • Evaluation scripts for all metrics are provided in the repository.
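The settings listed above can be collected into a small configuration object, which also makes the implied training length easy to work out. This is a sketch only: the class names, the seed value, and the prompt template are hypothetical, since the source names a "fixed seed" and "template-based prompts" without specifying either. The step count assumes no gradient accumulation.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FineTuneConfig:
    base_model: str = "FLUX 1-dev"
    method: str = "LoRA"
    learning_rate: float = 1e-4
    batch_size: int = 4
    epochs: int = 5
    num_train_images: int = 2800
    image_size: int = 512  # images standardized to 512x512

@dataclass(frozen=True)
class InferenceConfig:
    cfg_scale: float = 7.5
    num_steps: int = 50
    sampler: str = "DDIM"
    seed: int = 0  # assumption: the source says "fixed seed" but gives no value

def optimizer_steps(cfg):
    """Total optimizer steps, assuming one step per batch and no gradient accumulation."""
    return math.ceil(cfg.num_train_images / cfg.batch_size) * cfg.epochs

def build_prompt(style, emotion):
    """Hypothetical prompt template; the benchmark's actual template is not shown in the source."""
    return f"A painting in the style of {style}, conveying a {emotion} emotion."

train_cfg = FineTuneConfig()
print(optimizer_steps(train_cfg))  # 3500
print(build_prompt("Impressionism", "calm"))
```

With 2,800 images and a batch size of 4, each epoch has 700 steps, so five epochs give 3,500 steps under these assumptions.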

Extensions

  • Modalities: Benchmark is extensible to video and 3D by mapping arousal–valence dimensions to temporal or volumetric controls.
  • Annotation Ablations: Researchers can disable individual attribute fields or emotion dimensions to analyze their impact on generation.
  • Emotion Classification: Zero-shot classifiers trained on EmoArt descriptions can be evaluated on generated samples for assessment of transferability.
  • New Model Adaptation: Prompt templates and LoRA recipes are compatible with new diffusion architectures (e.g., Imagen, DALL·E).
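The annotation ablations described above amount to dropping selected fields from each record before prompts are built. A minimal sketch follows. The field names are hypothetical stand-ins for the keys in the provided JSON annotations, which the source does not list.

```python
def ablate_annotation(record, drop_fields):
    """Return a copy of an annotation record without the given fields."""
    return {key: value for key, value in record.items() if key not in set(drop_fields)}

# Hypothetical record; the real JSON keys may differ.
record = {
    "style": "Ukiyo-e",
    "description": "A wave rising over small boats.",
    "brushwork": "fine, controlled strokes",
    "color": "muted blues",
    "valence": "positive",
    "arousal": "high",
    "emotion": "excited",
    "art_therapy": "uplifting",
}

# Ablation 1: remove emotion-related labels. Ablation 2: remove the art-therapy field.
no_emotion = ablate_annotation(record, ["valence", "arousal", "emotion"])
no_therapy = ablate_annotation(record, ["art_therapy"])
print(sorted(no_emotion))
print(sorted(no_therapy))
```

Because the function returns a new dictionary, the original record stays intact and several ablation variants can be derived from the same source data.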

6. Relation to Prior Benchmarks and Multimodality

ArtEmoBenchmark is situated at the convergence of art-focused sentiment datasets such as ArtEmis and ArtELingo, and general multimodal affect understanding benchmarks:

  • Multilingual and Cross-cultural Expansion: ArtELingo extends prior collections with multilingual (Arabic, Chinese, Spanish) and culturally divergent annotations, highlighting systematic variations in emotion perception tied to culture (Mohamed et al., 2022).
  • Expressive Modalities: Other datasets, such as MEMO-Bench, inform the dual-axis structure (affect generation and understanding), progressive testing from coarse labels to fine intensity, and human-in-the-loop annotation methods (Zhou et al., 2024).
  • 3D Emotionality: Related research in 3D facial expression rendering emphasizes axes of expression diversity, fluidity, and content alignment, suggesting plausible future extensions to 3D art-driven avatars (Xu et al., 2024).

7. Limitations and Future Directions

Despite its scale and methodological rigor, ArtEmoBenchmark retains several limitations:

  • Cultural bias: EmoArt and ArtEmoBenchmark are predominantly Western-centric in art corpus selection; cultural nuance among non-Western genres may be underrepresented.
  • Annotation model: Continuous V–A annotation is a notable strength, but human verification was carried out within a limited language group, which narrows the range of emotional perspectives the labels capture.
  • Modal expansion: Video, 3D, haptic modalities, and dynamic interactive art are not fully addressed.
  • Metric comprehensiveness: Only accuracy is reported for MCQ tasks; future efforts could incorporate F1, calibration, and perceptual alignment metrics.

A plausible implication is that expanding ArtEmoBenchmark to cover additional art forms, languages, affect taxonomies, and modalities will further advance research in comprehensive machine understanding and generation of emotion in artistic media.


References:

  • "EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation" (Zhang et al., 4 Jun 2025)
  • "Learning to Hear by Seeing: It's Time for Vision LLMs to Understand Artistic Emotion from Sight and Sound" (Zhang et al., 15 Nov 2025)
  • "MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal LLMs on Human Emotion Analysis" (Zhou et al., 2024)
  • "ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture" (Mohamed et al., 2022)
  • "Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark" (Xu et al., 2024)
