
ExpressivityArena: Benchmarking Generative Expressivity

Updated 8 June 2026
  • ExpressivityArena is a formally defined evaluation framework that quantifies a model's ability to implicitly convey specified signals without explicit mention.
  • It uses a modular, pip-installable Python library to automate prompt construction, explicit-mention filtering, and ensemble grading for reliable expressivity measurement.
  • Empirical results across domains like poetry, code, and music reveal variable model performance and signal leakage issues, guiding future model enhancements.

ExpressivityArena is a formally defined, multi-domain evaluation framework and library that rigorously measures the capacity of generative models—especially LLMs—to express information implicitly. Its methodology centers on the quantification of "expressivity," defined as a model's ability to convey specified signals or facets (such as emotion, style, or domain constraint) by showing rather than telling, thereby mirroring pragmatic and contextual inference as studied in linguistics. The ExpressivityArena paradigm underpins both empirical research and practical benchmarking in the assessment of generative and co-creative AI systems across language, music, and procedural content generation domains (Tint et al., 2024, Louie et al., 2021, Withington et al., 2023, Alvarez et al., 2020).

1. Formal Definitions and Core Metric

ExpressivityArena operationalizes "expressivity" as the ability of a model $f_{\text{test}}: X \rightarrow Y$ to generate outputs $x_{\text{out}}$ that communicate a selected signal $s$ from a category $S_C = \{s_1, \dots, s_K\}$ without directly mentioning $s$. Given a prompt specifying a domain $d$ (e.g., "poem", "code") and signal $s$, the model produces $x_{\text{out}}(s)$. A blind grader $f_{\text{grader}}$ is then used to infer which signal was intended from $S_C$. The central quantitative performance indicator is the expressivity rate:

$$E = \frac{1}{KN} \sum_{k=1}^{K} \sum_{i=1}^{N} \mathbb{1}\left[ f_{\text{grader}}\left(x_{\text{out}}^{(i)}(s_k)\right) = s_k \right]$$

where $k = 1, \dots, K$ indexes the signals, $i = 1, \dots, N$ indexes the $N$ independent replications per signal, and $\mathbb{1}[\cdot]$ is the indicator function. This metric directly captures the success rate with which generated artifacts encode implicit, context-driven information (Tint et al., 2024).
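
As a small illustration, the expressivity rate described above is the fraction of graded outputs for which the grader's inferred signal matches the intended one. The sketch below assumes a hypothetical grading log of (intended, inferred) pairs; the function name and the log contents are illustrative and are not taken from the library.

```python
from typing import Iterable, Tuple


def expressivity_rate(results: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (intended, inferred) pairs where the grader recovered the signal."""
    pairs = list(results)
    if not pairs:
        raise ValueError("need at least one graded output")
    # The indicator function: 1 when the inferred signal equals the intended one.
    hits = sum(1 for intended, inferred in pairs if intended == inferred)
    return hits / len(pairs)


# Hypothetical grading log: 2 signals, 3 replications each.
graded = [
    ("joy", "joy"), ("joy", "joy"), ("joy", "sadness"),
    ("sadness", "sadness"), ("sadness", "joy"), ("sadness", "sadness"),
]
rate = expressivity_rate(graded)  # 4 correct out of 6
```

With equal replications per signal, this pooled fraction equals the average over signals of the per-signal success rates.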

ExpressivityArena deters explicit mention of $s$ through filtering and resampling. For additional rigor, the "difficulty" of a signal set is quantified as the average pairwise cosine distance among signal embeddings, which contextualizes the expected discrimination challenge across signals.
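
The difficulty index can be sketched as follows. This is a minimal version that assumes each signal already has an embedding vector; the toy two-dimensional vectors and the function names are hypothetical, and the embedding model used by the framework is not specified here. One plausible reading is that a smaller average distance means the signals lie closer together and are harder to tell apart.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of signal embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumed): two orthogonal signals and one lying between them.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
difficulty = mean_pairwise_cosine_distance(toy_embeddings)
```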

2. Framework Architecture and Workflow

ExpressivityArena is implemented as a modular, pip-installable Python library, automating the testing, grading, and reporting pipeline. The architecture comprises the following principal modules (Tint et al., 2024):

  • expressivity_arena.core: Orchestrates experiment definition, signal management, prompt construction, and execution.
  • expressivity_arena.models: Wraps API calls to supported LLMs for both $f_{\text{test}}$ and $f_{\text{grader}}$, and implements ensemble "jury" grading.
  • expressivity_arena.metrics: Calculates the expressivity rate, embedding-based difficulty, and confusion matrices.
  • expressivity_arena.utils: Handles text post-processing (explicit-mention removal) and result visualization.
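
Ensemble "jury" grading can be illustrated with a plain majority vote over several graders. The aggregation rule, the tie-break (earliest signal in the list), and the keyword-based stub graders below are all assumptions made for this sketch; the library's own jury mechanism and its function names may differ.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Grader = Callable[[str, Sequence[str]], str]


def keyword_grader(cues: Dict[str, Sequence[str]]) -> Grader:
    """Build a stub grader that picks the signal whose cue words appear most often."""
    def grade(output: str, signals: Sequence[str]) -> str:
        text = output.lower()
        return max(signals, key=lambda s: sum(text.count(c) for c in cues.get(s, ())))
    return grade


def jury_vote(output: str, signals: Sequence[str], graders: Sequence[Grader]) -> str:
    """Forced-choice majority vote; ties go to the signal listed first."""
    votes = [grader(output, signals) for grader in graders]
    if any(v not in signals for v in votes):
        raise ValueError("a grader answered outside the signal set")
    counts = Counter(votes)
    top = max(counts.values())
    return next(s for s in signals if counts.get(s, 0) == top)


signals = ["joy", "sadness"]
jury = [
    keyword_grader({"joy": ["laugh", "sun"], "sadness": ["rain", "grey"]}),
    keyword_grader({"joy": ["bright"], "sadness": ["spill"]}),
    keyword_grader({"joy": ["kitchen"], "sadness": ["cold"]}),
]
verdict = jury_vote("Sunlight spilled across the laughing kitchen.", signals, jury)
```

Here two of the three stub graders choose "joy", so the jury's verdict is "joy" even though one grader disagrees.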

The canonical workflow is as follows:

  1. Define the test and grader models, the domain $d$, and the signal set $S_C$.
  2. For each $s \in S_C$, generate prompts enforcing "show, don’t tell".
  3. Invoke $f_{\text{test}}$ for $N$ samples per $s$.
  4. An explicit-mention filter screens each output; any output that mentions $s$ explicitly is regenerated.
  5. Outputs are passed to $f_{\text{grader}}$ for forced-choice selection from $S_C$.
  6. Rates, confusion matrices, and optional difficulty indices are computed and reported.
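
The steps above can be sketched end to end with stub models. This is a self-contained illustration, not the library's API: `build_prompt`, `mentions_signal`, `run_arena`, the canned outputs, and the whole-word filter are hypothetical stand-ins, and a real explicit-mention filter would likely also catch inflected or derived forms of the signal word.

```python
import re
from itertools import cycle
from typing import Callable, Dict, Sequence, Tuple


def build_prompt(domain: str, signal: str) -> str:
    return f"Write a {domain} that conveys '{signal}' without naming it. Show, don't tell."


def mentions_signal(text: str, signal: str) -> bool:
    """Whole-word, case-insensitive check for an explicit mention of the signal."""
    return re.search(rf"\b{re.escape(signal)}\b", text, flags=re.IGNORECASE) is not None


def run_arena(test_model: Callable[[str], str],
              grader: Callable[[str, Sequence[str]], str],
              domain: str, signals: Sequence[str], n_samples: int,
              max_retries: int = 5) -> Tuple[float, Dict[str, Dict[str, int]]]:
    confusion = {s: {t: 0 for t in signals} for s in signals}
    for signal in signals:
        prompt = build_prompt(domain, signal)
        for _ in range(n_samples):
            for _attempt in range(max_retries):
                output = test_model(prompt)
                if not mentions_signal(output, signal):
                    break  # output passed the filter
            else:
                raise RuntimeError(f"no clean output for {signal!r}")
            confusion[signal][grader(output, signals)] += 1
    hits = sum(confusion[s][s] for s in signals)
    return hits / (len(signals) * n_samples), confusion


# Stub test model: replays canned outputs, one of which names the signal outright.
CANNED = {
    "joy": cycle(["Pure joy filled the room.", "Sunlight danced on the laughing water."]),
    "sadness": cycle(["Rain traced the empty window.", "Sunlight danced on the laughing water."]),
}


def stub_test_model(prompt: str) -> str:
    signal = re.search(r"'(.+?)'", prompt).group(1)
    return next(CANNED[signal])


def stub_grader(output: str, signals: Sequence[str]) -> str:
    cues = {"joy": ("sun", "laugh"), "sadness": ("rain", "empty")}
    text = output.lower()
    return max(signals, key=lambda s: sum(text.count(c) for c in cues[s]))


rate, confusion = run_arena(stub_test_model, stub_grader, "poem", ["joy", "sadness"], n_samples=2)
```

In this toy run the "joy" outputs that name the emotion are rejected and regenerated, and one "sadness" sample is misread as "joy", giving a rate of 0.75 and a single off-diagonal entry in the confusion matrix.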

ExpressivityArena supports both single-turn and multi-turn conversational scenarios, capturing expressivity drift and signal leakage over dialogue iterations.
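
Expressivity drift over a conversation can be tracked by computing the rate separately for each dialogue turn. The sketch below assumes a hypothetical log of (turn, intended, inferred) records; the record layout and function name are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expressivity_by_turn(records: Iterable[Tuple[int, str, str]]) -> Dict[int, float]:
    """Per-turn fraction of outputs whose inferred signal matches the intended one."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for turn, intended, inferred in records:
        totals[turn] += 1
        hits[turn] += int(intended == inferred)
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}


# Hypothetical multi-turn log in which the grader's accuracy decays turn by turn.
log = [
    (1, "anger", "anger"), (1, "fear", "fear"),
    (2, "anger", "anger"), (2, "fear", "anger"),
    (3, "anger", "fear"), (3, "fear", "anger"),
]
series = expressivity_by_turn(log)
```

A falling series of this kind corresponds to drift, while a rising one would be consistent with the signal leaking into the dialogue over time.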

3. Domain-Specific Methodologies and Experimental Designs

ExpressivityArena is applied in diverse domains with tailored evaluation protocols:

  • LLMs: Tasks include poetry generation (28 emotions, 34 poet styles), code generation (skill levels, paradigms), and multi-turn profession/emotion conversation. Prompts enforce non-explicit signaling; graders are leading LLMs (e.g., GPT-4o, Llama3, Gemma) or human annotators. Jury grading is employed for robustness (Tint et al., 2024).
  • Music Co-Creation ("Expressive Communication"): Composers create musical phrases (15 s) in response to prompts ("cards" with image and keywords) using variable model/interface conditions. Composer self-reports (Likert scales on expressiveness, ownership, efficacy, etc.) and blinded listener judgments (forced-choice, five-point scale) are jointly analyzed for communication effectiveness (Louie et al., 2021).
  • Procedural Content Generation (PCG): ExpressivityArena overlaps conceptually with Expressive Range Analysis (ERA), where the generator's output space is projected via pairs of uncorrelated metrics to reveal diversity and coverage (see Section 4). Dynamic, mixed-initiative tools use ExpressivityArena-like dashboards to make diversity, coverage, and fitness statistics accessible in real time (Withington et al., 2023, Alvarez et al., 2020).

4. Relationship to Expressive Range Analysis and Quality-Diversity Methods

ExpressivityArena's formalism is conceptually allied to Expressive Range Analysis (ERA) and quality-diversity algorithms such as MAP-Elites:

  • ERA: A 2D visualization technique, where each artifact is mapped via two descriptive metrics, enabling inspection of a generator's coverage in its possibility space (Withington et al., 2023). Best-practice metric selection is governed by mutual independence (Pearson correlation), distribution evenness (FI), and coverage of alternate metrics (AMC). This systematic screening counters the pitfalls of arbitrary axis selection and redundancy.
  • Interactive MAP-Elites: Implements an illumination algorithm that maintains diversity across multiple behavior dimensions, emphasizing "expressive range" as the extent to which diverse, high-quality artifacts populate the feature space. Interactive, mixed-initiative variants allow real-time feature-space exploration and immediate feedback on coverage, fitness, and uniqueness (Alvarez et al., 2020).
  • Music Model/Interface Arena: In the music communication variant, the framework enables controlled side-by-side evaluation of generative model expressivity and interface steerability, with both composer and listener perspectives rigorously quantified (Louie et al., 2021).
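
The ERA idea described above can be sketched in a few lines: check that two candidate metrics are weakly correlated, then bin each artifact into a 2D grid and inspect how much of the grid is occupied. The metric names, the toy values, and the assumption that both metrics are normalized to [0, 1] are illustrative; only the Pearson independence check is shown here.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


def expressive_range(points: Sequence[Tuple[float, float]],
                     bins: int) -> Tuple[Dict[Tuple[int, int], int], float]:
    """Histogram of artifacts over a bins x bins grid, plus the fraction of cells hit."""
    grid: Counter = Counter()
    for a, b in points:
        cell = (min(int(a * bins), bins - 1), min(int(b * bins), bins - 1))
        grid[cell] += 1
    return dict(grid), len(grid) / (bins * bins)


# Toy metric values (assumed) for four generated artifacts.
linearity = [0.1, 0.4, 0.6, 0.9]
leniency = [0.9, 0.2, 0.8, 0.1]
r = pearson(linearity, leniency)
grid, coverage = expressive_range(list(zip(linearity, leniency)), bins=2)
```

A correlation close to zero suggests the two axes carry independent information, and the coverage value summarizes how much of the plotted space the generator reaches.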

These frameworks share the methodological principle of quantifying the diversity, coverage, and informativeness of generative models' output spaces under varying constraints and user interventions.
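
For readers unfamiliar with illumination algorithms, the following is a generic, non-interactive MAP-Elites loop on a toy problem. It is a sketch under stated assumptions, not the tool described by Alvarez et al.: the two-dimensional solutions, the fitness function, the mutation scale, and all names are hypothetical choices for illustration.

```python
import random
from typing import Dict, Tuple

Solution = Tuple[float, ...]
Cell = Tuple[int, ...]


def describe(sol: Solution, bins: int) -> Cell:
    """Map a solution to its behavior cell; features are assumed to lie in [0, 1]."""
    return tuple(min(int(v * bins), bins - 1) for v in sol)


def fitness(sol: Solution) -> float:
    """Toy objective (an assumption): prefer solutions near the center of the space."""
    x, y = sol
    return -((x - 0.5) ** 2 + (y - 0.5) ** 2)


def mutate(sol: Solution, rng: random.Random) -> Solution:
    """Gaussian perturbation, clamped back into [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in sol)


def map_elites(iterations: int, n_init: int, bins: int,
               rng: random.Random) -> Dict[Cell, Tuple[float, Solution]]:
    archive: Dict[Cell, Tuple[float, Solution]] = {}
    for step in range(iterations):
        if step < n_init or not archive:
            candidate = (rng.random(), rng.random())  # random bootstrap
        else:
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        cell = describe(candidate, bins)
        score = fitness(candidate)
        # Keep only the best solution seen so far in each behavior cell.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)
    return archive


archive = map_elites(iterations=500, n_init=20, bins=4, rng=random.Random(0))
coverage = len(archive) / (4 * 4)
```

The archive's coverage and the fitness of its elites are the kind of statistics that an interactive variant would surface to the user in real time.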

5. Empirical Findings and Quantitative Results

Systematic experiments demonstrate significant variation in expressive capacity across domains, models, and task types (Tint et al., 2024):

| Domain-Task | Best expressivity rate (model) | Worst expressivity rate (model) |
|---|---|---|
| Poetry–Emotions | 0.70 (Llama2) | 0.59 (Gemma) |
| Poetry–Poets | 0.70 (GPT-4, Llama3) | 0.53 (Gemma) |
| Code–Skill-Level | 0.54 (GPT-4) | 0.31 (Gemma) |
| Code–Paradigm | 0.83 (GPT-4o) | 0.50 (Gemma) |

Additional findings include:

  • Higher expressivity is consistently observed in creative (e.g., poetry) versus technical (e.g., code) domains.
  • Code paradigm can be signaled more effectively than skill level, with substantial model variance.
  • Expressivity in conversation degrades for emotional signals over successive turns (e.g., GPT-3.5: ~0.64→0.50), whereas profession expressivity increases as the signal leaks into the dialogue over time.
  • Top automated graders (e.g., GPT-4o) match or exceed mean human accuracy (human: 0.65–0.75; GPT-4o: ~0.77–0.79), with qualitative agreement in confusion patterns.
  • Biases in grading (e.g., misclassification of female poets, semantic overlap in emotions) are evident from confusion matrices.
  • In music co-creation, both more expressive models (e.g., Music Transformer) and more steerable interfaces (chunkwise selection with semantic filtering) yield quantifiably higher listener agreement and self-reported creative empowerment (Louie et al., 2021).

6. Practical Implications, Design Principles, and Limitations

ExpressivityArena reveals uneven performance and specific limitations in generative systems:

  • Fine-tuning objectives may require targeting implicit communication directly (e.g., by penalizing over-explicitness or optimizing for human-like pragmatics).
  • Low expressivity in code generation warns that model-authored code may lack desired idiomatic or stylistic fidelity, necessitating prompt engineering or specialized training data.
  • Detected biases and confusion patterns in style (gender, semantic overlaps) underscore the need for balanced representation in both training and evaluation data.
  • In dialogue, multi-turn expressivity suffers from signal drift or leakage; tracking time-series expressivity highlights this deficit and motivates research on expressivity retention mechanisms.

Pitfalls include grader bias (LLMs may over-infer subtle cues not evident to humans), and insufficient coverage of expressive domains such as metaphor or sarcasm (which remain to be systematically benchmarked) (Tint et al., 2024).

7. Software Usage and Availability

The ExpressivityArena library is open source and accessible at https://github.com/asu-nlp/ExpressivityArena. Installation is via PyPI, and the library's minimal example covers poetry-emotion expressivity evaluation.

Comprehensive APIs support domain generalization, ensemble grading, explicit-mention detection and re-query, confusion matrix plotting, and conversational expressivity time series reporting. All evaluation results, including confusion matrices and coverage/time-series plots, are generated as part of the canonical workflow (Tint et al., 2024).


ExpressivityArena thereby delivers a generalizable, quantifiable, and extensible standard for benchmarking implicit communication in generative AI, bridging linguistics, music, language modeling, and procedural content generation with unified, reproducible protocols and metrics.
