ExpressivityArena: Benchmarking Generative Expressivity
- ExpressivityArena is a formally defined evaluation framework that quantifies a model's ability to implicitly convey specified signals without explicit mention.
- It uses a modular, pip-installable Python library to automate prompt construction, explicit-mention filtering, and ensemble grading for reliable expressivity measurement.
- Empirical results across domains like poetry, code, and music reveal variable model performance and signal leakage issues, guiding future model enhancements.
ExpressivityArena is a formally defined, multi-domain evaluation framework and library that rigorously measures the capacity of generative models—especially LLMs—to express information implicitly. Its methodology centers on the quantification of "expressivity," defined as a model's ability to convey specified signals or facets (such as emotion, style, or domain constraint) by showing rather than telling, thereby mirroring pragmatic and contextual inference as studied in linguistics. The ExpressivityArena paradigm underpins both empirical research and practical benchmarking in the assessment of generative and co-creative AI systems across language, music, and procedural content generation domains (Tint et al., 2024, Louie et al., 2021, Withington et al., 2023, Alvarez et al., 2020).
1. Formal Definitions and Core Metric
ExpressivityArena operationalizes "expressivity" as the ability of a model to generate outputs that communicate a selected signal $s$ from a category $S$ without directly mentioning $s$. Given a prompt specifying a domain $d$ (e.g., "poem", "code") and signal $s$, the test model $M_{\text{test}}$ produces an output $o$. A blind grader $M_{\text{grader}}$ is then used to infer which signal was intended from $o$. The central quantitative performance indicator is the expressivity rate:
$$E = \frac{1}{|S|\,n} \sum_{s \in S} \sum_{i=1}^{n} \mathbb{1}\left[\hat{s}_{s,i} = s\right]$$
where $o_{s,i} = M_{\text{test}}(d, s)$, $\hat{s}_{s,i} = M_{\text{grader}}(o_{s,i})$ with $n$ independent replications per signal, and $\mathbb{1}[\cdot]$ is the indicator function. This metric directly captures the success rate with which generated artifacts encode implicit, context-driven information (Tint et al., 2024).
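The expressivity rate is, in effect, the fraction of forced-choice trials in which the grader recovers the intended signal. The following is a minimal sketch of that computation, not the library's implementation: the function name `expressivity_rate`, the pair-based input format, and the example signals are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


def expressivity_rate(trials: Iterable[Tuple[str, str]]) -> float:
    """Fraction of trials in which the grader's guess equals the intended signal.

    Each trial is an (intended_signal, predicted_signal) pair.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")
    # Indicator function: 1 when the grader is right, 0 otherwise.
    hits = sum(1 for intended, predicted in trials if intended == predicted)
    return hits / len(trials)


# Hypothetical grader guesses: 2 signals x 3 replications each.
trials = [
    ("joy", "joy"), ("joy", "joy"), ("joy", "pride"),
    ("grief", "grief"), ("grief", "sadness"), ("grief", "grief"),
]
print(expressivity_rate(trials))  # 4 of the 6 guesses are correct
```

Pooling all trials in one list gives the same value as averaging per-signal rates only when every signal has the same number of replications, which is the setting the definition describes.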
ExpressivityArena deters explicit mention of the signal through filtering and resampling. For additional rigor, signal set "difficulty" is quantified using the average pairwise cosine distance among signal embeddings, which contextualizes the expected discrimination challenge across signals.
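The difficulty measure can be illustrated with a small pure-Python sketch. The embeddings below are toy two-dimensional vectors chosen by hand, and the function names are hypothetical; a real run would use embeddings from whatever embedding model the experiment adopts, which the text does not specify.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(embeddings: Dict[str, Sequence[float]]) -> float:
    """Average cosine distance over all unordered pairs of signal embeddings."""
    pairs = list(combinations(embeddings.values(), 2))
    if not pairs:
        raise ValueError("at least two signals are required")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumed for illustration only).
toy_embeddings = {
    "joy": [1.0, 0.0],
    "delight": [1.0, 0.0],   # identical direction to "joy"
    "anger": [0.0, 1.0],     # orthogonal to both
}
print(mean_pairwise_cosine_distance(toy_embeddings))
```

Under this reading, a smaller average distance would indicate signals that sit closer together in embedding space and are presumably harder for a grader to tell apart; that interpretation is an assumption rather than something the text states.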
2. Framework Architecture and Workflow
ExpressivityArena is implemented as a modular, pip-installable Python library, automating the testing, grading, and reporting pipeline. The architecture comprises the following principal modules (Tint et al., 2024):
- `expressivity_arena.core`: Orchestrates experiment definition, signal management, prompt construction, and execution.
- `expressivity_arena.models`: Wraps API calls to supported LLMs for both the test model and the grader model, and implements ensemble "jury" grading.
- `expressivity_arena.metrics`: Calculates the expressivity rate, embedding-based difficulty, and confusion matrices.
- `expressivity_arena.utils`: Handles text post-processing (explicit-mention removal) and result visualization.
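Two of the responsibilities named above, jury grading and confusion matrices, can be sketched in a few lines. This is an illustration only: the text does not say how the jury aggregates its members' guesses, so plain majority voting with a first-cast tie-break is an assumption, and `jury_vote` and `confusion_matrix` are hypothetical names, not functions from the library.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def jury_vote(votes: Sequence[str]) -> str:
    """Majority vote over grader guesses.

    Ties are broken in favour of the tied label that was cast first
    (an assumed rule, chosen only to keep the result deterministic).
    """
    if not votes:
        raise ValueError("the jury must cast at least one vote")
    counts = Counter(votes)
    best = max(counts.values())
    for vote in votes:
        if counts[vote] == best:
            return vote
    raise AssertionError("unreachable: some vote always has the top count")


def confusion_matrix(
    pairs: Iterable[Tuple[str, str]], signals: Sequence[str]
) -> Dict[str, Dict[str, int]]:
    """Count how often each intended signal (row) was graded as each signal (column)."""
    matrix = {row: {col: 0 for col in signals} for row in signals}
    for intended, predicted in pairs:
        matrix[intended][predicted] += 1
    return matrix


# Three hypothetical graders judge two outputs whose intended signal is "joy".
verdicts = [jury_vote(["joy", "pride", "joy"]), jury_vote(["pride", "pride", "joy"])]
matrix = confusion_matrix([("joy", v) for v in verdicts], ["joy", "pride"])
print(matrix["joy"])
```

Off-diagonal cells of the matrix are where systematic confusions between signals would show up.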
The canonical workflow is as follows:
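- Define the test model $M_{\text{test}}$, the grader model $M_{\text{grader}}$, the domain $d$, and the signal set $S$.
- For each signal $s \in S$, generate prompts enforcing "show, don’t tell".
- Invoke $M_{\text{test}}$ for $n$ samples per signal $s$.
- An explicit-mention filter screens the outputs; any output that violates the constraint is regenerated.
- Outputs are passed to $M_{\text{grader}}$ for forced-choice selection from $S$.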
- Rates, confusion matrices, and optional difficulty indices are computed and reported.
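The steps above can be sketched end to end with stub models standing in for real LLM calls. Everything here is hypothetical: the prompt wording, the whole-word regular-expression filter, the retry limit, and the rule that an output still explicit after all retries counts as a miss are assumptions for illustration, and none of the names come from the library's API.

```python
import re
from typing import Callable, Dict, Sequence

# Hypothetical interfaces; real models would sit behind API calls.
Generator = Callable[[str], str]               # prompt -> generated text
Grader = Callable[[str, Sequence[str]], str]   # (text, options) -> chosen signal


def build_prompt(domain: str, signal: str) -> str:
    # Assumed wording; the actual prompt template is not given in the text.
    return (
        f"Write a {domain} that conveys '{signal}'. "
        f"Show, don't tell: never use the word '{signal}'."
    )


def mentions_signal(text: str, signal: str) -> bool:
    """True if the signal appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(signal)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def run_experiment(
    generate: Generator,
    grade: Grader,
    domain: str,
    signals: Sequence[str],
    n_samples: int,
    max_retries: int = 5,
) -> Dict[str, float]:
    """Per-signal expressivity rates with explicit-mention filtering."""
    if n_samples < 1:
        raise ValueError("n_samples must be positive")
    rates = {}
    for signal in signals:
        prompt = build_prompt(domain, signal)
        hits = 0
        for _ in range(n_samples):
            text = generate(prompt)
            retries = 0
            # Regenerate while the output names the signal outright.
            while mentions_signal(text, signal) and retries < max_retries:
                text = generate(prompt)
                retries += 1
            if mentions_signal(text, signal):
                continue  # still explicit after all retries: scored as a miss
            if grade(text, signals) == signal:
                hits += 1
        rates[signal] = hits / n_samples
    return rates


# Stub models so the sketch runs without any API access.
CANNED = {
    "joy": "Sunlight spills across the floor and I cannot stop humming.",
    "grief": "Her coat still hangs by the door; I set out two cups from habit.",
}


def stub_generate(prompt: str) -> str:
    for signal, text in CANNED.items():
        if f"'{signal}'" in prompt:
            return text
    raise KeyError("no canned text for this prompt")


def stub_grade(text: str, options: Sequence[str]) -> str:
    return "joy" if "humming" in text else "grief"


rates = run_experiment(stub_generate, stub_grade, "poem", ["joy", "grief"], n_samples=3)
print(rates)
```

Keeping the generator and grader as plain callables means the same loop works whether the grader is a single model, a jury, or a human annotation step.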
ExpressivityArena supports both single-turn and multi-turn conversational scenarios, capturing expressivity drift and signal leakage over dialogue iterations.
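One way to picture the multi-turn case is as an expressivity rate computed separately for each dialogue turn. The sketch below assumes graded records of the form (turn index, intended signal, predicted signal); that record layout and the function name `expressivity_by_turn` are illustrative assumptions, not the library's reporting format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expressivity_by_turn(
    records: Iterable[Tuple[int, str, str]]
) -> Dict[int, float]:
    """Expressivity rate per dialogue turn, ordered by turn index."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for turn, intended, predicted in records:
        totals[turn] += 1
        hits[turn] += int(intended == predicted)
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}


# Hypothetical graded conversation samples for one emotional signal.
records = [
    (1, "joy", "joy"), (1, "joy", "joy"),
    (2, "joy", "joy"), (2, "joy", "fear"),
]
print(expressivity_by_turn(records))
```

A falling series would correspond to drift away from the intended signal, while a rising one would be consistent with the signal becoming easier to infer as the dialogue proceeds.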
3. Domain-Specific Methodologies and Experimental Designs
ExpressivityArena is applied in diverse domains with tailored evaluation protocols:
- LLMs: Tasks include poetry generation (28 emotions, 34 poet styles), code generation (skill levels, paradigms), and multi-turn profession/emotion conversation. Prompts enforce non-explicit signaling; graders are leading LLMs (e.g., GPT-4o, Llama3, Gemma) or human annotators. Jury grading is employed for robustness (Tint et al., 2024).
- Music Co-Creation ("Expressive Communication"): Composers create musical phrases (15 s) in response to prompts ("cards" with image and keywords) using variable model/interface conditions. Composer self-reports (Likert scales on expressiveness, ownership, efficacy, etc.) and blinded listener judgments (forced-choice, five-point scale) are jointly analyzed for communication effectiveness (Louie et al., 2021).
- Procedural Content Generation (PCG): ExpressivityArena overlaps conceptually with Expressive Range Analysis (ERA), where the generator's output space is projected via pairs of uncorrelated metrics to reveal diversity and coverage (see Section 4). Dynamic, mixed-initiative tools use ExpressivityArena-like dashboards to make diversity, coverage, and fitness statistics accessible in real time (Withington et al., 2023, Alvarez et al., 2020).
4. Related Evaluation Frameworks: Expressive Range Analysis and MAP-Elites
ExpressivityArena's formalism is conceptually allied to Expressive Range Analysis (ERA) and quality-diversity algorithms such as MAP-Elites:
- ERA: A 2D visualization technique, where each artifact is mapped via two descriptive metrics, enabling inspection of a generator's coverage in its possibility space (Withington et al., 2023). Best-practice metric selection is governed by mutual independence (Pearson correlation), distribution evenness (FI), and coverage of alternate metrics (AMC). This systematic screening counters the pitfalls of arbitrary axis selection and redundancy.
- Interactive MAP-Elites: Implements an illumination algorithm that maintains diversity across multiple behavior dimensions, emphasizing "expressive range" as the extent to which diverse, high-quality artifacts populate the feature space. Interactive, mixed-initiative variants allow real-time feature-space exploration and immediate feedback on coverage, fitness, and uniqueness (Alvarez et al., 2020).
- Music Model/Interface Arena: In the music communication variant, the framework enables controlled side-by-side evaluation of generative model expressivity and interface steerability, with both composer and listener perspectives rigorously quantified (Louie et al., 2021).
These frameworks share the methodological principle of quantifying the diversity, coverage, and informativeness of generative models' output spaces under varying constraints and user interventions.
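Two of the ideas behind expressive range analysis, checking that a pair of metrics is not redundant and measuring how much of the 2D metric space a generator reaches, can be sketched as follows. This is a simplified illustration: the grid-occupancy fraction is a stand-in and is not the FI or AMC measure mentioned above, the metrics are assumed to be normalized to [0, 1], and the metric names and values are invented.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def grid_coverage(xs: Sequence[float], ys: Sequence[float], bins: int = 10) -> float:
    """Fraction of cells in a bins x bins grid over [0, 1]^2 holding at least one artifact."""
    occupied = set()
    for x, y in zip(xs, ys):
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError("metrics must be normalized to [0, 1]")
        # Clamp so that a value of exactly 1.0 falls in the last cell.
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        occupied.add((col, row))
    return len(occupied) / (bins * bins)


# Hypothetical per-artifact metric values for five generated levels.
linearity = [0.1, 0.4, 0.5, 0.8, 0.9]
density = [0.7, 0.2, 0.9, 0.3, 0.6]
print(pearson(linearity, density), grid_coverage(linearity, density, bins=5))
```

A correlation near zero suggests the two axes carry different information, and a higher occupied fraction suggests a wider expressive range under this particular pair of metrics.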
5. Empirical Findings and Quantitative Results
Systematic experiments demonstrate significant variation in expressive capacity across domains, models, and task types (Tint et al., 2024):
| Domain-Task | Best expressivity rate (model) | Worst expressivity rate (model) |
|---|---|---|
| Poetry–Emotions | 0.70 (Llama2) | 0.59 (Gemma) |
| Poetry–Poets | 0.70 (GPT-4, Llama3) | 0.53 (Gemma) |
| Code–Skill-Level | 0.54 (GPT-4) | 0.31 (Gemma) |
| Code–Paradigm | 0.83 (GPT-4o) | 0.50 (Gemma) |
Additional findings include:
- Higher expressivity is consistently observed in creative (e.g., poetry) versus technical (e.g., code) domains.
- Code paradigm can be signaled more effectively than skill level, with substantial model variance.
- Expressivity in conversation degrades for emotional signals over successive turns (e.g., GPT-3.5: ~0.64→0.50); profession expressivity increases (signal leakage over dialogue).
- Top automated graders (e.g., GPT-4o) match or exceed mean human accuracy (human: 0.65–0.75; GPT-4o: ~0.77–0.79), with qualitative agreement in confusion patterns.
- Biases in grading (e.g., misclassification of female poets, semantic overlap in emotions) are evident from confusion matrices.
- In music co-creation, both more expressive models (e.g., Music Transformer) and more steerable interfaces (chunkwise selection with semantic filtering) yield quantifiably higher listener agreement and self-reported creative empowerment (Louie et al., 2021).
6. Practical Implications, Design Principles, and Limitations
ExpressivityArena reveals uneven performance and specific limitations in generative systems:
- Fine-tuning objectives may require targeting implicit communication directly (e.g., by penalizing over-explicitness or optimizing for human-like pragmatics).
- Low expressivity in code generation warns that model-authored code may lack desired idiomatic or stylistic fidelity, necessitating prompt engineering or specialized training data.
- Detected biases and confusion patterns in style (gender, semantic overlaps) underscore the need for balanced representation in both training and evaluation data.
- In dialogue, multi-turn expressivity suffers from signal drift or leakage; tracking time-series expressivity highlights this deficit and motivates research on expressivity retention mechanisms.
Pitfalls include grader bias (LLMs may over-infer subtle cues not evident to humans) and insufficient coverage of expressive domains such as metaphor or sarcasm, which remain to be systematically benchmarked (Tint et al., 2024).
7. Software Usage and Availability
The ExpressivityArena library is open source and accessible at https://github.com/asu-nlp/ExpressivityArena. It is installed from PyPI with pip, and its minimal usage example covers poetry-emotion expressivity evaluation.
Comprehensive APIs support domain generalization, ensemble grading, explicit-mention detection and re-query, confusion matrix plotting, and conversational expressivity time series reporting. All evaluation results, including confusion matrices and coverage/time-series plots, are generated as part of the canonical workflow (Tint et al., 2024).
ExpressivityArena thereby delivers a generalizable, quantifiable, and extensible standard for benchmarking implicit communication in generative AI, bridging linguistics, music, language modeling, and procedural content generation with unified, reproducible protocols and metrics.