Alternative Use Task (AUT)
- The Alternative Use Task (AUT) is a divergent thinking measure that evaluates creativity by prompting novel uses for everyday objects.
- It employs stratified creativity prompts to elicit responses across varying levels of originality, practicality, and surprise.
- Empirical studies using the AUT reveal that while humans excel in semantic novelty and surprise, LLM responses are rated as more useful, and LLM-based evaluation attains high inter-rater reliability and Spearman rank correlation (SRC) scores against oracle benchmarks.
The Alternative Use Task (AUT) is a prominent measure in the empirical study of divergent thinking, widely employed to evaluate creativity in both humans and large language models (LLMs). Its core procedure requires subjects to generate novel uses for everyday objects, with creativity assessed through standardized rating or ranking methodologies. AUT research has evolved from human-centered scoring paradigms to sophisticated LLM benchmarking frameworks, enabling systematic and scalable assessment of creative output across agents and modalities (Rabeyah et al., 2024; Stevenson et al., 2022).
1. Concept and Structure of the AUT
The AUT operationalizes divergent thinking by eliciting a wide array of novel, functional applications for commonplace items. Participants—human or artificial—are prompted to list alternative uses for a given object (e.g., "fork," "book," "tin can"), typically under constrained time or list-length conditions. The evaluative focus is not solely on the quantity of responses but on their originality (novelty), flexibility (conceptual breadth), usefulness (practicality), and surprise (unexpectedness). This framework has been foundational in quantifying individual and algorithmic creative potential (Stevenson et al., 2022).
2. Creativity Tiering and Prompt Engineering
To rigorously test creative capacity and evaluative impartiality, AUT research frequently employs stratified creativity prompting. Al Rabeyah et al. (Rabeyah et al., 2024) delineate three non-overlapping creativity levels using explicit prompt variations:
- Common: A static prompt (e.g., “Create a list of 5 common uses for [object]”) elicits typical or frequently observed uses.
- Creative: A prompt encouraging "average creative" uses motivates responses outside the norm without explicit pressure (e.g., “Create a list of 5 creative alternative uses…”).
- Highly Creative: The base prompt is followed by a series of "forceful" augmentations (e.g., “Is this the best you can do? Try harder.”), driving models or participants to generate highly unorthodox but plausible responses.
This hierarchical prompt engineering enables systematic calibration of creative difficulty and response space across test items and agents. The resulting distribution of responses across creativity levels facilitates robust evaluation, comparison, and the construction of ground-truth benchmarks.
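The tiering can be expressed as a small prompt-construction helper. The Python sketch below is illustrative: the template wording and the "forceful" follow-ups mirror the examples above but are not the verbatim prompts from Rabeyah et al. (2024).

```python
# Minimal sketch of stratified AUT prompt construction for the three
# creativity tiers. Wording of templates and follow-ups is an assumption
# modeled on the examples in the text, not the paper's exact prompts.

FORCEFUL_FOLLOW_UPS = [
    "Is this the best you can do? Try harder.",
]

def build_aut_prompts(obj: str, n_uses: int = 5) -> dict:
    """Return one prompt (or prompt sequence) per creativity tier for `obj`."""
    return {
        "common": f"Create a list of {n_uses} common uses for a {obj}.",
        "creative": f"Create a list of {n_uses} creative alternative uses for a {obj}.",
        # The highly creative tier is a multi-turn sequence: the creative
        # prompt followed by forceful augmentations.
        "highly_creative": [
            f"Create a list of {n_uses} creative alternative uses for a {obj}.",
            *FORCEFUL_FOLLOW_UPS,
        ],
    }

if __name__ == "__main__":
    for tier, prompt in build_aut_prompts("tin can").items():
        print(tier, "->", prompt)
```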
3. Evaluation Metrics and Oracle Benchmarking
AUT scoring conventionally relies on both human expert judgment and computational proxies. Expert judges rate each response for originality, usefulness, surprise, and flexibility using Likert-type scales, yielding scores such as $o_i$, $u_i$, and $s_i$ for the $i$-th response. Inter-rater reliability for these criteria commonly exceeds ICC = .7 (Stevenson et al., 2022). Additionally, automated metrics such as semantic distance are employed: for response $r_i$ and object $o$,

$$d_{\mathrm{sem}}(r_i, o) = 1 - \frac{\mathbf{e}(r_i) \cdot \mathbf{e}(o)}{\lVert \mathbf{e}(r_i) \rVert \, \lVert \mathbf{e}(o) \rVert},$$

where $\mathbf{e}(x)$ denotes the embedding vector of $x$ (Stevenson et al., 2022).
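A minimal sketch of this metric follows, assuming cosine distance over some sentence embedder; the toy `embed` function is a stand-in so the example runs without an external embedding model.

```python
# Illustrative semantic-distance computation (cosine distance between
# embeddings). `embed` is a toy bag-of-characters stand-in for a real
# sentence encoder.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real sentence embedder (e.g., a transformer encoder)."""
    vocab = "abcdefghijklmnopqrstuvwxyz"
    vec = np.zeros(len(vocab))
    for ch in text.lower():
        if ch in vocab:
            vec[vocab.index(ch)] += 1.0
    return vec

def semantic_distance(response: str, obj: str) -> float:
    """Cosine distance between response and object embeddings."""
    e_r, e_o = embed(response), embed(obj)
    cos = float(e_r @ e_o / (np.linalg.norm(e_r) * np.linalg.norm(e_o) + 1e-12))
    return 1.0 - cos

if __name__ == "__main__":
    print(semantic_distance("use it as a plant pot", "tin can"))
```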
A pivotal methodological advance is the construction of an "oracle" evaluation benchmark (Rabeyah et al., 2024). Here, responses from each creativity level and model are pooled and ordered in a canonical sequence: all highly creative responses first, then creative, then common, across all models tested. This fixed ordering enables objective alignment measurement for both scoring (average numerical score per group) and ranking (ordinal position per group).
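The oracle construction amounts to a deterministic re-ordering of pooled responses. The sketch below assumes a `{model: {tier: [responses]}}` layout, which is an illustrative data structure rather than the paper's exact format.

```python
# Sketch of assembling the fixed "oracle" ordering: highly creative responses
# first, then creative, then common, pooled across all models.

TIER_ORDER = ["highly_creative", "creative", "common"]

def build_oracle_sequence(responses_by_model: dict) -> list:
    """Return responses in the canonical oracle order with tier and model labels."""
    sequence = []
    for tier in TIER_ORDER:                      # most creative tier first
        for model, tiers in responses_by_model.items():
            for resp in tiers.get(tier, []):
                sequence.append({"model": model, "tier": tier, "response": resp})
    return sequence
```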
4. LLM Evaluation Protocols
Recent research has generalized the AUT paradigm to LLMs both as respondents and as impartial evaluators. Al Rabeyah et al. (Rabeyah et al., 2024) evaluate four state-of-the-art commercial LLMs—Claude 3.5 Sonnet, Gemini 1.5 Flash, ChatGPT-4o, and ChatGPT-4—using the following dual-mode evaluation scheme:
- Scoring: Each model assigns a score to every alternative use, reflecting creativity level.
- Ranking: Each model orders the presented list of responses, assigning each a rank (1 = most creative).
To test evaluation robustness, two item-presentation settings are employed:
- Comprehensive: All 60 responses for each object are presented at once.
- Segmented: Responses are split into groups of 12, each group evaluated separately; ranks (or scores) are then aggregated.
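The segmented protocol can be sketched as follows. The `rank_group` judging call is a hypothetical stand-in for an LLM API request, and concatenating the per-group orderings is one simple aggregation choice, not necessarily the exact scheme used in Rabeyah et al. (2024).

```python
# Sketch of segmented ranking: split responses into groups of 12, rank each
# group with the judge model, then aggregate into global ranks.

def chunk(items, size=12):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def rank_group(judge, group):
    """Hypothetical: ask `judge` (an LLM) to re-order `group`, most creative first."""
    return judge(group)

def segmented_ranking(judge, responses, group_size=12):
    """Rank each segment separately, then concatenate segment orderings."""
    aggregated = []
    for group in chunk(responses, group_size):
        aggregated.extend(rank_group(judge, group))
    # Global ranks follow the concatenated order (1 = most creative).
    return {resp: rank for rank, resp in enumerate(aggregated, start=1)}
```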
Spearman’s Rank Correlation Coefficient (SRC) quantifies agreement with the oracle and between models:

$$\rho = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)},$$

where $d_i$ is the item-wise difference in ordering (the difference between the ranks assigned to the $i$-th response) and $n$ is the number of ranked items (Rabeyah et al., 2024).
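For concreteness, a short worked example of the SRC computation, using the closed-form formula above (which assumes no tied ranks) and cross-checked against SciPy:

```python
# Spearman rank correlation: closed-form formula vs. scipy cross-check.
import numpy as np
from scipy.stats import spearmanr

def src(rank_a, rank_b):
    """Spearman's rho via 1 - 6*sum(d_i^2) / (n*(n^2 - 1)), assuming no ties."""
    d = np.asarray(rank_a) - np.asarray(rank_b)
    n = len(d)
    return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

if __name__ == "__main__":
    oracle_ranks = [1, 2, 3, 4, 5, 6]
    model_ranks  = [2, 1, 3, 4, 6, 5]
    print(src(oracle_ranks, model_ranks))                 # closed form: ~0.886
    print(spearmanr(oracle_ranks, model_ranks)[0])        # scipy equivalent
```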
5. Empirical Findings: Human vs. LLM and Inter-model Consensus
Comparative studies establish several core results:
- In direct GPT-3 vs. human comparisons, humans systematically outperform GPT-3 (davinci-002) on originality, surprise, semantic distance, and average flexibility; GPT-3, however, generates responses rated significantly more useful (Stevenson et al., 2022).
- Inter-group trade-offs are consistently observed, with a strong negative Pearson correlation between originality and utility for both humans and GPT-3 (Stevenson et al., 2022).
- Variance in flexibility is higher for GPT-3, yet uncorrelated with sampling temperature, suggesting stochastic prompt sensitivity (Stevenson et al., 2022).
- Modern LLMs (Claude 3.5 Sonnet, Gemini 1.5 Flash, ChatGPT-4o/4), when used as judges, reach high SRC with the oracle benchmark: roughly 0.95–0.97 in scoring and 0.77–0.95 in ranking under comprehensive evaluation, and 0.70–0.95 under segmented settings (see the table below) (Rabeyah et al., 2024).
- Average inter-model SRCs are likewise high across test conditions, indicating strong cross-model consensus and stability (Rabeyah et al., 2024).
Summary of key Spearman correlation results from (Rabeyah et al., 2024):
| Setting & Measure | Model–Oracle SRC (range) |
|---|---|
| Comprehensive Score | 0.95–0.97 |
| Comprehensive Rank | 0.77–0.95 |
| Segmented Score | 0.85–0.95 |
| Segmented Rank | 0.70–0.85 |
6. Self-Bias and Impartiality in Evaluation
Evaluation for self-bias tests whether an LLM rates its own generated outputs higher than those of competing models. Analysis across all judge–generator combinations reveals no systematic favoring of self-generated responses: standard deviations of creativity scores remain low, and cross-model SRCs show no spike when a model evaluates its own outputs (Rabeyah et al., 2024). This impartiality supports the use of LLMs as unbiased evaluators, dispelling concerns of "in-group" model bias in creative assessment.
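A minimal sketch of such a self-bias check follows, assuming a `scores[judge][generator]` mapping of creativity scores; this layout is an illustrative assumption, not the paper's data format.

```python
# Compare the mean score each judge gives to its own responses vs. responses
# generated by other models; a large gap would indicate self-bias.
import statistics

def self_bias_report(scores: dict) -> dict:
    """For each judge, report mean score for own vs. others' responses."""
    report = {}
    for judge, by_generator in scores.items():
        own = by_generator.get(judge, [])
        others = [s for gen, vals in by_generator.items() if gen != judge for s in vals]
        report[judge] = {
            "own_mean": statistics.mean(own) if own else None,
            "others_mean": statistics.mean(others) if others else None,
        }
    return report
```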
7. Implications and Applications
LLM-based AUT evaluation frameworks offer robust, scalable, and reproducible alternatives to human subject rating, achieving high alignment with human-inspired oracles and consistency across deployment parameters (Rabeyah et al., 2024). Comprehensive list evaluation marginally improves evaluator reliability, yet segmented approaches remain effective, facilitating parallelized assessment of large candidate pools.
The persistent distinction between human and LLM generative performance—humans excelling in semantic novelty, category breadth, and surprise, LLMs excelling in utility—suggests current LLMs remain better at plausible inference than at unconstrained divergent thinking (Stevenson et al., 2022). These findings validate the AUT as a rigorous benchmark for both creativity modeling and automation of creativity assessment, with immediate impact on education, design, and computational creativity research.
A plausible implication is that future models may benefit from innovations explicitly targeting semantic distance and flexibility, if the goal is to consistently match or surpass human-level divergent thinking on standardized measures such as the AUT.