Consensual Assessment Technique & LLM Adaptations
- Consensual Assessment Technique (CAT) is a framework that evaluates creativity through domain-expert holistic judgments without fixed criteria.
- It uses numerical rating scales and inter-rater reliability metrics, such as Cronbach’s α and ICC, to aggregate expert evaluations.
- Recent adaptations integrate advanced LLMs and forced-choice prompts to perform scalable, computational creativity assessments.
The Consensual Assessment Technique (CAT) is a methodological framework for evaluating creativity through domain-expert holistic judgments without predefined subcriteria. Originally introduced by Amabile (1983), CAT requires multiple independent experts to rate a shared set of outputs using a numerical scale and assesses reliability via inter-rater metrics such as Cronbach’s α and intraclass correlation. Recent work has adapted CAT for computational evaluation, employing advanced LLMs as “judges” and formalizing prompt-based comparative workflows that retain CAT’s core principles (Sawicki et al., 26 Feb 2025).
1. Definitional Basis and Theoretical Foundations
The Consensual Assessment Technique prescribes that:
- Domain experts independently review artifacts (e.g., poems, artworks).
- Judgments are holistic, eschewing fixed subcriteria to minimize bias from rubric imposition.
- Numeric rating scales (minimum 3 points) are used for each output.
- All evaluators assess the identical set of items to allow direct score aggregation.
- Reliability of assessments is analyzed via inter-rater metrics, typically Cronbach’s α, but increasingly via ICC (Intraclass Correlation Coefficient).
CAT’s central tenet is that creativity is best operationalized by consensus among informed raters whose expertise permits implicit application of relevant evaluative standards. This conceptual architecture prioritizes subjective, context-dependent interpretation while allowing systematic aggregation and reliability quantification.
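As a concrete illustration of the reliability step, the following Python sketch computes Cronbach's α for a shared artifacts-by-judges rating matrix (toy data, not from any cited study); ICC variants would be computed from the same matrix.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for an (n_artifacts x n_judges) rating matrix.

    Judges play the role of 'items'; higher alpha indicates stronger
    agreement among judges across the shared set of artifacts.
    """
    _, n_judges = ratings.shape
    judge_vars = ratings.var(axis=0, ddof=1)      # variance of each judge's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores per artifact
    return (n_judges / (n_judges - 1)) * (1 - judge_vars.sum() / total_var)

# Toy example: 5 poems rated by 3 judges on a 1-5 scale (hypothetical data).
ratings = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 4],
])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")  # ~0.95 for this toy matrix
```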
2. Adaptations for Computational Judging: LLM-CAT Protocol
In contemporary research, CAT has been adapted to compare human and computational judges via forced-choice and in-context rating strategies. Sawicki et al. (26 Feb 2025) operationalized “LLM-CAT” with two advanced LLMs (Claude-3-Opus, GPT-4o) and a crowdsourced panel of non-expert humans:
- LLMs and humans rated each poem along five explicit criteria: holistic creativity, overall poetic quality, innovativeness (“not like other poems I have seen before”), similarity (converse of innovativeness), and poeticness (“this text is a poem”).
- For “Creativity” and “Quality,” prompts avoided decomposition, mirroring CAT’s “no set criteria” doctrine, so LLMs could apply their internal representations.
- Human crowdworkers were provided minimal domain training; LLMs were prompted with temperature=1.0 to induce output variability.
This adaptation preserves CAT’s holistic ethos and allows analysis of LLM reliability and performance relative to human panels.
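In practice, such a protocol reduces to prompt construction plus score extraction. The snippet below is a minimal sketch under assumed prompt wording and reply format (not the exact prompts used by Sawicki et al.); the actual LLM API call is omitted.

```python
import random
import re

def build_batch_prompt(poems: list[str], criterion: str = "creativity") -> str:
    """Assemble a holistic, CAT-style batch prompt: no subcriteria are given,
    the judge rates each poem 1-5 and must use both ends of the scale."""
    header = (
        f"Rate each of the following {len(poems)} poems for {criterion} "
        "on a 1-5 scale (1 = least, 5 = most). Use the full scale: at least "
        "one poem must receive a 1 and at least one a 5. "
        "Reply with one line per poem in the form 'k. Score: s'."
    )
    body = "\n\n".join(f"Poem {i + 1}:\n{p}" for i, p in enumerate(poems))
    return f"{header}\n\n{body}"

def parse_scores(reply: str, n_poems: int) -> dict[int, int]:
    """Extract 'k. Score: s' lines from the judge's reply (assumed format)."""
    scores = {int(k): int(s) for k, s in re.findall(r"(\d+)\.\s*Score:\s*([1-5])", reply)}
    assert set(scores) == set(range(1, n_poems + 1)), "incomplete reply"
    return scores

# Usage: shuffle the batch each run so later aggregation averages over positions,
# then send build_batch_prompt(batch) to the LLM judge (API call omitted here).
batch = ["Poem text A ...", "Poem text B ...", "Poem text C ..."]
random.shuffle(batch)
prompt = build_batch_prompt(batch)
```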
3. Rating Scales, Forced-Choice Protocols, and Score Aggregation
Rating protocols include:
- LLMs: 1–5 integer scale per criterion (“1 = least X,” “5 = most X”), with forced use of the full scale within each batch (min–max constraint).
- Two extraction modes from LLM output:
- Numeric scores (“3. Creativity Score …”).
- Rank-based scores derived from ordered lists (the first item receives N points for a batch of size N, decrementing by one per position).
- Human judges (Lamb et al. 2015) employed Likert-type scales over nine criteria with subsequent quantitative aggregation.
Score aggregation differs:
- For humans, criterion scores were averaged across all judges for each artifact.
- For LLMs, protocols included both single-run categorical assignment (ABC classification) and repeated batch evaluation with arithmetic mean aggregation per poem per criterion across multiple randomized appearances.
A summary table for LLM scoring protocols:
| Protocol | Scale | Aggregation |
|---|---|---|
| Numeric scoring | 1–5 | Mean over runs/batches |
| Rank-list | Rank position within a batch of 15 or 90 poems | Ordered rank values (top item = N points) |
The forced-choice “rank-list” prompt induces a comparative context, consistent with CAT’s principle that evaluation is best performed among competing samples.
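A minimal sketch of the rank-list-to-score conversion and the repeated-batch mean aggregation, using hypothetical poem IDs and a 3-poem batch for brevity:

```python
from collections import defaultdict
from statistics import mean

def rank_list_scores(ordered_ids: list[str]) -> dict[str, int]:
    """Convert a ranked list (best first) into scores: the top item gets
    N points for a batch of size N, decrementing by one per position."""
    n = len(ordered_ids)
    return {poem_id: n - pos for pos, poem_id in enumerate(ordered_ids)}

def aggregate_over_runs(runs: list[dict[str, int]]) -> dict[str, float]:
    """Arithmetic mean per poem across repeated, randomized batch evaluations."""
    pooled = defaultdict(list)
    for run in runs:
        for poem_id, score in run.items():
            pooled[poem_id].append(score)
    return {poem_id: mean(scores) for poem_id, scores in pooled.items()}

# Toy example with two randomized 3-poem batches (hypothetical rankings).
run1 = rank_list_scores(["p2", "p1", "p3"])   # p2 ranked best -> 3 points
run2 = rank_list_scores(["p1", "p2", "p3"])
print(aggregate_over_runs([run1, run2]))      # {'p2': 2.5, 'p1': 2.5, 'p3': 1.0}
```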
4. Statistical Analysis and Reliability Quantification
Assessment of reliability and performance uses:
- Spearman’s rank correlation:
  $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$,
  where $d_i$ is the rank difference between the LLM ranking and the ground-truth ranking for item $i$ and $n$ is the number of items; used to quantify alignment of LLM rankings with publication-venue-based ground truth.
- One-way ANOVA:
  Null hypothesis $H_0: \mu_A = \mu_B = \mu_C$; batch means for categories A, B, and C are compared, with a Bonferroni-corrected significance level α.
- Intraclass Correlation Coefficient (ICC), following Shrout & Fleiss (1979):
ICC(1,k), ICC(2,k), and ICC(3,k) were calculated across repeated LLM evaluations. Observed ICCs (0.90–0.99) indicate exceptionally high run-to-run consistency, even under stochastic sampling conditions (temperature=1.0).
These measures allow thorough quantification of inter-rater reliability and categorical discrimination, supporting rigorous evaluation of scoring method effectiveness.
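The sketch below reproduces this analysis pipeline on synthetic stand-in data, assuming scipy for Spearman’s ρ and the ANOVA and the third-party pingouin package for the Shrout & Fleiss ICC variants; none of the numbers correspond to the study’s results.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, f_oneway
import pingouin as pg  # provides Shrout & Fleiss ICC variants

rng = np.random.default_rng(0)

# Synthetic stand-in data: 9 poems, 3 venue categories (A > B > C), 4 repeated LLM runs.
ground_truth = np.repeat([3, 2, 1], 3)                       # ordinal coding of A/B/C
llm_runs = np.clip(ground_truth + rng.normal(0, 0.5, (4, 9)), 1, 5)
llm_mean = llm_runs.mean(axis=0)                             # mean score per poem

# Spearman's rho between mean LLM scores and the venue-based ordering (ties handled).
rho, p = spearmanr(llm_mean, ground_truth)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")

# One-way ANOVA: do mean LLM scores differ across categories A, B, C?
a, b, c = llm_mean[:3], llm_mean[3:6], llm_mean[6:]
print(f"ANOVA: F = {f_oneway(a, b, c).statistic:.2f}")

# ICC across repeated runs (runs as 'raters', poems as 'targets') in long format.
long = pd.DataFrame({
    "poem": np.tile(np.arange(9), 4),
    "run": np.repeat(np.arange(4), 9),
    "score": llm_runs.ravel(),
})
icc = pg.intraclass_corr(data=long, targets="poem", raters="run", ratings="score")
print(icc[["Type", "ICC"]])
```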
5. Ground Truth, Comparative Baselines, and Quantitative Findings
Ground truth for “creativity” was defined as poem publication venue:
- “Good” (A): 30 poems from Poetry magazine.
- “Medium” (B): 30 poems from mid-tier magazines.
- “Bad” (C): 30 poems from an unmoderated amateur site.
Performance metric: Spearman’s ρ between model-derived rankings and the ground-truth category ordering.
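A short sketch of how the venue-based ground truth can be encoded as tied ordinal ranks for correlation with model-derived rankings (the A/B/C-to-3/2/1 mapping is an illustrative assumption):

```python
import numpy as np
from scipy.stats import rankdata

# Venue-based ground truth for 90 poems: A (Poetry magazine), B (mid-tier), C (amateur).
labels = ["A"] * 30 + ["B"] * 30 + ["C"] * 30
ordinal = np.array([{"A": 3, "B": 2, "C": 1}[x] for x in labels])

# Tied ranks within each 30-poem category; these feed the Spearman correlation
# against model-derived rankings (ties are resolved by mid-ranking).
gt_ranks = rankdata(ordinal)
print(sorted(set(gt_ranks)))   # [15.5, 45.5, 75.5]
```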
Key results:
| Evaluator & Mode | Accuracy (%) | Spearman’s ρ |
|---|---|---|
| GPT-4o, ABC classification (Exp. 1) | 52.2 | 0.62 |
| Claude-3-Opus, ABC classification (Exp. 1) | 60 | 0.57 |
| Human “Novelty” (Lamb et al. 2015) | — | 0.38 |
| Claude-3-Opus, 15-poem batches, “Quality” | — | 0.87 |
| GPT-4o, 15-poem batches, “Innovative” | — | 0.75 |
All LLM correlations substantially exceed the best human criterion (Novelty, ρ = 0.38). The highest reliability is observed for smaller batches (15 poems) and rank-based ordering. ICCs for LLMs consistently exceed 0.90, denoting high run-to-run reliability.
6. Methodological Implications and Domain Transferability
CAT’s adaptation with LLMs demonstrates:
- LLMs matched or exceeded non-expert humans in alignment with publication-based ground truth under holistic creativity protocols when appropriately prompted.
- Reliability is maximized using smaller batches, enforced min-max scoring, and ordering-based evaluation.
- Rank-based scoring generally outperforms numeric aggregation for context-dependent tasks.
A plausible implication is that high-capacity LLMs, given well-structured forced-choice prompts and repeated randomized comparative contexts, can serve as reliable, scalable evaluators in creativity assessment tasks. This approach is extensible to other domains—short fiction, design artifacts, music—whenever externally validated ground truths (e.g., awards, publications, expert panels) are available for quantitative benchmarking.
7. Limitations, Scope, and Future Directions
While LLM-CAT matches or surpasses human panel performance in poetry within this protocol, several boundary conditions remain:
- The “expert” status of LLMs is operational but not grounded in domain credentials; this diverges from classic CAT, which relies on human expertise.
- CAT’s generalization depends on domain properties, artifact variability, and ground-truth robustness.
- Reliability is sensitive to batch size and prompt engineering; excessively large batches reduce discrimination accuracy, while forced-choice increases alignment.
- Extending CAT to entirely novel creative genres may require domain-specific calibration.
Broader implications suggest computational CAT implementations could transform large-scale creative evaluation, contingent upon careful prompt design, rigorous reliability metrics, and robust ground-truth construction (Sawicki et al., 26 Feb 2025).