Evaluation of LLMs' Creative Capabilities
The paper by Luning Sun and colleagues investigates the creativity of LLMs, namely GPT-3.5, GPT-4, Claude, Qwen, and SparkDesk, using a comprehensive suite of tasks to benchmark them against human participants. Through a multi-faceted approach, the paper assesses both the individual and collective creativity of these models across three domains: divergent thinking, problem solving, and creative writing. Crucially, it extends the existing literature by considering collective creativity, a novel approach in LLM evaluation.
In their methodology, the authors administer 13 distinct creative tasks across the three domains to both human participants and LLMs. The results reveal that while LLMs generally outperform humans on problem solving and divergent thinking tasks, they lag significantly behind in creative writing. For instance, GPT-4 reaches the 52nd percentile when benchmarked against human performance across tasks, with notable strength in divergent thinking and problem solving, where it ranks above the 50th percentile. Conversely, creative writing remains a challenge: models rank on average around the 25th percentile, reflecting the complexity and nuance of a domain that appears less amenable to current LLM capabilities.
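To make the percentile comparison concrete, the sketch below computes a model's percentile rank against a pool of human scores for each task domain and averages across domains. It is illustrative only: the scores are synthetic, and the paper's exact scoring pipeline may differ.

```python
import numpy as np

def percentile_rank(model_score, human_scores):
    """Percentage of human scores the model meets or exceeds."""
    human_scores = np.asarray(human_scores)
    return 100.0 * np.mean(human_scores <= model_score)

# Hypothetical human score distributions per domain (not the paper's data).
rng = np.random.default_rng(0)
human_scores = {
    "divergent_thinking": rng.normal(50, 10, size=500),
    "problem_solving":    rng.normal(50, 10, size=500),
    "creative_writing":   rng.normal(50, 10, size=500),
}
# Hypothetical model scores on the same tasks.
model_scores = {"divergent_thinking": 56.0, "problem_solving": 54.0, "creative_writing": 43.0}

per_domain = {d: percentile_rank(model_scores[d], human_scores[d]) for d in human_scores}
overall = np.mean(list(per_domain.values()))
print(per_domain, f"overall ≈ {overall:.0f}th percentile")
```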
The concept of collective creativity emerges as a significant area of interest in the paper. Here, the authors pool multiple responses from a single LLM and compare the pooled output to that of groups of humans. They find that Claude and GPT-4, when queried multiple times, exhibit collective creativity comparable to a group of 10 humans on problem solving tasks. This is particularly noteworthy given the cost efficiency and speed of LLM generation compared to human brainstorming sessions. However, the paper also notes that the collective strength of LLMs in creative writing is weak, requiring a disproportionate number of responses to rival even a single human.
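A minimal sketch of what such pooling could look like, assuming a simple best-of-k aggregation over hypothetical per-response originality scores (the paper's actual scoring and pooling procedure may differ):

```python
import random

def pooled_best(scores, k):
    """Pool k sampled responses and keep the best originality score,
    one simple way to operationalise 'collective' performance."""
    return max(random.sample(scores, k))

# Hypothetical originality scores (not the paper's data).
llm_scores   = [random.gauss(0.55, 0.10) for _ in range(200)]   # repeated queries to one model
human_scores = [random.gauss(0.60, 0.15) for _ in range(200)]   # one response per person

trials = 1000
llm_pool_10   = sum(pooled_best(llm_scores, 10)   for _ in range(trials)) / trials
human_pool_10 = sum(pooled_best(human_scores, 10) for _ in range(trials)) / trials
print(f"LLM queried 10x ≈ {llm_pool_10:.2f}  vs  group of 10 humans ≈ {human_pool_10:.2f}")
```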
Furthermore, the paper highlights that the diversity within and between LLM responses remains inferior to that of human participants. LLM-generated outputs tend to have less variance, which could hinder diverse creative ideation, a crucial element in fields requiring high variability in output.
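One common way to quantify such response diversity, not necessarily the measure used in the paper, is the mean pairwise cosine distance between response embeddings. The sketch below uses synthetic embeddings in which LLM responses are assumed to cluster more tightly than human ones, purely to illustrate the comparison.

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings):
    """Average cosine distance over all pairs of response embeddings,
    a common proxy for the semantic diversity of a response set."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    iu = np.triu_indices(len(X), k=1)
    return float(np.mean(1.0 - sims[iu]))

rng = np.random.default_rng(0)
# Synthetic embeddings: the LLM set is assumed to have lower spread (an assumption for illustration).
llm_emb   = rng.normal(0, 0.3, size=(50, 128)) + 1.0
human_emb = rng.normal(0, 0.8, size=(50, 128)) + 1.0

print("LLM diversity:  ", mean_pairwise_cosine_distance(llm_emb))
print("Human diversity:", mean_pairwise_cosine_distance(human_emb))
```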
The implications of this research are substantial, not only for benchmarking LLMs but also for informing practical applications in industries that rely on creative processes. As the technology advances, LLMs could take on roles that currently require small teams of humans, provided the tasks align with the models' strengths, such as generating novel and useful ideas in problem solving.
Future research, as suggested by the authors, is likely to explore the integration of LLMs within multi-agent systems to simulate interactive creativity, which could enhance the utility of LLM outputs in collective creative tasks. Additionally, the potential to refine LLM training to improve creative writing capabilities remains a critical area for further exploration.
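As a rough illustration of what such a multi-agent setup might look like (purely hypothetical; `generate` below is a stub standing in for any actual LLM API), each agent could build on the running pool of ideas produced by the others:

```python
from typing import List

def generate(prompt: str, agent_id: int) -> str:
    """Stub standing in for a real LLM call; replace with an actual API."""
    return f"[agent {agent_id}] idea building on: {prompt[-60:]}"

def brainstorm(task: str, n_agents: int = 3, rounds: int = 2) -> List[str]:
    """Round-robin ideation: each agent sees the most recent ideas in its prompt."""
    ideas: List[str] = []
    for _ in range(rounds):
        for agent in range(n_agents):
            context = task + " | prior ideas: " + "; ".join(ideas[-n_agents:])
            ideas.append(generate(context, agent))
    return ideas

for idea in brainstorm("Suggest unusual uses for a brick"):
    print(idea)
```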
In conclusion, this paper provides a nuanced view of LLM creativity, calling for more refined models or hybrid human-AI collaborations to optimize creative tasks in various occupational environments, while highlighting the current limitations that constrain their application, particularly in domains demanding high creativity and originality.