The paper "Prompting Diverse Ideas: Increasing AI Idea Variance" investigates methods to enhance the diversity of ideas generated by LLM-based systems, focusing on GPT-4 in the context of product innovation for college students priced under $50. The paper addresses three hypotheses:
- H1: AI-generated idea pools without specialized prompting exhibit lower diversity than human-generated pools.
- H2: Prompt engineering significantly improves AI idea diversity.
- H3: Chain-of-Thought (CoT) prompting outperforms other strategies in maximizing diversity.
Key Findings
- Baseline Diversity Comparison:
  - Human-generated ideas (aggregated from 100 MBA students) achieved an average pairwise cosine similarity of 0.243; lower similarity indicates higher diversity.
  - GPT-4 idea pools ranged from 0.255 to 0.432 across the tested prompting strategies, with the base prompt scoring 0.377.
- Prompt Engineering Efficacy:
  - CoT Prompting yielded the lowest cosine similarity (0.255), approaching human-level diversity. It increased the estimated number of unique ideas from ~3,700 (base prompt) to ~4,700.
  - Persona-Based Prompts (e.g., "Steve Jobs") achieved moderate improvements (cosine similarity: 0.368), while creativity frameworks such as the Harvard Business Review methodology underperformed (0.387).
  - Hybrid strategies combining multiple prompts produced cross-strategy cosine similarities of 0.2–0.6; the low overlap between many strategy pairs suggests complementary idea spaces.
- Exhaustion Dynamics:
  - CoT maintained lower similarity scores than the base prompt until ~750 ideas, after which the two strategies converged as the idea space became depleted.
Methodology
- Metrics:
  - Cosine Similarity: Computed using Google’s Universal Sentence Encoder embeddings (a computation sketch follows this section).
  - Unique Idea Count: Estimated via a mark-recapture model:
    - $u = \frac{1}{a}\left(1 - e^{-aN}\right)$, where $u$ is the expected number of unique ideas, $N$ is the total number of ideas generated, and $a$ is the discovery rate parameter (a fitting sketch follows this section).
  - Speed of Exhaustion: Tracked via exponentially smoothed similarity scores over sequential idea generation (see the smoothing sketch below).
- Experimental Setup:
  - 35 prompting strategies tested across 10 sessions (1,000 ideas per strategy).
  - Temperature = 0.7, top-p = 1.0, no frequency/presence penalties (a generation sketch follows this section).
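The paper does not publish its evaluation code; below is a minimal sketch of the diversity metric, assuming the public TensorFlow Hub release of the Universal Sentence Encoder:

```python
import numpy as np
import tensorflow_hub as hub

# Universal Sentence Encoder: maps each idea string to a 512-dim vector.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def mean_pairwise_cosine(ideas: list[str]) -> float:
    """Average cosine similarity over all unordered idea pairs.
    Lower values indicate a more diverse idea pool."""
    vecs = embed(ideas).numpy()
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = vecs @ vecs.T                                 # all pairwise cosines
    upper = np.triu_indices(len(ideas), k=1)             # skip diagonal/duplicates
    return float(sims[upper].mean())
```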
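Because $u \to 1/a$ as $N \to \infty$, the fitted $1/a$ is the implied size of the reachable idea space (the ~3,700 and ~4,700 figures above). A fitting sketch with hypothetical checkpoint counts:

```python
import numpy as np
from scipy.optimize import curve_fit

def expected_unique(N, a):
    """Mark-recapture saturation curve: expected unique ideas after N draws."""
    return (1.0 / a) * (1.0 - np.exp(-a * N))

# Hypothetical cumulative unique-idea counts at generation checkpoints.
N_obs = np.array([100.0, 250.0, 500.0, 750.0, 1000.0])
u_obs = np.array([97.0, 235.0, 440.0, 610.0, 760.0])

(a_hat,), _ = curve_fit(expected_unique, N_obs, u_obs, p0=[1e-4])
print(f"a = {a_hat:.2e}; implied idea-space size 1/a ≈ {1.0 / a_hat:,.0f}")
```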
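Exhaustion speed can be tracked by scoring each new idea against the pool generated before it and smoothing the resulting sequence; a sketch (the smoothing constant `alpha` is an assumption, not taken from the paper):

```python
import numpy as np

def smoothed_exhaustion(vecs: np.ndarray, alpha: float = 0.1) -> list[float]:
    """vecs: unit-normalized idea embeddings in generation order.
    Returns the exponentially smoothed mean cosine similarity of each
    new idea to all earlier ideas; a rising curve signals depletion."""
    smoothed, ema = [], None
    for i in range(1, len(vecs)):
        score = float((vecs[:i] @ vecs[i]).mean())  # similarity to prior pool
        ema = score if ema is None else alpha * score + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed
```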
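Finally, the generation setup itself; a sketch using the reported sampling parameters and the `openai` Python client, where the prompt text is a hypothetical illustration of a CoT-style instruction, not the paper's actual prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical CoT-style prompt; the paper's exact wording is not reproduced.
COT_PROMPT = (
    "Generate 10 ideas for new products for college students that would "
    "retail for less than $50. First, reason step by step about distinct "
    "customer needs and product categories; then list the ideas, making "
    "each one as different as possible from the rest."
)

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.7,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    messages=[{"role": "user", "content": COT_PROMPT}],
)
print(response.choices[0].message.content)
```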
Limitations
- Cosine Similarity: May not fully align with human judgments of novelty.
- Domain Specificity: Results are tied to consumer products for college students; generalizability to other domains (e.g., B2B innovation) remains untested.
- Quality-Diversity Tradeoff: High diversity does not guarantee feasibility or market viability.
Implications
- Practical Applications: CoT and hybrid prompting are recommended for ideation tasks requiring exploration of rugged solution landscapes.
- Theoretical Contribution: Demonstrates that LLM output diversity is highly sensitive to prompt design, validating the "explore-exploit" framework in AI-aided innovation.
The paper underscores the potential of LLM systems to augment human creativity when guided by structured prompting strategies, though human groups retain a marginal advantage in raw diversity. Future work should validate these findings with alternative LLM architectures and real-world idea selection processes.