LiveIdeaBench: LLM Ideation Benchmark

Updated 8 July 2025
  • LiveIdeaBench is a comprehensive benchmark suite that assesses large language models' creative and evaluative capabilities in scientific ideation and professional design.
  • It incorporates diverse evaluation methods, including minimal input prompting, dynamic panel assessments, and embedding-based metrics for quantitative idea analysis.
  • Its automated, multi-dimensional scoring system delivers measurable, objective insights intended to guide improvements in research and design.

LiveIdeaBench is a comprehensive benchmark suite and evaluation methodology designed to assess and advance the capabilities of LLMs and generative systems in scientific and professional idea generation. It spans frameworks for divergent scientific ideation, professional design evaluation, automated idea assessment, and interactive creativity tools, with distinct implementations catering to various facets of the ideation and evaluation process.

1. Rationale and Scope

LiveIdeaBench addresses the growing need to systematically evaluate and enhance the creative, divergent, and evaluative powers of LLMs and generative models in domains such as science and professional design. While existing benchmarks typically assess LLMs with rich, context-heavy prompts or focus on problem-solving accuracy, LiveIdeaBench introduces a set of tasks and metrics explicitly aimed at measuring divergent thinking, creativity, and the quality of idea generation with minimal or professionally structured input (Ruan et al., 23 Dec 2024, Liang et al., 16 Dec 2024). It integrates methodologies for both qualitative and quantitative assessment, supports automation at scale, and is grounded in both established creativity theory and new mathematical frameworks.

2. Benchmark Design and Methodology

LiveIdeaBench implements multiple, specialized evaluation pipelines reflecting different aspects of idea generation and assessment:

2.1 Divergent Scientific Ideation with Minimal Context

  • Minimal Input Prompting: LiveIdeaBench prompts LLMs with single scientific keywords—eschewing rich prompting—to elicit ideas that reveal models’ raw creative capacity. This approach is inspired by Guilford’s theory of divergent thinking, emphasizing the capacity to generate multiple, varied, and novel ideas from sparse cues (Ruan et al., 23 Dec 2024).
  • Dynamic Panel Assessment: Generated ideas are automatically evaluated by a rotating panel of top-performing LLMs, selected such that the evaluated model never serves as its own judge (a sketch of this protocol follows this list).
  • Experimental Breadth: The latest iteration covers 1,180 high-impact scientific keywords spanning 18 scientific domains (22 in some configurations), with over 40 leading models used for both generation and assessment.
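
The protocol above can be illustrated with a minimal sketch. The `generate` and `judge` callables are hypothetical wrappers around model APIs and the scoring dimensions are illustrative; the sketch only captures the single-keyword prompting and the rule that a model never judges its own output.

```python
# Minimal sketch of single-keyword prompting with a rotating judge panel.
# `generate` and `judge` are hypothetical caller-supplied wrappers around
# model APIs; they are not part of the LiveIdeaBench release.
import random
from statistics import mean
from typing import Callable

def evaluate_keyword(
    test_model: str,
    keyword: str,
    judge_pool: list[str],
    generate: Callable[[str, str], str],     # (model, prompt) -> idea text
    judge: Callable[[str, str, str], dict],  # (judge model, idea, keyword) -> scores
    panel_size: int = 3,
) -> dict:
    # Single-keyword prompt: no additional context, per the minimal-input protocol.
    idea = generate(test_model, f"Propose a novel scientific idea about: {keyword}")

    # Rotating panel: the evaluated model never serves as its own judge.
    eligible = [m for m in judge_pool if m != test_model]
    panel = random.sample(eligible, k=min(panel_size, len(eligible)))

    # Average each scoring dimension over the panel's judgments.
    verdicts = [judge(j, idea, keyword) for j in panel]
    return {dim: mean(v[dim] for v in verdicts) for dim in ("originality", "feasibility")}
```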

2.2 Professional Design and Visual Generative Tasks

  • IDEA-Bench Component: Focuses on the assessment of generative models’ ability to execute complex, professional design tasks—such as storyboarding, font generation, and image retouching—drawn from 100 real-world, professionally curated challenges. Each task can involve text, multiple reference images, or multimodal combinations (Liang et al., 16 Dec 2024).
  • Input-Output Modalities: Supports text-to-image, image-to-image, images-to-image, text-to-images, and combined image(s)-to-images inputs and outputs. Tasks are defined with lengthy and detailed prompts to mirror authentic professional standards.
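
For exposition, a task record of this kind might be represented as below; the field names and the modality derivation are assumptions for illustration, not the benchmark's released schema.

```python
# Illustrative record for an IDEA-Bench-style task; field names and the
# modality derivation are assumptions for exposition, not the benchmark schema.
from dataclasses import dataclass, field

@dataclass
class DesignTask:
    task_id: str
    category: str                    # e.g. "storyboarding", "font generation"
    prompt: str                      # lengthy, professionally written instruction
    reference_images: list[str] = field(default_factory=list)  # may be empty
    expected_outputs: int = 1        # 1 for *-to-image, >1 for *-to-images tasks

    @property
    def modality(self) -> str:
        # The textual prompt is always present; the label reflects the number of
        # reference images and expected output images.
        inp = ("text" if not self.reference_images
               else "image" if len(self.reference_images) == 1 else "images")
        out = "image" if self.expected_outputs == 1 else "images"
        return f"{inp}-to-{out}"
```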

2.3 Objective Idea Characterization

  • High-Dimensional Embedding and Quantitative Diversity Metrics: Implements a mathematical framework that maps ideas (text statements or design sketches) to high-dimensional vector spaces via LLM-derived embeddings (e.g., TE3 with 3072 dimensions). Objective measures of idea diversity—including distribution (via UMAP/DBSCAN clustering) and dispersion (via PCA eigenvalue analysis)—enable quantitative assessment and selection (Sankar et al., 11 Sep 2024).
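
A minimal sketch of such a pipeline follows, assuming an embedding step has already produced an (n_ideas × dim) matrix; the UMAP/DBSCAN parameters and the number of retained principal components are illustrative, not the values used in the cited work.

```python
# Sketch of the embedding-based idea characterization: UMAP projection and
# DBSCAN clustering for distribution, PCA eigenvalues for dispersion.
# `embeddings` is an (n_ideas, dim) array from an LLM embedding model
# (e.g. 3072-dimensional); all parameter values here are illustrative.
import numpy as np
import umap  # umap-learn
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def characterize_ideas(embeddings: np.ndarray) -> dict:
    # Distribution: project to 2-D, then cluster (label -1 marks noise points).
    coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(coords)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    # Dispersion: eigenvalues (explained variance) of the leading principal axes.
    pca = PCA(n_components=min(10, *embeddings.shape)).fit(embeddings)

    return {
        "coords": coords,
        "labels": labels,
        "n_clusters": n_clusters,
        "pca_eigenvalues": pca.explained_variance_,
    }
```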

2.4 Automated Scientific Idea Valuation

  • Representation-Based Assessment: Employs LLM hidden layer representations and downstream regressors (MLPs) trained on human-annotated datasets of scientific papers (e.g., ICLR reviews) to predict idea value or paper quality. This is shown to better capture idea merit than generative or summary-based evaluations (Xu et al., 7 Sep 2024).
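
The representation step can be sketched as follows; the backbone model, layer choice, and mean pooling are assumptions for illustration rather than the configuration used in the cited work.

```python
# Sketch of extracting a hidden-layer representation Rep(d_i) for a document.
# The backbone ("gpt2"), layer index, and mean pooling are illustrative
# assumptions, not the configuration of the cited work.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder backbone
backbone = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def document_representation(text: str, layer: int = -1) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = backbone(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)                  # mean-pool over tokens -> (dim,)
```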

3. Evaluation Metrics and Scoring Systems

LiveIdeaBench introduces multi-dimensional scoring to address the complex, multi-faceted nature of creativity and practical idea value.

3.1 Scientific Creativity Benchmark (LiveIdeaBench Proper)

  • Originality: Assessed as the mean judgment of multiple LLM "critics" on each generated idea, quantifying uniqueness and novelty relative to existing scientific discourse.
  • Feasibility: Reflects the degree to which ideas are practically and scientifically sound.
  • Fluency: Captures the diversity of ideas generated from the same prompt, utilizing a letter-grade to 10-point mapping (A=10.00 for high heterogeneity, D=0.00 for total overlap).
  • Flexibility: Measures domain consistency by taking the 30th percentile of the mean originality/feasibility/fluency scores across all tested keywords (see the sketch after this list).
  • Clarity: While not a standalone metric, clarity is operationalized via succinctness and coherence constraints (target 100 words).
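
A compact sketch of the fluency mapping and the flexibility aggregate follows; the intermediate letter-grade values (B, C) are interpolated here as an assumption, since the source specifies only A=10.00 and D=0.00.

```python
# Sketch of the fluency letter-grade mapping and the flexibility aggregate.
# Intermediate grade values (B, C) are interpolated assumptions; the source
# specifies A = 10.00 (high heterogeneity) and D = 0.00 (total overlap).
import numpy as np

FLUENCY_GRADES = {"A": 10.00, "B": 6.67, "C": 3.33, "D": 0.00}

def fluency_score(grade: str) -> float:
    return FLUENCY_GRADES[grade]

def flexibility(per_keyword_means: list[float]) -> float:
    # 30th percentile of the mean originality/feasibility/fluency score per keyword.
    return float(np.percentile(per_keyword_means, 30))
```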

3.2 Professional Design Benchmark (IDEA-Bench)

  • Hierarchical Binary Judgments: For each design case, six binary questions are organized into three evaluation levels; failure at a fundamental level zeroes out all higher levels. Scoring is averaged per subtask, per category, and then globally.
  • Sample notation: For subtask scores,

$$\text{Score}_{\text{subtask}} = \frac{1}{6} \sum_{i=1}^{6} S_{t,i}$$

where $S_{t,i}$ is the binary score for the $i$-th question in task $t$ (Liang et al., 16 Dec 2024).
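
The gating rule and averaging can be sketched as follows; the 2/2/2 grouping of the six questions into three levels is an assumption for illustration, while the zeroing rule and the 1/6 average follow the description above.

```python
# Sketch of the hierarchical gating and subtask averaging. The 2/2/2 grouping
# of the six questions into three levels is an assumption; the zeroing rule
# and the 1/6 average follow the description above.
def subtask_score(answers: list[bool], levels: tuple[int, ...] = (2, 2, 2)) -> float:
    assert len(answers) == sum(levels) == 6
    gated: list[bool] = []
    failed = False
    start = 0
    for size in levels:
        block = answers[start:start + size]
        # A failure at a more fundamental level zeroes out all higher levels.
        gated.extend([False] * size if failed else block)
        if not all(block):
            failed = True
        start += size
    return sum(gated) / 6.0  # Score_subtask = (1/6) * sum_i S_{t,i}
```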

3.3 Idea Diversity and Distribution

  • Cluster Sparsity (CS) and Idea Sparsity (IS) define, respectively, how well-separated clusters of ideas are, and the within-cluster spread, based on UMAP/DBSCAN and convex hull area metrics:

$$\text{Cluster Sparsity} = 1 - \frac{\sum_{i=1}^{N_c} A_i}{A_t}$$

$$\text{Idea Sparsity} = \frac{A_c}{N_i} \cdot \exp\!\left(-\frac{A_c}{N_i}\right)$$

where $A_i$ and $A_c$ are convex hull areas, $N_c$ is the number of clusters, $N_i$ the number of ideas, and $A_t$ the total idea-space area (Sankar et al., 11 Sep 2024).
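
These quantities can be computed from 2-D projected coordinates and cluster labels as sketched below; treating the convex hull of all points as the total area $A_t$, and computing Idea Sparsity per cluster, are assumptions of this sketch.

```python
# Sketch of Cluster Sparsity and Idea Sparsity from the formulas above, using
# 2-D projected coordinates and per-cluster convex hulls. Treating the hull of
# all points as A_t, and computing IS per cluster, are assumptions here.
import numpy as np
from scipy.spatial import ConvexHull

def cluster_sparsity(coords: np.ndarray, labels: np.ndarray) -> float:
    total_area = ConvexHull(coords).volume  # in 2-D, .volume is the enclosed area
    cluster_areas = [ConvexHull(coords[labels == c]).volume
                     for c in set(labels)
                     if c != -1 and np.sum(labels == c) >= 3]
    return 1.0 - sum(cluster_areas) / total_area           # CS = 1 - sum(A_i) / A_t

def idea_sparsity(cluster_coords: np.ndarray) -> float:
    ratio = ConvexHull(cluster_coords).volume / len(cluster_coords)  # A_c / N_i
    return ratio * np.exp(-ratio)                           # IS = (A_c/N_i) exp(-A_c/N_i)
```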

3.4 Automated Paper Scoring

  • LLM Representation Regression: Scores are predicted as $s_i = A_c(\mathrm{Rep}(d_i))$, where $\mathrm{Rep}(d_i)$ is the deep representation of manuscript $d_i$, with the loss minimized as:

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (\hat{s}_i - s_i)^2$$

(Xu et al., 7 Sep 2024).
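
As a concrete instance of this objective, the sketch below trains a small regression head on pre-computed representations with the mean-squared-error loss; the head architecture and optimizer settings are illustrative assumptions.

```python
# Sketch of training the regression head A_c with the MSE objective above,
# assuming representations Rep(d_i) are pre-computed. The head architecture
# and optimizer settings are illustrative assumptions.
import torch
from torch import nn

def train_regressor(reps: torch.Tensor, scores: torch.Tensor, epochs: int = 200) -> nn.Module:
    # reps: (n, dim) document representations; scores: (n,) human review scores.
    head = nn.Sequential(nn.Linear(reps.shape[1], 256), nn.ReLU(), nn.Linear(256, 1))
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()  # L = (1/n) * sum_i (s_hat_i - s_i)^2
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(reps).squeeze(-1), scores)
        loss.backward()
        optimizer.step()
    return head
```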

4. Key Results and Comparative Findings

  • Scientific idea generation quality as measured by LiveIdeaBench is not strongly predicted by standard general intelligence or problem-solving scores. For instance, QwQ-32B-preview demonstrates creative performance nearly matching the strongest proprietary models despite significant gaps in general AI benchmarks (Ruan et al., 23 Dec 2024).
  • Professional-grade design tasks remain challenging: the best generative model (FLUX-1) achieves an overall score of 22.48, far below professional standards, and generalist models typically score in the single digits (Liang et al., 16 Dec 2024).
  • Embedding-based quantitative assessments align well with expert designer judgments: clusters defined via DBSCAN-UMAP on idea vectors correspond strongly to expert similarity labels, indicating that computational metrics capture much of the semantic variance perceived by humans (Sankar et al., 11 Sep 2024).
  • Representation-based evaluators trained on LLM hidden states can predict peer-review scores for scientific papers with high fidelity (over 86% of scores within two points of human annotators), outperforming generation-based approaches and static benchmarks like SciBERT (Xu et al., 7 Sep 2024).

5. Implementation, Automation, and Resources

  • Dynamic Automation: The evaluation process is automated at scale. In scientific creativity evaluation, a rotating panel of LLMs—periodically updated—serves as independent critics, and the scientific keyword pool is regularly refreshed from real-time analytics to ensure contemporary relevance (Ruan et al., 23 Dec 2024).
  • Auto-Evaluation in Design: IDEA-Bench leverages multimodal LLMs (e.g., Gemini 1.5 Pro) to automatically rephrase, judge, and score generated images against defined criteria. Each result is scored three times and then averaged for objectivity, as sketched after this list (Liang et al., 16 Dec 2024).
  • Open Datasets and Toolkits: Benchmarks, task datasets, evaluation toolkits, and leaderboards supporting reproducible evaluation and comparison are made available through open repositories such as https://github.com/ali-vilab/IDEA-Bench (Liang et al., 16 Dec 2024).
  • Mathematical Tool Integration: The use of PCA, UMAP, and DBSCAN in idea space analysis allows for efficient and interpretable automation of idea diversity selection, reducing reliance on expert judgment and augmenting novice designers' decision-making (Sankar et al., 11 Sep 2024).
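
The repeated-scoring step of the auto-evaluation can be sketched as follows; `judge_image` stands in for a call to a multimodal judge model and is an assumption of this sketch, not the released evaluation harness.

```python
# Sketch of the repeated-scoring step: each generated image is judged three
# times against the task criteria and the scores are averaged. `judge_image`
# stands in for a multimodal-LLM call and is an assumption of this sketch.
from statistics import mean
from typing import Callable

def auto_score(image_path: str, criteria: list[str],
               judge_image: Callable[[str, list[str]], float],
               repeats: int = 3) -> float:
    # Repeat the judgment to smooth sampling variance, then average.
    return mean(judge_image(image_path, criteria) for _ in range(repeats))
```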

6. Implications, Challenges, and Future Directions

  • LLM Training and Benchmarking: LiveIdeaBench demonstrates that divergent thinking and scientific creativity constitute a separate axis of model evaluation, distinct from general intelligence, reasoning, or coding skills. Improving LLM creative performance may require targeted datasets and objectives, beyond those for general problem-solving (Ruan et al., 23 Dec 2024).
  • Evaluation Limitations: Current benchmarks are domain-specific (e.g., focusing primarily on computer science papers for automated reviewer tasks, (Xu et al., 7 Sep 2024)) and may require adaptation for broader disciplinary coverage.
  • Professional Design Gap: Contemporary generative models remain markedly short of professional designer standards, especially in multimodal, multi-image, and style- or identity-preservation tasks (Liang et al., 16 Dec 2024).
  • Ethical and Practical Balancing: There is an emerging need for evaluation systems that enforce ethical constraints (preventing harmful idea generation) without unnecessarily restricting creative output (Ruan et al., 23 Dec 2024).
  • Adaptivity and Relevance: The benchmarks' design, with dynamic keyword and judge-panel updates and open-ended but structured scoring frameworks, keeps them current and useful for ongoing model development and scientific research support.

7. Broader Impact and Connections

LiveIdeaBench provides a comprehensive, empirically grounded, and open framework for assessing, comparing, and driving forward LLM development in creative, scientific, and professional settings. It influences both AI research and practical tool-building for scientific discovery, design ideation, and objective peer review, and is complemented by related efforts such as IdeaBench (Guo et al., 31 Oct 2024) and hybrid mathematical-evaluation frameworks (Sankar et al., 11 Sep 2024, Xu et al., 7 Sep 2024). Its automated multidimensional assessment approach is poised to facilitate advances in AI-driven hypothesis generation, interdisciplinary exploration, and next-generation intelligent design systems.