LiveIdeaBench: LLM Ideation Benchmark

Updated 8 July 2025
  • LiveIdeaBench is a comprehensive benchmark suite that assesses large language models' creative and evaluative capabilities in scientific ideation and professional design.
  • It incorporates diverse evaluation methods, including minimal input prompting, dynamic panel assessments, and embedding-based metrics for quantitative idea analysis.
  • Its automated, multi-dimensional scoring system delivers measurable, objective insights intended to guide improvements in research and design.

LiveIdeaBench is a comprehensive benchmark suite and evaluation methodology designed to assess and advance the capabilities of LLMs and generative systems in scientific and professional idea generation. It spans frameworks for divergent scientific ideation, professional design evaluation, automated idea assessment, and interactive creativity tools, with distinct implementations catering to various facets of the ideation and evaluation process.

1. Rationale and Scope

LiveIdeaBench addresses the growing need to systematically evaluate and enhance the creative, divergent, and evaluative powers of LLMs and generative models in domains such as science and professional design. While existing benchmarks typically assess LLMs with rich, context-heavy prompts or focus on problem-solving accuracy, LiveIdeaBench introduces a set of tasks and metrics explicitly aimed at measuring divergent thinking, creativity, and the quality of idea generation with minimal or professionally structured input (Ruan et al., 23 Dec 2024, Liang et al., 16 Dec 2024). It integrates methodologies for both qualitative and quantitative assessment, supports automation at scale, and is grounded in both established creativity theory and new mathematical frameworks.

2. Benchmark Design and Methodology

LiveIdeaBench implements multiple, specialized evaluation pipelines reflecting different aspects of idea generation and assessment:

2.1 Divergent Scientific Ideation with Minimal Context

  • Minimal Input Prompting: LiveIdeaBench prompts LLMs with single scientific keywords—eschewing rich prompting—to elicit ideas that reveal models’ raw creative capacity. This approach is inspired by Guilford’s theory of divergent thinking, emphasizing the capacity to generate multiple, varied, and novel ideas from sparse cues (Ruan et al., 23 Dec 2024).
  • Dynamic Panel Assessment: Generated ideas are automatically evaluated by a rotating panel of top-performing LLMs, selected such that the evaluated model never serves as its own judge (a sketch of this protocol follows this list).
  • Experimental Breadth: The latest iteration covers 1,180 high-impact scientific keywords spanning 18 scientific domains (22 in some configurations), with over 40 leading models used for both generation and assessment.
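
The protocol above can be illustrated with a minimal sketch. The `generate` and `judge` callables are hypothetical wrappers around model APIs and the scoring dimensions are illustrative; the sketch only captures the single-keyword prompting and the rule that a model never judges its own output.

```python
# Minimal sketch of single-keyword prompting with a rotating judge panel.
# `generate` and `judge` are hypothetical caller-supplied wrappers around
# model APIs; they are not part of the LiveIdeaBench release.
import random
from statistics import mean
from typing import Callable

def evaluate_keyword(
    test_model: str,
    keyword: str,
    judge_pool: list[str],
    generate: Callable[[str, str], str],     # (model, prompt) -> idea text
    judge: Callable[[str, str, str], dict],  # (judge model, idea, keyword) -> scores
    panel_size: int = 3,
) -> dict:
    # Single-keyword prompt: no additional context, per the minimal-input protocol.
    idea = generate(test_model, f"Propose a novel scientific idea about: {keyword}")

    # Rotating panel: the evaluated model never serves as its own judge.
    eligible = [m for m in judge_pool if m != test_model]
    panel = random.sample(eligible, k=min(panel_size, len(eligible)))

    # Average each scoring dimension over the panel's judgments.
    verdicts = [judge(j, idea, keyword) for j in panel]
    return {dim: mean(v[dim] for v in verdicts) for dim in ("originality", "feasibility")}
```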

2.2 Professional Design and Visual Generative Tasks

  • IDEA-Bench Component: Focuses on the assessment of generative models’ ability to execute complex, professional design tasks—such as storyboarding, font generation, and image retouching—drawn from 100 real-world, professionally curated challenges. Each task can involve text, multiple reference images, or multimodal combinations (Liang et al., 16 Dec 2024).
  • Input-Output Modalities: Supports text-to-image, image-to-image, images-to-image, text-to-images, and combined image(s)-to-images inputs and outputs. Tasks are defined with lengthy and detailed prompts to mirror authentic professional standards.
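
For exposition, a task record of this kind might be represented as below; the field names and the modality derivation are assumptions for illustration, not the benchmark's released schema.

```python
# Illustrative record for an IDEA-Bench-style task; field names and the
# modality derivation are assumptions for exposition, not the benchmark schema.
from dataclasses import dataclass, field

@dataclass
class DesignTask:
    task_id: str
    category: str                    # e.g. "storyboarding", "font generation"
    prompt: str                      # lengthy, professionally written instruction
    reference_images: list[str] = field(default_factory=list)  # may be empty
    expected_outputs: int = 1        # 1 for *-to-image, >1 for *-to-images tasks

    @property
    def modality(self) -> str:
        # The textual prompt is always present; the label reflects the number of
        # reference images and expected output images.
        inp = ("text" if not self.reference_images
               else "image" if len(self.reference_images) == 1 else "images")
        out = "image" if self.expected_outputs == 1 else "images"
        return f"{inp}-to-{out}"
```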

2.3 Objective Idea Characterization

  • High-Dimensional Embedding and Quantitative Diversity Metrics: Implements a mathematical framework that maps ideas (text statements or design sketches) to high-dimensional vector spaces via LLM-derived embeddings (e.g., TE3 with 3072 dimensions). Objective measures of idea diversity—including distribution (via UMAP/DBSCAN clustering) and dispersion (via PCA eigenvalue analysis)—enable quantitative assessment and selection (Sankar et al., 11 Sep 2024).
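
A minimal sketch of such a pipeline follows, assuming an embedding step has already produced an (n_ideas × dim) matrix; the UMAP/DBSCAN parameters and the number of retained principal components are illustrative, not the values used in the cited work.

```python
# Sketch of the embedding-based idea characterization: UMAP projection and
# DBSCAN clustering for distribution, PCA eigenvalues for dispersion.
# `embeddings` is an (n_ideas, dim) array from an LLM embedding model
# (e.g. 3072-dimensional); all parameter values here are illustrative.
import numpy as np
import umap  # umap-learn
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def characterize_ideas(embeddings: np.ndarray) -> dict:
    # Distribution: project to 2-D, then cluster (label -1 marks noise points).
    coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(coords)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    # Dispersion: eigenvalues (explained variance) of the leading principal axes.
    pca = PCA(n_components=min(10, *embeddings.shape)).fit(embeddings)

    return {
        "coords": coords,
        "labels": labels,
        "n_clusters": n_clusters,
        "pca_eigenvalues": pca.explained_variance_,
    }
```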

2.4 Automated Scientific Idea Valuation

  • Representation-Based Assessment: Employs LLM hidden layer representations and downstream regressors (MLPs) trained on human-annotated datasets of scientific papers (e.g., ICLR reviews) to predict idea value or paper quality. This is shown to better capture idea merit than generative or summary-based evaluations (Xu et al., 7 Sep 2024).
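
The representation step can be sketched as follows; the backbone model, layer choice, and mean pooling are assumptions for illustration rather than the configuration used in the cited work.

```python
# Sketch of extracting a hidden-layer representation Rep(d_i) for a document.
# The backbone ("gpt2"), layer index, and mean pooling are illustrative
# assumptions, not the configuration of the cited work.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder backbone
backbone = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def document_representation(text: str, layer: int = -1) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = backbone(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)                  # mean-pool over tokens -> (dim,)
```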

3. Evaluation Metrics and Scoring Systems

LiveIdeaBench introduces multi-dimensional scoring to address the complex, multi-faceted nature of creativity and practical idea value.

3.1 Scientific Creativity Benchmark (LiveIdeaBench Proper)

  • Originality: Assessed as the mean judgment of multiple LLM "critics" on each generated idea, quantifying uniqueness and novelty relative to existing scientific discourse.
  • Feasibility: Reflects the degree to which ideas are practically and scientifically sound.
  • Fluency: Captures the diversity of ideas generated from the same prompt, utilizing a letter-grade to 10-point mapping (A=10.00 for high heterogeneity, D=0.00 for total overlap).
  • Flexibility: Measures domain consistency by taking the 30th percentile of the mean originality/feasibility/fluency scores across all tested keywords (see the sketch after this list).
  • Clarity: While not a standalone metric, clarity is operationalized via succinctness and coherence constraints (target 100 words).
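
A compact sketch of the fluency mapping and the flexibility aggregate follows; the intermediate letter-grade values (B, C) are interpolated here as an assumption, since the source specifies only A=10.00 and D=0.00.

```python
# Sketch of the fluency letter-grade mapping and the flexibility aggregate.
# Intermediate grade values (B, C) are interpolated assumptions; the source
# specifies A = 10.00 (high heterogeneity) and D = 0.00 (total overlap).
import numpy as np

FLUENCY_GRADES = {"A": 10.00, "B": 6.67, "C": 3.33, "D": 0.00}

def fluency_score(grade: str) -> float:
    return FLUENCY_GRADES[grade]

def flexibility(per_keyword_means: list[float]) -> float:
    # 30th percentile of the mean originality/feasibility/fluency score per keyword.
    return float(np.percentile(per_keyword_means, 30))
```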

3.2 Professional Design Benchmark (IDEA-Bench)

  • Hierarchical Binary Judgments: For each design case, six binary questions are organized into three evaluation levels; failure at a fundamental level zeroes out all higher levels. Scoring is averaged per subtask, per category, and then globally.
  • Sample notation: For subtask scores,

$$\text{Score}_{\text{subtask}} = \frac{1}{6} \sum_{i=1}^{6} S_{t,i}$$

where $S_{t,i}$ is the binary score for the $i$-th question in task $t$ (Liang et al., 16 Dec 2024).
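
The gating rule and averaging can be sketched as follows; the 2/2/2 grouping of the six questions into three levels is an assumption for illustration, while the zeroing rule and the 1/6 average follow the description above.

```python
# Sketch of the hierarchical gating and subtask averaging. The 2/2/2 grouping
# of the six questions into three levels is an assumption; the zeroing rule
# and the 1/6 average follow the description above.
def subtask_score(answers: list[bool], levels: tuple[int, ...] = (2, 2, 2)) -> float:
    assert len(answers) == sum(levels) == 6
    gated: list[bool] = []
    failed = False
    start = 0
    for size in levels:
        block = answers[start:start + size]
        # A failure at a more fundamental level zeroes out all higher levels.
        gated.extend([False] * size if failed else block)
        if not all(block):
            failed = True
        start += size
    return sum(gated) / 6.0  # Score_subtask = (1/6) * sum_i S_{t,i}
```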

3.3 Idea Diversity and Distribution

  • Cluster Sparsity (CS) and Idea Sparsity (IS) define, respectively, how well-separated clusters of ideas are, and the within-cluster spread, based on UMAP/DBSCAN and convex hull area metrics:

$$\text{Cluster Sparsity} = 1 - \frac{\sum_{i=1}^{N_c} A_i}{A_t}$$

$$\text{Idea Sparsity} = \frac{A_c}{N_i} \cdot \exp\!\left(-\frac{A_c}{N_i}\right)$$

where $A_i$ and $A_c$ are convex hull areas, $N_c$ is the number of clusters, $N_i$ the number of ideas, and $A_t$ the total idea-space area (Sankar et al., 11 Sep 2024).
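
These quantities can be computed from 2-D projected coordinates and cluster labels as sketched below; treating the convex hull of all points as the total area $A_t$, and computing Idea Sparsity per cluster, are assumptions of this sketch.

```python
# Sketch of Cluster Sparsity and Idea Sparsity from the formulas above, using
# 2-D projected coordinates and per-cluster convex hulls. Treating the hull of
# all points as A_t, and computing IS per cluster, are assumptions here.
import numpy as np
from scipy.spatial import ConvexHull

def cluster_sparsity(coords: np.ndarray, labels: np.ndarray) -> float:
    total_area = ConvexHull(coords).volume  # in 2-D, .volume is the enclosed area
    cluster_areas = [ConvexHull(coords[labels == c]).volume
                     for c in set(labels)
                     if c != -1 and np.sum(labels == c) >= 3]
    return 1.0 - sum(cluster_areas) / total_area           # CS = 1 - sum(A_i) / A_t

def idea_sparsity(cluster_coords: np.ndarray) -> float:
    ratio = ConvexHull(cluster_coords).volume / len(cluster_coords)  # A_c / N_i
    return ratio * np.exp(-ratio)                           # IS = (A_c/N_i) exp(-A_c/N_i)
```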

3.4 Automated Paper Scoring

  • LLM Representation Regression: Scores are predicted as $s_i = A_c(\mathrm{Rep}(d_i))$, where $\mathrm{Rep}(d_i)$ is the deep representation of manuscript $d_i$, with the loss minimized as:

$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} (\hat{s}_i - s_i)^2$$

(Xu et al., 7 Sep 2024).
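
As a concrete instance of this objective, the sketch below trains a small regression head on pre-computed representations with the mean-squared-error loss; the head architecture and optimizer settings are illustrative assumptions.

```python
# Sketch of training the regression head A_c with the MSE objective above,
# assuming representations Rep(d_i) are pre-computed. The head architecture
# and optimizer settings are illustrative assumptions.
import torch
from torch import nn

def train_regressor(reps: torch.Tensor, scores: torch.Tensor, epochs: int = 200) -> nn.Module:
    # reps: (n, dim) document representations; scores: (n,) human review scores.
    head = nn.Sequential(nn.Linear(reps.shape[1], 256), nn.ReLU(), nn.Linear(256, 1))
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()  # L = (1/n) * sum_i (s_hat_i - s_i)^2
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(head(reps).squeeze(-1), scores)
        loss.backward()
        optimizer.step()
    return head
```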

4. Key Results and Comparative Findings

  • Scientific idea generation quality as measured by LiveIdeaBench is not strongly predicted by standard general intelligence or problem-solving scores. For instance, QwQ-32B-preview demonstrates creative performance nearly matching the strongest proprietary models despite significant gaps in general AI benchmarks (Ruan et al., 23 Dec 2024).
  • Professional-grade design tasks remain challenging: the best generative model (FLUX-1) achieves an overall score of 22.48, far below professional standards, and generalist models typically score in the single digits (Liang et al., 16 Dec 2024).
  • Embedding-based quantitative assessments align well with expert designer judgments: clusters defined via DBSCAN-UMAP on idea vectors correspond strongly to expert similarity labels, indicating that computational metrics capture much of the semantic variance perceived by humans (Sankar et al., 11 Sep 2024).
  • Representation-based evaluators trained on LLM hidden states can predict peer-review scores for scientific papers with high fidelity (over 86% of scores within two points of human annotators), outperforming generation-based approaches and static benchmarks like SciBERT (Xu et al., 7 Sep 2024).

5. Implementation, Automation, and Resources

  • Dynamic Automation: The evaluation process is automated at scale. In scientific creativity evaluation, a rotating panel of LLMs—periodically updated—serves as independent critics, and the scientific keyword pool is regularly refreshed from real-time analytics to ensure contemporary relevance (Ruan et al., 23 Dec 2024).
  • Auto-Evaluation in Design: IDEA-Bench leverages multimodal LLMs (e.g., Gemini 1.5 Pro) to automatically rephrase, judge, and score generated images against defined criteria. Each result is scored three times and then averaged for objectivity, as sketched after this list (Liang et al., 16 Dec 2024).
  • Open Datasets and Toolkits: Benchmarks, task datasets, evaluation toolkits, and leaderboards supporting reproducible evaluation and comparison are made available through open repositories such as https://github.com/ali-vilab/IDEA-Bench (Liang et al., 16 Dec 2024).
  • Mathematical Tool Integration: The use of PCA, UMAP, and DBSCAN in idea space analysis allows for efficient and interpretable automation of idea diversity selection, reducing reliance on expert judgment and augmenting novice designers' decision-making (Sankar et al., 11 Sep 2024).
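
The repeated-scoring step of the auto-evaluation can be sketched as follows; `judge_image` stands in for a call to a multimodal judge model and is an assumption of this sketch, not the released evaluation harness.

```python
# Sketch of the repeated-scoring step: each generated image is judged three
# times against the task criteria and the scores are averaged. `judge_image`
# stands in for a multimodal-LLM call and is an assumption of this sketch.
from statistics import mean
from typing import Callable

def auto_score(image_path: str, criteria: list[str],
               judge_image: Callable[[str, list[str]], float],
               repeats: int = 3) -> float:
    # Repeat the judgment to smooth sampling variance, then average.
    return mean(judge_image(image_path, criteria) for _ in range(repeats))
```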

6. Implications, Challenges, and Future Directions

  • LLM Training and Benchmarking: LiveIdeaBench demonstrates that divergent thinking and scientific creativity constitute a separate axis of model evaluation, distinct from general intelligence, reasoning, or coding skills. Improving LLM creative performance may require targeted datasets and objectives, beyond those for general problem-solving (Ruan et al., 23 Dec 2024).
  • Evaluation Limitations: Current benchmarks are domain-specific (e.g., focusing primarily on computer science papers for automated reviewer tasks, (Xu et al., 7 Sep 2024)) and may require adaptation for broader disciplinary coverage.
  • Professional Design Gap: Contemporary generative models remain markedly short of professional designer standards, especially in multimodal, multi-image, and style- or identity-preservation tasks (Liang et al., 16 Dec 2024).
  • Ethical and Practical Balancing: There is an emerging need for evaluation systems that enforce ethical constraints (preventing harmful idea generation) without unnecessarily restricting creative output (Ruan et al., 23 Dec 2024).
  • Adaptivity and Relevance: The benchmarks' design, with dynamic keyword and judge-panel updates and open-ended but structured scoring frameworks, keeps them current and useful for ongoing model development and scientific research support.

7. Broader Impact and Connections

LiveIdeaBench provides a comprehensive, empirically grounded, and open framework for assessing, comparing, and driving forward LLM development in creative, scientific, and professional settings. It influences both AI research and practical tool-building for scientific discovery, design ideation, and objective peer review, and is complemented by related efforts such as IdeaBench (Guo et al., 31 Oct 2024) and hybrid mathematical-evaluation frameworks (Sankar et al., 11 Sep 2024, Xu et al., 7 Sep 2024). Its automated multidimensional assessment approach is poised to facilitate advances in AI-driven hypothesis generation, interdisciplinary exploration, and next-generation intelligent design systems.