Creative Beam Search (CBS)
- CBS is a response generation framework that integrates Diverse Beam Search with an in-model evaluation phase to enhance creativity in large language models.
- It employs a two-stage process where diverse candidates are generated via a diversity-promoting beam search and then vetted by a model-based judge to select the most creative output.
- Empirical evaluations show that human judges preferred CBS outputs in about 45% of comparisons against standard sampling, and that only 29% of final CBS outputs coincided with the highest-likelihood candidate, underscoring its creative advantage.
Creative Beam Search (CBS) is a response generation and validation framework designed to enhance the creativity and quality of outputs from LLMs. By integrating Diverse Beam Search to produce a wide spectrum of candidate responses and employing an LLM as a judge during a self-assessment phase, CBS mimics core aspects of the human creative process. The two-stage architecture, which first generates diverse candidates and then selects the best via model-internal evaluation, aims to bridge the gap between standard machine generation practices and human creativity, ensuring that the selected response is both diverse and possesses creative merit.
1. CBS Framework and Motivation
Creative Beam Search is motivated by the observation that standard machine generation, particularly when using likelihood-focused decoding methods, fails to capture the intentionality and creative process observed in humans. CBS addresses this by explicitly decomposing the process into two phases:
- Response Generation: Producing multiple, diverse outputs via Diverse Beam Search, thereby emulating brainstorming or idea exploration.
- Response Validation: Applying an LLM-as-a-Judge to evaluate and select among these candidates, simulating reflective self-assessment and quality filtering akin to human creative review.
This approach is distinct from conventional decoding, where the single highest likelihood output is chosen, often resulting in generic or repetitive responses.
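The two-phase decomposition can be sketched as a minimal pipeline. Here `generate_diverse` and `judge_most_creative` are hypothetical stand-ins for a diverse-beam decoder and an LLM-as-a-Judge call; the names are illustrative, not from the paper:

```python
def creative_beam_search(prompt, generate_diverse, judge_most_creative,
                         num_candidates=5):
    """Sketch of the two-phase CBS pipeline (illustrative, not the
    reference implementation)."""
    # Phase 1: response generation -- diverse candidates via Diverse Beam Search.
    candidates = generate_diverse(prompt, n=num_candidates)
    # Phase 2: response validation -- the model votes for the most creative one.
    best_index = judge_most_creative(prompt, candidates)
    return candidates[best_index]
```

Any decoder that returns multiple candidates and any evaluator that returns an index can be plugged in; the framework only fixes the generate-then-validate structure.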
2. Diverse Beam Search: Enhancing Candidate Diversity
Diverse Beam Search (DBS) is central to CBS’s candidate generation process. Traditional beam search treats the entire beam budget $B$ as a single group, leading to candidate sequences that are minor variations of each other. In contrast, DBS splits the beam into $G$ groups, with each group producing $B/G$ sequences. At each step, candidate selection balances standard model likelihood against a diversity-promoting penalty, usually based on dissimilarity relative to the tokens already chosen by previously processed groups.
The objective for a candidate token $y_t^{(g)}$ in group $g$ at decoding step $t$ can be expressed as:

$$\theta\left(y_t^{(g)}\right) = \log p\left(y_t^{(g)} \mid y_{<t}^{(g)}\right) - \lambda \sum_{g' < g} \delta\left(y_t^{(g)}, y_t^{(g')}\right)$$

where $\lambda > 0$ is a diversity scaling parameter, and the penalty term encourages tokens at corresponding positions to differ across groups (frequently instantiated via Hamming dissimilarity, written here with the Kronecker delta $\delta$). This mechanism directly boosts the semantic and surface variety of responses, expanding the search beyond near-duplicates.
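A minimal sketch of the diversity-penalized scoring step, assuming a Hamming-style penalty that simply counts how many earlier groups already picked a token at this position (the function and parameter names are illustrative):

```python
def diverse_group_scores(log_probs, tokens_chosen_by_prior_groups, lam=0.5):
    """Score candidate tokens for one group at one decoding step:
    model log-probability minus lam times the number of earlier groups
    that already selected the same token at this position
    (a sketch of the DBS objective, not a full decoder)."""
    scores = {}
    for token, logp in log_probs.items():
        penalty = sum(1 for t in tokens_chosen_by_prior_groups if t == token)
        scores[token] = logp - lam * penalty
    return scores
```

With a sufficiently large `lam`, a token that earlier groups already used loses its likelihood advantage, which is exactly how DBS pushes later groups onto different continuations.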
3. LLM-as-a-Judge: Model-Intrinsic Evaluation and Voting
The response validation phase leverages the same or a similar LLM, repurposed as an evaluator rather than a generator. The process involves:
- Listing the top $k$ generated candidates in a prompt.
- Presenting $k$ separate evaluation prompts, each with a rotated candidate ordering to counteract positional bias.
- In each prompt, the LLM votes for the most creative response.
- Aggregating votes to select the final response, resolving ties by reverting to the initial DBS ranking.
This process ensures that the final output is not only likely or diverse but also rated highest for creativity by a model-driven selection criterion. The evaluation is not solely based on model log-likelihood but on qualitative criteria encoded within the LLM’s learned representation.
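The rotation-and-vote procedure above can be sketched as follows. `vote_fn` is a hypothetical stand-in for the LLM evaluation call (it receives one rotated ordering and returns the chosen candidate), and ties fall back to the original DBS ranking as described:

```python
from collections import Counter

def judge_vote(candidates, vote_fn):
    """Aggregate votes over rotated candidate orderings to counteract
    positional bias; `candidates` is assumed to be in DBS rank order,
    which serves as the tie-breaker (illustrative sketch)."""
    k = len(candidates)
    votes = Counter()
    for i in range(k):
        # Ordering i+1..k followed by 1..i, one evaluation prompt per rotation.
        rotated = candidates[i:] + candidates[:i]
        votes[vote_fn(rotated)] += 1
    top = max(votes.values())
    # Filter in original (DBS-ranked) order so ties revert to the DBS ranking.
    winners = [c for c in candidates if votes[c] == top]
    return winners[0]
```

Because every candidate appears in every position exactly once across the $k$ prompts, a purely positional voter produces a uniform tie, which then resolves to the top DBS candidate.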
4. Qualitative Evaluation and Empirical Insights
The effectiveness of CBS was assessed through a qualitative experiment involving 31 graduate student judges and 217 comparative response assessments against standard sampling outputs (temperature and nucleus sampling). The findings are:
- CBS-preferred outputs constituted approximately 45% of cases, outnumbering those from standard sampling algorithms.
- In about 25% of cases, responses from both methods were deemed too similar, indicating inherent task limitations or convergence in model outputs.
- Only 29% of final CBS outputs matched the highest-likelihood candidate from DBS alone, indicating that the LLM-as-a-Judge phase substantively shifts the selection away from mere likelihood maximization.
These results substantiate the claim that response validation is a non-trivial and necessary complement to multi-candidate generation.
5. Mathematical Formulation
In the DBS phase, candidate generation optimizes a composite objective at each decoding step:

$$\theta\left(y_t^{(g)}\right) = \log p\left(y_t^{(g)} \mid y_{<t}^{(g)}\right) - \lambda \sum_{g' < g} \delta\left(y_t^{(g)}, y_t^{(g')}\right)$$

where $y_t^{(g)}$ denotes the token at time step $t$ in group $g$, $\delta$ is the Kronecker delta function, and the sum runs over the tokens selected by earlier groups at the same time step. The LLM-as-a-Judge phase aggregates votes over $k$ different orderings:
- For each $i \in \{1, \dots, k\}$, a prompt is constructed listing the candidates in rotated order $i$ to $k$, followed by $1$ to $i-1$.
- The candidate receiving the most total votes is returned.
This two-tier system ensures diversity not only in candidate pool composition but also in evaluation and selection, moving beyond simple argmax-based decoding.
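The group-by-group decoding described above can be run end-to-end on a toy model. This is an illustrative one-beam-per-group simplification (not the paper's implementation); `next_logprobs` stands in for a language model's next-token distribution:

```python
def toy_diverse_decode(next_logprobs, groups=3, steps=4, lam=1.0):
    """Toy diverse beam search with one beam per group: at each step,
    group g picks the token maximizing log p(token | prefix) minus
    lam times the number of earlier groups that picked the same token
    at this step. `next_logprobs(prefix)` returns a dict
    token -> log-probability (a stand-in for an LM)."""
    prefixes = [[] for _ in range(groups)]
    for _ in range(steps):
        chosen_this_step = []
        for g in range(groups):
            logps = next_logprobs(tuple(prefixes[g]))
            # Penalize tokens already chosen by earlier groups at this step.
            best = max(logps, key=lambda tok: logps[tok]
                       - lam * chosen_this_step.count(tok))
            prefixes[g].append(best)
            chosen_this_step.append(best)
    return [" ".join(p) for p in prefixes]
```

Even with a model that always prefers the same token, the penalty forces each group onto a different continuation, which is the candidate-pool diversity the judge phase then votes over.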
6. Significance, Implications, and Applications
Creative Beam Search provides a systematic framework for elevating the creative quality of machine-generated responses. By structuring output selection into both diversity-optimized generation and explicit validation, CBS enables:
- Substantially greater surface and semantic variation among model outputs.
- Systematic filtering of candidates not just for feasibility but for attributes such as creativity, novelty, or fit-to-purpose, as defined by the evaluation prompt and model.
- Application in domains including creative writing aids, ideation tools, co-creative systems in art or design, and other content-generation tasks where conventional sampling or beam search is unsatisfactory.
A plausible implication is that CBS can be extended to incorporate additional, task-adaptive evaluation criteria by modifying the LLM-as-a-Judge prompt, enabling further alignment with human preferences or domain-specific standards.
7. Limitations and Future Directions
Although CBS demonstrates marked improvements in controlled qualitative studies, several challenges remain:
- The reliance on LLM self-evaluation may propagate biases intrinsic to the model itself, especially if not fine-tuned for evaluative tasks.
- The approach incurs additional computational cost due to multiple rounds of generation and evaluation.
- Quantitative metrics remain to be explored for measuring creative quality at scale.
Future work may address these limitations by integrating external or human-in-the-loop evaluation layers, optimizing diversity and validation metrics jointly during training, or generalizing the method for other modalities (e.g., images or code).
CBS thus stands as a principled advance toward intentional, model-guided creativity, combining structured search diversity and model-internal selection to approximate key aspects of human-inspired creative workflows (Franceschelli et al., 30 Apr 2024).