CrIB: Benchmark for Combinational Creativity
- CrIB is a cross-domain testbed comprising 2,000 problems designed to evaluate AI's capacity for combinational p-creativity by recombining existing knowledge.
- It spans five distinct domains—painting, alien language, photobash, narrative, and dessert—each engineered to preclude rote memorization and require genuine invention.
- The benchmark rigorously compares performance against uncreative baselines using normalized scores, highlighting the challenges in achieving true computational innovation.
The Creative Invention Benchmark (CrIB) is a 2,000-problem, cross-domain testbed explicitly crafted to evaluate combinational p-creativity: the capacity of agents to invent artifacts novel to themselves by recombining existing knowledge. Unlike standard benchmarks that largely test recognition or direct recall, CrIB isolates the inventive aspect of cognition—its focus, design, metrics, and challenge domains are all engineered to resist brute-force, rote, or “memorization” strategies and instead demand authentic combinational invention (Guzdial et al., 2018).
1. Theoretical Foundation: Combinational p-Creativity
CrIB operationalizes Margaret Boden’s taxonomy of creativity, specifically p-creativity, which denotes novelty from the perspective of the individual agent, irrespective of absolute human novelty. CrIB further restricts the scope to combinational creativity, where new artifacts are constructed via recombination of known constituents. Formally, for a given knowledge base $K$, a solution $s$ is both p-creative and combinational if $s \notin K$ and $s$ can be produced as $s = f(k_1, \dots, k_n)$ with $k_1, \dots, k_n \in K$, for some constructive function $f$.
Each CrIB problem instance provides (a minimal interface sketch follows this list):
- $K$: the initial, domain-specific knowledge base (a set of colors, words, images, graphs, or recipes).
- $A$: a domain-specific application function for extending or transforming candidate solutions with elements of $K$.
- $R$: a domain-specific reset function that returns an empty candidate.
- $S$: an oracle scoring function in $[0,1]$ measuring agreement with the hidden target $t$.
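To make this interface concrete, here is a minimal Python sketch of what a problem instance exposes. The class and method names are illustrative assumptions, not the benchmark's published API; the public repository may organize this differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class CribProblem(ABC):
    """Hypothetical sketch of a CrIB problem instance (names are illustrative)."""

    @property
    @abstractmethod
    def knowledge_base(self) -> Iterable[Any]:
        """K: the initial, domain-specific set of elements (colors, words, images, ...)."""

    @abstractmethod
    def reset(self) -> Any:
        """R: return a fresh, empty candidate (blank canvas, empty sentence, ...)."""

    @abstractmethod
    def apply(self, candidate: Any, element: Any) -> Any:
        """A: extend or transform the candidate with one element of the knowledge base."""

    @abstractmethod
    def score(self, candidate: Any) -> float:
        """S: oracle score in [0, 1] measuring agreement with the hidden target."""
```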
2. Benchmark Structure and Design Objectives
CrIB’s architecture addresses several key methodological goals:
- Cross-domain generality: Five distinct domains (painting, alien language, photobashing, narrative, and dessert recipes).
- Necessity of invention: Each target is out-of-$K$ and requires the agent to synthesize elements of $K$ rather than select them wholesale.
- Scalability and rigor: 2,000 problems (400 per domain), sorted by proxy difficulty measures such as $|K|$ or the number of subcomponents in the target.
- Preventing trivialization: Domains and problem generation preclude enumeration or preloading of exhaustive knowledge sets, ensuring invention is required.
This design parallels the community role of canonical datasets such as MNIST, aiming to galvanize comparable progress and focus creative AI research on a precisely scoped yet nontrivial challenge space (Guzdial et al., 2018).
3. Problem Domains and Generation Strategies
CrIB encompasses five problem types, each defined by distinct primitives, recombinatorial logics, and scoring functions:
| Domain | Elements of $K$ | Application Function ($A$) | Oracle Score ($S$) |
|---|---|---|---|
| Painting | 2–6 RGB colors | Paint a pixel with a color | 1 − normalized per-image L1 distance to the target |
| Alien Language | 3–9 “words” | Append a word to the sentence | Fraction of words matching the target, in order |
| Photobash | 2–9 images | Stamp an image onto the canvas | Same metric as Painting |
| Narrative | 2–4 plot graphs | Submit a candidate story | Prefix-matching of plot events against the target |
| Dessert | 3–130 recipes | Submit a candidate recipe | Fraction of overlapping ingredients |
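To illustrate how two of these oracles behave, the sketch below implements one plausible reading of the alien-language (in-order word match) and dessert (ingredient overlap) scores in Python. The function names and edge-case handling are assumptions, not the benchmark's reference code.

```python
from typing import List, Set


def language_score(candidate: List[str], target: List[str]) -> float:
    """Alien-language oracle (sketch): fraction of target positions whose word
    is matched by the candidate sentence, compared in order."""
    if not target:
        return 1.0
    matches = sum(1 for c, t in zip(candidate, target) if c == t)
    return matches / len(target)


def dessert_score(candidate: Set[str], target: Set[str]) -> float:
    """Dessert oracle (sketch): fraction of the target recipe's ingredients
    that also appear in the candidate recipe."""
    if not target:
        return 1.0
    return len(candidate & target) / len(target)
```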
Generation strategies follow a common paradigm: enumerate or harvest a seed set; synthesize a target outside $K$ by merging or splitting seed elements; form $K$ from the target's relevant subcomponents plus distractors; and sort the resulting problems by difficulty proxies. Each domain’s targets are selected to maximize the necessity for recombinatorial action rather than retrieval or template matching.
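A schematic version of this recipe, with hypothetical combine and decompose helpers standing in for the domain-specific merge/split operations, might look as follows; it is a sketch of the described workflow, not the benchmark's actual generator.

```python
import random
from typing import Any, Callable, List, Tuple


def generate_problem(seed_set: List[Any],
                     combine: Callable[[Any, Any], Any],
                     decompose: Callable[[Any], List[Any]],
                     n_distractors: int = 3) -> Tuple[List[Any], Any]:
    """Sketch of the per-domain generation recipe: merge seeds into an
    out-of-K target, then build K from its subcomponents plus distractors."""
    a, b = random.sample(seed_set, 2)
    target = combine(a, b)                 # target lies outside K by construction
    components = decompose(target)         # subcomponents relevant to the target
    pool = [x for x in seed_set if x not in components]
    distractors = random.sample(pool, min(n_distractors, len(pool)))
    knowledge_base = components + distractors
    random.shuffle(knowledge_base)
    return knowledge_base, target
```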
4. Evaluation Metrics and Baseline Agents
Performance is assessed via two primary metrics in each domain:
- Null Score ($S_{\text{null}}$): the score obtained by an agent that applies no operations.
- Uncreative Max ($S_{\text{ucm}}$): the best score attainable by reuse alone, i.e., by choosing the single element of $K$ most similar to the hidden target $t$.
To highlight genuine invention, final agent performance $S$ is normalized as
$$\hat{S} = \frac{S - S_{\text{ucm}}}{1 - S_{\text{ucm}}}.$$
Agents performing no better than the Uncreative Max achieve $\hat{S} \le 0$; perfect matches yield $\hat{S} = 1$.
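A minimal Python rendering of this normalization, consistent with the anchors above and with the Painting numbers in Section 5, is shown below; the uncreative_max helper and its score_single argument are hypothetical.

```python
from typing import Any, Callable, Iterable


def uncreative_max(knowledge_base: Iterable[Any],
                   score_single: Callable[[Any], float]) -> float:
    """Sketch: best score attainable by reusing a single knowledge-base element,
    where score_single(k) is assumed to score the best candidate built from k alone."""
    return max(score_single(k) for k in knowledge_base)


def normalized_score(agent_score: float, ucm: float) -> float:
    """Sketch of the normalization above: 0 at the Uncreative Max, 1 for a
    perfect match, negative below the Uncreative Max (assumes ucm < 1)."""
    return (agent_score - ucm) / (1.0 - ucm)


# Quick check against the Painting row in Section 5: an agent stuck at the
# Null score of 0.70 with Uncreative Max 0.85 gets (0.70 - 0.85) / 0.15 = -1.0,
# in line with the Random agent's reported -0.99.
print(round(normalized_score(0.70, 0.85), 2))  # -1.0
```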
Initial agent baselines include:
- Random Agent: uniformly samples elements from $K$ and applies the domain application function $A$.
- Genetic Algorithms (GA$_{100}$, GA$_{1000}$): individual genomes represent full candidate solutions; mutation rate 0.7; uniform crossover; parent selection by tournament on $S$. Population sizes and iteration counts are 100/100 and 1000/1000, respectively (a hedged sketch follows this list).
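The sketch below shows one way such a GA baseline could be wired up in Python using the stated mutation rate, uniform crossover, and tournament selection; the genome encoding, tournament size, and generational replacement scheme are assumptions, since they are not specified above.

```python
import random
from typing import Any, Callable, List


def genetic_baseline(knowledge_base: List[Any],
                     score: Callable[[List[Any]], float],
                     genome_length: int,
                     population_size: int = 100,
                     generations: int = 100,
                     mutation_rate: float = 0.7,
                     tournament_size: int = 3) -> List[Any]:
    """Sketch of a GA baseline: genomes are fixed-length sequences of
    knowledge-base elements treated as full candidate solutions."""

    def random_genome() -> List[Any]:
        return [random.choice(knowledge_base) for _ in range(genome_length)]

    def tournament(population: List[List[Any]]) -> List[Any]:
        # Parent selection: best-scoring genome among a small random sample.
        return max(random.sample(population, tournament_size), key=score)

    def uniform_crossover(a: List[Any], b: List[Any]) -> List[Any]:
        # Each gene is drawn from either parent with equal probability.
        return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(genome: List[Any]) -> List[Any]:
        # With probability 0.7, replace one gene with a random element of K.
        if random.random() < mutation_rate:
            genome = list(genome)
            genome[random.randrange(len(genome))] = random.choice(knowledge_base)
        return genome

    population = [random_genome() for _ in range(population_size)]
    for _ in range(generations):
        population = [mutate(uniform_crossover(tournament(population),
                                               tournament(population)))
                      for _ in range(population_size)]
    return max(population, key=score)
```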
5. Quantitative Baseline Results
Baseline performance provides insight into both problem difficulty and opportunities for algorithmic advancement. Domain-level statistics:
| Domain | Null | Uncreative Max |
|---|---|---|
| Painting | 0.70 | 0.85 |
| Language | 0.00 | 0.72 |
| Photobash | 0.76 | 0.89 |
| Narrative | 0.00 | 0.45 |
| Dessert | 0.00 | 0.49 |
| Avg. | 0.29 | 0.61 |
Average normalized scores ($\hat{S}$) for the initial agents:
| Agent | Painting | Language | Photobash | Narrative | Dessert | Total |
|---|---|---|---|---|---|---|
| Random | –0.99 | –2.42 | –1.50 | –0.32 | –0.50 | –1.15 |
| GA$_{100}$ | –0.99 | –1.41 | 0.02 | 0.76 | 0.35 | –0.25 |
| GA$_{1000}$ | –0.91 | –1.19 | 0.17 | 0.81 | 0.35 | –0.14 |
These results demonstrate that even sophisticated black-box search (genetic algorithms with large populations run for many generations) struggles to outperform the uncreative baselines, underscoring the intrinsic difficulty of CrIB and the nontrivial invention it requires (Guzdial et al., 2018).
6. Limitations and Extensions
CrIB exhibits several inherent and intentional limitations:
- Scale: While 2,000 problems constitute a significant challenge set, the overall scale is modest relative to large vision/language datasets. Painting, language, and dessert domains can be expanded with minimal human labor; photobash and narrative demand labor-intensive human validation.
- Domain Coverage: CrIB currently excludes domains such as music or product design. Planned future iterations will broaden coverage to provide even more robust cross-domain evaluation of combinational p-creativity.
- Preventing Knowledge Leakage: The benchmark penalizes attempts to bypass recombinatorial invention through preloading exhaustive domain knowledge (e.g., hard-coded color mixtures). Ensuring the integrity of the inventive process is fundamental to CrIB’s objectives.
7. Community Impact and Future Directions
The release of CrIB supplies computational creativity research with a formalized benchmark and a public repository (https://github.com/mguzdial3/CrIB) for the rigorous evaluation of general-purpose inventive systems. The benchmark emphasizes reporting not only final normalized scores ($\hat{S}$) but also the efficiency of creative reasoning, as measured by knowledge-base growth and the number of training steps required to reach a solution. CrIB is positioned as a catalyst for advancing models that exhibit invention beyond mere pattern extraction or recall, thereby filling a critical gap in the current landscape of goal-driven AI benchmarks (Guzdial et al., 2018).