CrIB: Benchmark for Combinational Creativity
- CrIB is a cross-domain testbed comprising 2,000 problems designed to evaluate AI's capacity for combinational p-creativity by recombining existing knowledge.
- It spans five distinct domains—painting, alien language, photobash, narrative, and dessert—each engineered to preclude rote memorization and require genuine invention.
- The benchmark rigorously compares performance against uncreative baselines using normalized scores, highlighting the challenges in achieving true computational innovation.
The Creative Invention Benchmark (CrIB) is a 2,000-problem, cross-domain testbed explicitly crafted to evaluate combinational p-creativity: the capacity of agents to invent artifacts novel to themselves by recombining existing knowledge. Unlike standard benchmarks that largely test recognition or direct recall, CrIB isolates the inventive aspect of cognition—its focus, design, metrics, and challenge domains are all engineered to resist brute-force, rote, or “memorization” strategies and instead demand authentic combinational invention (Guzdial et al., 2018).
1. Theoretical Foundation: Combinational p-Creativity
CrIB operationalizes Margaret Boden’s taxonomy of creativity, specifically p-creativity, which denotes novelty from the perspective of the individual agent, irrespective of absolute human novelty. CrIB further restricts the scope to combinational creativity, where new artifacts are constructed via recombination of known constituents. Formally, for a given knowledge base $K$, a solution $s$ is both p-creative and combinational if $s \notin K$ and $s$ can be produced as $s = f(k_1, \dots, k_n)$ with $k_1, \dots, k_n \in K$, for some constructive function $f$.
Each CrIB problem instance provides (a minimal interface sketch follows this list):
- $K$: the initial, domain-specific knowledge base (a set of colors, words, images, graphs, or recipes).
- $A$: a domain-specific application function for extending or transforming candidate solutions with elements of $K$.
- $R$: a domain-specific reset function that returns an empty candidate.
- $S$: an oracle scoring function in $[0,1]$ measuring agreement with the hidden target $t$.
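To make this interface concrete, here is a minimal Python sketch of what a problem instance exposes. The class and method names are illustrative assumptions, not the benchmark's published API; the public repository may organize this differently.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class CribProblem(ABC):
    """Hypothetical sketch of a CrIB problem instance (names are illustrative)."""

    @property
    @abstractmethod
    def knowledge_base(self) -> Iterable[Any]:
        """K: the initial, domain-specific set of elements (colors, words, images, ...)."""

    @abstractmethod
    def reset(self) -> Any:
        """R: return a fresh, empty candidate (blank canvas, empty sentence, ...)."""

    @abstractmethod
    def apply(self, candidate: Any, element: Any) -> Any:
        """A: extend or transform the candidate with one element of the knowledge base."""

    @abstractmethod
    def score(self, candidate: Any) -> float:
        """S: oracle score in [0, 1] measuring agreement with the hidden target."""
```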
2. Benchmark Structure and Design Objectives
CrIB’s architecture addresses several key methodological goals:
- Cross-domain generality: Five distinct domains (painting, alien language, photobashing, narrative, and dessert recipes).
- Necessity of invention: Each target is out-of-$K$ and requires the agent to synthesize elements of $K$ rather than select them wholesale.
- Scalability and rigor: 2,000 problems (400 per domain), sorted by proxy difficulty measures such as $|K|$ or the number of subcomponents in the target.
- Preventing trivialization: Domains and problem generation preclude enumeration or preloading of exhaustive knowledge sets, ensuring invention is required.
This design parallels the community role of canonical datasets such as MNIST, aiming to galvanize comparable progress and focus creative AI research on a precisely scoped yet nontrivial challenge space (Guzdial et al., 2018).
3. Problem Domains and Generation Strategies
CrIB encompasses five problem types, each defined by distinct primitives, recombinatorial logics, and scoring functions:
| Domain | Elements of $K$ | Application Function ($A$) | Oracle Score ($S$) |
|---|---|---|---|
| Painting | 2–6 RGB colors | Paint a pixel with a color | 1 − normalized per-image L1 distance to the target |
| Alien Language | 3–9 “words” | Append a word to the sentence | Fraction of words matching the target, in order |
| Photobash | 2–9 images | Stamp an image onto the canvas | Same metric as Painting |
| Narrative | 2–4 plot graphs | Submit a candidate story | Prefix-matching of plot events against the target |
| Dessert | 3–130 recipes | Submit a candidate recipe | Fraction of overlapping ingredients |
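To illustrate how two of these oracles behave, the sketch below implements one plausible reading of the alien-language (in-order word match) and dessert (ingredient overlap) scores in Python. The function names and edge-case handling are assumptions, not the benchmark's reference code.

```python
from typing import List, Set


def language_score(candidate: List[str], target: List[str]) -> float:
    """Alien-language oracle (sketch): fraction of target positions whose word
    is matched by the candidate sentence, compared in order."""
    if not target:
        return 1.0
    matches = sum(1 for c, t in zip(candidate, target) if c == t)
    return matches / len(target)


def dessert_score(candidate: Set[str], target: Set[str]) -> float:
    """Dessert oracle (sketch): fraction of the target recipe's ingredients
    that also appear in the candidate recipe."""
    if not target:
        return 1.0
    return len(candidate & target) / len(target)
```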
Generation strategies follow a common paradigm: enumerate or harvest a seed set; synthesize a target outside $K$ by merging or splitting seed elements; form $K$ from the target's relevant subcomponents plus distractors; and sort the resulting problems by difficulty proxies. Each domain’s targets are selected to maximize the necessity for recombinatorial action rather than retrieval or template matching.
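A schematic version of this recipe, with hypothetical combine and decompose helpers standing in for the domain-specific merge/split operations, might look as follows; it is a sketch of the described workflow, not the benchmark's actual generator.

```python
import random
from typing import Any, Callable, List, Tuple


def generate_problem(seed_set: List[Any],
                     combine: Callable[[Any, Any], Any],
                     decompose: Callable[[Any], List[Any]],
                     n_distractors: int = 3) -> Tuple[List[Any], Any]:
    """Sketch of the per-domain generation recipe: merge seeds into an
    out-of-K target, then build K from its subcomponents plus distractors."""
    a, b = random.sample(seed_set, 2)
    target = combine(a, b)                 # target lies outside K by construction
    components = decompose(target)         # subcomponents relevant to the target
    pool = [x for x in seed_set if x not in components]
    distractors = random.sample(pool, min(n_distractors, len(pool)))
    knowledge_base = components + distractors
    random.shuffle(knowledge_base)
    return knowledge_base, target
```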
4. Evaluation Metrics and Baseline Agents
Performance is assessed via two primary metrics in each domain:
- Null Score ($S_{\text{null}}$): the score obtained by an agent that applies no operations.
- Uncreative Max ($S_{\text{ucm}}$): the best score attainable by reuse alone, i.e., by choosing the single element of $K$ most similar to the hidden target $t$.
To highlight genuine invention, final agent performance $S$ is normalized as
$$\hat{S} = \frac{S - S_{\text{ucm}}}{1 - S_{\text{ucm}}}.$$
Agents performing no better than the Uncreative Max achieve $\hat{S} \le 0$; perfect matches yield $\hat{S} = 1$.
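A minimal Python rendering of this normalization, consistent with the anchors above and with the Painting numbers in Section 5, is shown below; the uncreative_max helper and its score_single argument are hypothetical.

```python
from typing import Any, Callable, Iterable


def uncreative_max(knowledge_base: Iterable[Any],
                   score_single: Callable[[Any], float]) -> float:
    """Sketch: best score attainable by reusing a single knowledge-base element,
    where score_single(k) is assumed to score the best candidate built from k alone."""
    return max(score_single(k) for k in knowledge_base)


def normalized_score(agent_score: float, ucm: float) -> float:
    """Sketch of the normalization above: 0 at the Uncreative Max, 1 for a
    perfect match, negative below the Uncreative Max (assumes ucm < 1)."""
    return (agent_score - ucm) / (1.0 - ucm)


# Quick check against the Painting row in Section 5: an agent stuck at the
# Null score of 0.70 with Uncreative Max 0.85 gets (0.70 - 0.85) / 0.15 = -1.0,
# in line with the Random agent's reported -0.99.
print(round(normalized_score(0.70, 0.85), 2))  # -1.0
```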
Initial agent baselines include:
- Random Agent: uniformly samples elements from $K$ and applies the domain application function $A$.
- Genetic Algorithms (GA$_{100}$, GA$_{1000}$): individual genomes represent full candidate solutions; mutation rate 0.7; uniform crossover; parent selection by tournament on $S$. Population sizes and iteration counts are 100/100 and 1000/1000, respectively (a hedged sketch follows this list).
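The sketch below shows one way such a GA baseline could be wired up in Python using the stated mutation rate, uniform crossover, and tournament selection; the genome encoding, tournament size, and generational replacement scheme are assumptions, since they are not specified above.

```python
import random
from typing import Any, Callable, List


def genetic_baseline(knowledge_base: List[Any],
                     score: Callable[[List[Any]], float],
                     genome_length: int,
                     population_size: int = 100,
                     generations: int = 100,
                     mutation_rate: float = 0.7,
                     tournament_size: int = 3) -> List[Any]:
    """Sketch of a GA baseline: genomes are fixed-length sequences of
    knowledge-base elements treated as full candidate solutions."""

    def random_genome() -> List[Any]:
        return [random.choice(knowledge_base) for _ in range(genome_length)]

    def tournament(population: List[List[Any]]) -> List[Any]:
        # Parent selection: best-scoring genome among a small random sample.
        return max(random.sample(population, tournament_size), key=score)

    def uniform_crossover(a: List[Any], b: List[Any]) -> List[Any]:
        # Each gene is drawn from either parent with equal probability.
        return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

    def mutate(genome: List[Any]) -> List[Any]:
        # With probability 0.7, replace one gene with a random element of K.
        if random.random() < mutation_rate:
            genome = list(genome)
            genome[random.randrange(len(genome))] = random.choice(knowledge_base)
        return genome

    population = [random_genome() for _ in range(population_size)]
    for _ in range(generations):
        population = [mutate(uniform_crossover(tournament(population),
                                               tournament(population)))
                      for _ in range(population_size)]
    return max(population, key=score)
```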
5. Quantitative Baseline Results
Baseline performance provides insight into both problem difficulty and opportunities for algorithmic advancement. Domain-level statistics:
| Domain | Null | Uncreative Max |
|---|---|---|
| Painting | 0.70 | 0.85 |
| Language | 0.00 | 0.72 |
| Photobash | 0.76 | 0.89 |
| Narrative | 0.00 | 0.45 |
| Dessert | 0.00 | 0.49 |
| Avg. | 0.29 | 0.61 |
Average normalized scores ($\hat{S}$) for the initial agents:
| Agent | Painting | Language | Photobash | Narrative | Dessert | Total |
|---|---|---|---|---|---|---|
| Random | –0.99 | –2.42 | –1.50 | –0.32 | –0.50 | –1.15 |
| GA$_{100}$ | –0.99 | –1.41 | 0.02 | 0.76 | 0.35 | –0.25 |
| GA$_{1000}$ | –0.91 | –1.19 | 0.17 | 0.81 | 0.35 | –0.14 |
These results demonstrate that even sophisticated black-box search (genetic algorithms with large populations run for many generations) struggles to outperform the uncreative baselines, underscoring the intrinsic difficulty of CrIB and the nontrivial invention it requires (Guzdial et al., 2018).
6. Limitations and Extensions
CrIB exhibits several inherent and intentional limitations:
- Scale: While 2,000 problems constitute a significant challenge set, the overall scale is modest relative to large vision/language datasets. Painting, language, and dessert domains can be expanded with minimal human labor; photobash and narrative demand labor-intensive human validation.
- Domain Coverage: CrIB currently excludes domains such as music or product design. Planned future iterations will broaden coverage to provide even more robust cross-domain evaluation of combinational p-creativity.
- Preventing Knowledge Leakage: The benchmark penalizes attempts to bypass recombinatorial invention through preloading exhaustive domain knowledge (e.g., hard-coded color mixtures). Ensuring the integrity of the inventive process is fundamental to CrIB’s objectives.
7. Community Impact and Future Directions
The release of CrIB supplies computational creativity research with a formalized benchmark and a public repository (https://github.com/mguzdial3/CrIB) for the rigorous evaluation of general-purpose inventive systems. The benchmark emphasizes reporting not only final normalized scores ($\hat{S}$) but also the efficiency of creative reasoning, as measured by knowledge-base growth and the number of training steps required to reach a solution. CrIB is positioned as a catalyst for advancing models that exhibit invention beyond mere pattern extraction or recall, thereby filling a critical gap in the current landscape of goal-driven AI benchmarks (Guzdial et al., 2018).