CP-Bench: Constraint Programming Benchmark Set
- CP-Bench is a comprehensive dataset that aggregates 101 diverse combinatorial problems to support rigorous evaluation of both classical solvers and LLM-driven modelling.
- It encompasses various modelling frameworks, including MiniZinc, CPMpy, and OR-Tools, enabling cross-framework comparisons through standardized, reproducible experiments.
- The benchmark facilitates practical assessments of solver performance and automated constraint modelling by leveraging detailed problem metadata and output-level evaluation protocols.
The CP-Bench Constraint Programming Benchmark Set is a comprehensive, structured resource designed to support the development, evaluation, and comparison of constraint programming solutions—including solvers, modelling assistants, and automated systems—across a wide spectrum of combinatorial problem classes and modelling frameworks. CP-Bench fulfills two closely related but distinct roles: it serves both as a repository of diverse, rigorously specified constraint satisfaction and optimization problems for benchmarking solvers and as an evaluation suite for measuring the modelling capabilities of automated agents and LLMs.
1. Definition, Scope, and Motivation
CP-Bench refers specifically to benchmark datasets and problem instance collections that facilitate rigorous experimental assessment of constraint programming (CP) systems. Its core goal is to capture the diversity, complexity, and modelling variety found in real-world combinatorial problems, supporting standardized, reproducible experiments for both solver developers and researchers in automated CP modelling.
CP-Bench explicitly addresses the limitations of prior datasets, which were frequently small, domain-specific, or syntactically homogeneous, and thus failed to reflect the broad range of abstractions, encodings, and constraints encountered in practical CP applications (Michailidis et al., 6 Jun 2025). By providing large, curated, and heterogeneous problem suites, CP-Bench enables meaningful performance and modelling comparisons across languages (e.g., MiniZinc, Python/CPMpy, OR-Tools CP-SAT), abstraction levels, and automation pipelines.
2. Dataset Composition, Sources, and Characteristics
The composition of CP-Bench emphasizes breadth across problem classes, abstraction levels, and constraint varieties. The main dataset introduced in (Michailidis et al., 6 Jun 2025) contains 101 combinatorial problems encompassing both constraint satisfaction (CSP) and optimization (COP) instances. Key features include:
- Source Diversity: Problems are aggregated from CSPLib, CPMpy examples, Håkan Kjellerstrand’s repository, and academic course material. This ensures inclusion of classical, industrial, and research-driven CP benchmarks.
- Model and Instance Variety:
- Decision variable counts range from 3 to nearly 1,000 per instance.
- Constraints per instance range from 1 to over 2,000.
- 241 unique constraint types are represented, capturing an extensive spectrum of CP modelling constructs (including global and intensional constraints).
- Problem Metadata: Each instance is enriched with metadata, the original natural language problem statement, ground-truth executable model(s) (notably in CPMpy), and optional sample data for reproducibility and solution verification (Michailidis et al., 6 Jun 2025); a minimal sketch of such a record appears at the end of this section.
- Evaluation-Oriented Structuring: Instances are formatted for direct ingestion by modelling assistants and solvers, with output-based evaluation protocols that do not require mapping internal model variables.
This diversity reflects the aim to test both expressive power and modelling capabilities across frameworks and to stress robustness in both classical CP solvers and generative agents.
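For concreteness, the kind of ground-truth executable CPMpy model bundled with an instance might resemble the minimal sketch below; the surrounding record and its field names are illustrative assumptions, not CP-Bench's actual schema.

```python
import cpmpy as cp

# Illustrative instance record; field names are assumptions, not CP-Bench's schema.
instance = {
    "name": "toy_team_selection",
    "description": "Pick 4 distinct skill levels between 1 and 9 summing to exactly 20, "
                   "minimising the highest level picked.",
    "data": {"n": 4, "total": 20},
}

# A ground-truth executable CPMpy model for the toy instance above.
n, total = instance["data"]["n"], instance["data"]["total"]
level = cp.intvar(1, 9, shape=n, name="level")
model = cp.Model(cp.AllDifferent(level), cp.sum(level) == total)
model.minimize(cp.max(level))

if model.solve():                                    # output used for solution verification
    print(level.value(), model.objective_value())
```

Because evaluation is output-based, only the reported assignment and objective value need to match the reference; internal variable names are irrelevant.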
3. Evaluation Methodologies and Systematic Assessment
CP-Bench supports a variety of experimental protocols for both solver benchmarking and automated modelling evaluation:
- Solver Performance: For traditional CP benchmarking, instances are executed in their framework-specific formats (MiniZinc, XCSP3, CPMpy, OR-Tools, etc.), and solvers are compared on feasibility, solution quality (for COPs), optimality gap, runtime, and additional instance-dependent criteria (Boussemart et al., 2016, Col et al., 2019, Lan et al., 19 Feb 2025).
- Automated Modelling and LLMs: CP-Bench assesses natural-language-to-CP-model translation by LLMs and agents. Rather than matching generated symbolic code against reference models, correctness is determined via output-level equivalence—i.e., does the generated model yield a feasible solution (and the correct optimum, for COPs) when solved (Michailidis et al., 6 Jun 2025, Szeider, 10 Aug 2025). The formal metric is

$$
\mathrm{correct}(\mathcal{M}) =
\begin{cases}
1 & \text{if } \mathcal{S}(\mathcal{M}) \neq \emptyset \ \text{and}\ f(s) = f(s^{*}) \text{ for the returned } s \in \mathcal{S}(\mathcal{M}),\\
0 & \text{otherwise,}
\end{cases}
$$

where $\mathcal{S}(\mathcal{M})$ denotes the solution set of the generated model $\mathcal{M}$, $f$ is the objective (if any; the objective condition is dropped for pure CSPs), and $s^{*}$ is the reference solution (a checking sketch follows this list).
- Prompt Engineering and Inference Protocols: Experiments systematically vary system prompts, in-context learning, repeated sampling, and self-verification (model-generated test/debug cycles) to quantify how robust solution generation is to documentation level, code samples, and self-correction (Michailidis et al., 6 Jun 2025); a sampling sketch appears at the end of this section.
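The output-level check formalized above might be implemented along the following lines; this is a hedged sketch against CPMpy's public API, and the helper name and argument layout are assumptions rather than CP-Bench's actual evaluation harness.

```python
import cpmpy as cp

def output_level_correct(ref_model, ref_vars, candidate_assignment,
                         objective=None, ref_objective_value=None):
    """Accept a candidate output if pinning the reference model's decision
    variables to it keeps the model satisfiable and, for COPs, the reference
    optimum is matched; no mapping of internal variable names is needed."""
    check = cp.Model(ref_model.constraints)             # copy the reference constraints
    for var, val in zip(ref_vars, candidate_assignment):
        check += (var == val)                           # pin variables to the candidate output
    if not check.solve():                               # infeasible: some constraint is violated
        return False
    if objective is not None:                           # COP: optimum value must match
        return objective.value() == ref_objective_value
    return True
```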
All methodologies emphasize reproducibility and comparability, supporting empirical claims about solver architecture (portfolio vs single-engine), agent workflows (fixed vs agentic iterative), and prompt/coding interface design (Szeider, 10 Aug 2025, Amadini et al., 2015).
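Repeated sampling pairs naturally with output-level majority voting over the generated models' solutions; a minimal sketch, assuming a hypothetical `generate_and_solve` callable that returns a hashable solution tuple or `None` on failure:

```python
from collections import Counter

def sample_and_vote(problem_text, generate_and_solve, n_samples=8):
    """Draw several candidate models, solve each, and return the most frequent output."""
    outputs = [generate_and_solve(problem_text) for _ in range(n_samples)]
    outputs = [o for o in outputs if o is not None]     # drop failed generations
    if not outputs:
        return None
    solution, _votes = Counter(outputs).most_common(1)[0]
    return solution
```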
4. Modelling Languages, Abstraction Levels, and Framework Integration
CP-Bench instances and evaluation protocols target multiple abstraction layers and modelling languages:
- High-Level Abstract Languages: MiniZinc, CPMpy, and other Python-embedded frameworks provide declarative, expressive interfaces familiar to both researchers and modern code LLMs.
- Low-Level Solver APIs: OR-Tools CP-SAT (Python) and OPL deliver compact yet strict representations, exposing solver-specific modelling idioms and requiring deeper technical fluency (the two levels are contrasted in the sketch at the end of this section).
- Format Interoperability: The dataset structure and reference models are consciously designed to map between these layers, allowing comparative experiments on how abstraction and syntax influence both solver performance and automation success.
- Benchmark Querying and Repository Access: Tools like the XCSP3-based web platform (Boussemart et al., 2016) and metamodel-driven repositories enable sophisticated querying, subsetting, and selection of problem classes or parameter regimes relevant for target evaluations.
This language and format diversity is crucial for benchmarking agentic or LLM-driven systems, as the ability to generate executable models depends on both the expressivity of the target interface and LLM exposure to its syntax during pretraining (Michailidis et al., 6 Jun 2025).
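To make the abstraction gap concrete, the same toy constraint (four mutually distinct values in 1..9 summing to 15) can be stated at both levels; both snippets are illustrative sketches rather than CP-Bench reference models.

```python
# High-level, Python-embedded modelling with CPMpy.
import cpmpy as cp

x = cp.intvar(1, 9, shape=4, name="x")
m_high = cp.Model(cp.AllDifferent(x), cp.sum(x) == 15)
if m_high.solve():
    print(x.value())

# Lower-level modelling against the OR-Tools CP-SAT API.
from ortools.sat.python import cp_model

m_low = cp_model.CpModel()
y = [m_low.NewIntVar(1, 9, f"y{i}") for i in range(4)]
m_low.AddAllDifferent(y)
m_low.Add(sum(y) == 15)
solver = cp_model.CpSolver()
if solver.Solve(m_low) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(v) for v in y])
```

The CPMpy version stays close to the mathematical statement, while the CP-SAT version requires explicit variable registration and solver handling—the kind of syntactic distance that influences LLM generation success.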
5. Applications in Automated and Agentic Constraint Modelling
The CP-Bench Benchmark Set has been instrumental in both traditional benchmarking and recent advances in agentic, LLM-driven CP system development:
- LLM-Based Modelling Support: CP-Bench has demonstrated that Python-based frameworks (especially CPMpy) enable LLMs to achieve higher modelling accuracy (up to 70% solution-level accuracy in the best configurations), outperforming both domain-specific languages such as MiniZinc and lower-level solver APIs under the same prompt and sampling regime (Michailidis et al., 6 Jun 2025).
- Agentic Strategies: The agentic approach, exemplified by CP-Agent (Szeider, 10 Aug 2025), utilizes iterative, stateful code generation and runtime feedback in a persistent IPython kernel. With prompt-encoded domain expertise, such agents have been shown to solve all 101 CP-Bench problems by iterative reasoning, dynamic debugging, and self-verification—contrasting with the significant failure rate of fixed workflow methods.
- Prompt Engineering and Self-Verification: Experiments show that documentation-rich prompts, code sampling with majority voting, and self-debugging protocols significantly improve LLM output quality and completeness, whereas retrieval-augmented in-context learning was less impactful in this domain (Michailidis et al., 6 Jun 2025); a schematic self-verification loop is sketched after this list.
- Multi-Objective Benchmarking: CP-Bench instances facilitate not only absolute solution assessment but also finer-grained evaluations—such as discriminative ability between solvers or difficulty grading for adaptive instance generation (Dang et al., 2022).
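A fixed generate-solve-verify loop of the kind compared against agentic workflows might look like the sketch below; `ask_llm` is a hypothetical stand-in for an LLM client, and the convention that generated code defines a `model` object is an assumption made for illustration.

```python
def solve_with_self_verification(problem_text, ask_llm, max_attempts=5):
    """Schematic loop: generate CPMpy code, execute it, feed runtime errors or
    unsatisfiability back to the LLM, and stop once a feasible model is found."""
    feedback = ""
    for _ in range(max_attempts):
        code = ask_llm(
            f"Write a CPMpy model named `model` for this problem:\n{problem_text}\n{feedback}"
        )
        scope = {}
        try:
            exec(code, scope)                     # run the generated modelling code
            model = scope["model"]
            if model.solve():                     # feasible model found: return it solved
                return model
            feedback = "The previous model was unsatisfiable; revise the constraints."
        except Exception as exc:                  # runtime feedback drives self-debugging
            feedback = f"The previous code raised {exc!r}; please fix it."
    return None
```

Agentic systems such as CP-Agent go further by keeping state across attempts in a persistent IPython kernel rather than restarting from a fresh namespace on each iteration.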
6. Benchmarking Impact, Reproducibility, and Future Prospects
The CP-Bench suite is designed to standardize CP solver and assistant evaluation, catalyzing advances in modelling automation and solver engineering:
- Benchmarking Standardization: By collecting and harmonizing diverse CP instances in accessible formats, CP-Bench enables direct, reproducible comparison of solvers, portfolio systems, LLM agents, and automated modellers across a representative subset of practical combinatorial problems (Michailidis et al., 6 Jun 2025).
- Driving Research in Automated Modelling: CP-Bench underscores that modelling quality and automation are gated by both agent/LLM capabilities and interface design; results indicate that general coding agents, when supplied with domain expertise via a project prompt, outperform fixed-architecture or static code-generation approaches (Szeider, 10 Aug 2025).
- Open Challenges and Future Directions: CP-Bench motivates further work expanding coverage to industrial-scale and multi-objective instances, richer format interconversion, automated instance generation for difficulty/discrimination, and deeper integration into continuous integration pipelines for solver and modeller development (Dang et al., 2022, Boussemart et al., 2016). Additional opportunities include exploring agentic strategies that blend prompt learning with adaptive search, and expanding format coverage to include SMT, MIP, and other constraint-based paradigms.
- Supporting Broader CP Adoption: By lowering the barrier for non-expert users to obtain executable, correct models from problem descriptions, and by fostering more robust, accurate, and general solver and modelling agent construction, CP-Bench enhances the practical scalability and accessibility of constraint programming as a core optimization and reasoning technology.
7. Summary Table: Core Features of CP-Bench
| Feature/Aspect | Description | Example/Source |
|---|---|---|
| Problem Classes | CSP and COP, with 241 unique constraint types | (Michailidis et al., 6 Jun 2025) |
| Modelling Frameworks | MiniZinc, CPMpy, OR-Tools CP-SAT, OPL, XCSP3 | (Boussemart et al., 2016, Michailidis et al., 6 Jun 2025) |
| Evaluation Metric | Output-level feasibility/optimality; solution accuracy | (Michailidis et al., 6 Jun 2025) |
| Number of Instances | 101 (main set), extensible | (Michailidis et al., 6 Jun 2025) |
| Source Repositories | CSPLib, CPMpy examples, Kjellerstrand, academic courses | (Michailidis et al., 6 Jun 2025) |
| Supported Applications | Solver benchmarking, LLM/agent-powered modelling, instance grading | (Michailidis et al., 6 Jun 2025, Dang et al., 2022) |
In summary, CP-Bench constitutes an essential tool for advancing reproducible, rigorous research in constraint programming—defining a common ground for evaluating solvers, agentic modellers, and LLM-powered assistants on a standard suite of challenging, representative problems (Michailidis et al., 6 Jun 2025, Szeider, 10 Aug 2025).