CP-Model-Zoo: CP Model Retrieval System
- CP-Model-Zoo is a constraint programming system that maps MiniZinc models and natural language descriptions into a unified embedding space for efficient retrieval.
- It leverages state-of-the-art embedding techniques and LLM-generated descriptions to achieve high retrieval accuracy, with empirical MRR values of 0.90 or higher.
- The scalable and incremental design integrates new models seamlessly without manual labeling, enhancing model accessibility for both novice and expert users.
CP-Model-Zoo refers to an intelligent system for constraint programming (CP) model retrieval and tutoring, designed to bridge the gap between natural language problem descriptions and expert-level CP models. The primary objective is to support both novice and expert users in finding high-quality, validated MiniZinc source code for combinatorial problems without manual data labeling or deep technical expertise in constraint programming languages. CP-Model-Zoo leverages a database of models, enriched with layered natural language descriptions, and utilizes state-of-the-art embedding techniques to facilitate accurate and scalable retrieval based on user queries.
1. System Architecture and Database Organization
CP-Model-Zoo organizes its knowledge base as a database comprising CP models in MiniZinc format, optionally augmented by natural language descriptions targeting various expertise levels (novice, intermediate, expert). Each database entry can consist of just the source code, or a composite string formed by concatenating code with multiple LLM-generated descriptions. These text variants are produced using prompt-engineered LLMs.
For retrieval efficiency, each entry $m_i$ is mapped by a pre-trained embedding function $f$ to a vector in $\mathbb{R}^d$, resulting in a set of stored model embeddings $\{e_i = f(m_i)\}_{i=1}^{N}$. This preprocessing eliminates the need for database-wide post-query computations, as only the user's query embedding is computed at runtime.
The system is accessed via a web-based interface (e.g., using Gradio), which supports query entry and displays ranked, clickable model candidates. Each candidate reveals the corresponding MiniZinc code for inspection or reuse.
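A minimal sketch of this preprocessing step, assuming a sentence-transformers-style encoder; the checkpoint name, `build_entry`, and `precompute_embeddings` are illustrative placeholders rather than the system's actual implementation:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Hypothetical ModernBERT-based embedding checkpoint; the exact model used
# by CP-Model-Zoo is an assumption here.
encoder = SentenceTransformer("nomic-ai/modernbert-embed-base")

def build_entry(source_code: str, descriptions: list[str]) -> str:
    """Concatenate MiniZinc source code with its LLM-generated descriptions
    into a single composite string, as stored in the database."""
    return "\n\n".join([source_code, *descriptions])

def precompute_embeddings(entries: list[str]) -> np.ndarray:
    """Embed every database entry once, offline. Rows are L2-normalised so
    that cosine similarity reduces to a dot product at query time."""
    return encoder.encode(entries, normalize_embeddings=True)
```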
2. Natural Language Query Processing and Model Retrieval
The user submits a combinatorial problem description in natural language (query $q$). CP-Model-Zoo processes this input by:
- Generating an embedding vector $e_q = f(q)$.
- Calculating the cosine similarity with each database entry: $\mathrm{sim}(e_q, e_i) = \dfrac{e_q \cdot e_i}{\lVert e_q \rVert \, \lVert e_i \rVert}$.
- Retrieving the top-$k$ models that maximize this similarity, i.e., the $k$ entries with the largest $\mathrm{sim}(e_q, e_i)$.
This mechanism delivers rapid and scalable retrieval, as all model embeddings are precomputed and only a single query embedding needs to be generated at runtime. The retrieval strategy exploits semantic similarity in learned embedding space, effectively handling linguistic variance and expert jargon.
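A sketch of the query-time step under the same assumptions as above (unit-normalised entry embeddings stored in a NumPy matrix, one row per database entry; names are illustrative):

```python
import numpy as np

def retrieve_top_k(query: str, encoder, entry_embeddings: np.ndarray,
                   entry_ids: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Embed the query, score it against all precomputed entry embeddings by
    cosine similarity, and return the k best-matching model ids with scores."""
    e_q = encoder.encode([query], normalize_embeddings=True)[0]
    sims = entry_embeddings @ e_q              # dot product == cosine on unit vectors
    top = np.argsort(-sims)[:k]                # indices of the k largest scores
    return [(entry_ids[i], float(sims[i])) for i in top]
```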
3. Embedding Techniques and Description Augmentation
The embedding model (e.g., ModernBERT) is chosen for its ability to represent both source code and natural language descriptions in a unified semantic space. Augmenting model entries with LLM-generated descriptions at multiple expertise levels significantly improves retrieval accuracy. Descriptions are crafted via systematic prompt engineering that simulates user queries with varying domain knowledge, increasing semantic coverage.
Empirical results show that combinations of source code and intermediate-level descriptions yield the highest mean reciprocal rank (MRR) scores. This approach ensures that both technical details and lay explanations contribute to the matching process.
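The actual prompts are not reproduced in this summary; the sketch below only illustrates the idea of level-specific prompting, with hypothetical wording and a generic `chat` callable standing in for whichever LLM API is used:

```python
# Hypothetical level-specific prompts; the real prompt engineering is more elaborate.
LEVEL_PROMPTS = {
    "D1_novice": "Describe, in plain everyday language, the problem this MiniZinc model solves.",
    "D2_intermediate": "Describe the combinatorial problem this MiniZinc model solves, "
                       "informally naming its decision variables and main constraints.",
    "D3_expert": "Give a precise constraint-programming description of this MiniZinc model: "
                 "variables, domains, constraints, and objective.",
}

def generate_descriptions(source_code: str, chat) -> dict[str, str]:
    """Produce one description per expertise level for a given model.
    `chat` is any callable that sends a prompt to an LLM and returns its reply."""
    return {level: chat(f"{prompt}\n\nMiniZinc model:\n{source_code}")
            for level, prompt in LEVEL_PROMPTS.items()}
```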
4. Mathematical Formalism and Evaluation Metrics
The mathematical foundation of CP-Model-Zoo centers on vector-space retrieval and similarity computation. The embedding function $f \colon \mathcal{T} \to \mathbb{R}^d$ maps any text string (source code, description, or query) to a $d$-dimensional vector. Cosine similarity measures the angular alignment between query and model vectors, providing a robust metric for semantic closeness.
Model retrieval performance is measured using Mean Reciprocal Rank (MRR): $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$, where $\mathrm{rank}_i$ is the position of the correct model for query $i$ in the ranked output. Higher MRR values indicate superior retrieval (MRR = 1 implies the correct model is always ranked first).
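A small helper for computing this metric over a query set (names are illustrative, not the system's evaluation harness):

```python
def mean_reciprocal_rank(rankings: list[list[str]], gold_ids: list[str]) -> float:
    """MRR over a query set: rankings[i] is the ranked list of model ids
    returned for query i, gold_ids[i] the id of its correct model."""
    total = 0.0
    for ranking, gold in zip(rankings, gold_ids):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)   # ranks are 1-based
        # a query whose correct model does not appear contributes 0
    return total / len(gold_ids)
```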
5. Experimental Evaluation and Performance Analysis
CP-Model-Zoo has been empirically evaluated using:
- MiniZinc example problems and CSPLib entries with human-written queries (natural language descriptions).
- Variants of retrieval embeddings: source code only (SC), and combinations of SC with LLM-generated descriptions (D1–novice, D2–intermediate, D3–expert).
Key findings:
- Augmenting retrieval with intermediate-level LLM descriptions consistently yields MRR values at or above 0.90, outperforming source-code-only embeddings.
- Even with only source code, the system maintains robust retrieval, indicating the efficacy of pre-trained LLM embeddings.
- Leave-one-out tests confirm generalization; new models are incorporated incrementally with no need for additional data labeling.
- Performance remains high even on queries simulating different user expertise levels, demonstrating accessibility for both novices and practitioners.
6. Incremental Enhancement and Scalability
New CP models are added to the database in a modular, scalable fashion: generate LLM-based descriptions at multiple expertise levels, compute their embeddings, and store the results alongside the source code. This design ensures that CP-Model-Zoo remains up-to-date and is not bottlenecked by manual curation or data labeling. The incremental architecture allows for continuous improvement and real-time integration of new modeling techniques or problem domains.
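Combining the earlier sketches, incrementally adding one model could look as follows (reusing the hypothetical `encoder`, `build_entry`, and `generate_descriptions` defined above; the storage layout is likewise an assumption):

```python
import numpy as np

def add_model(model_id: str, source_code: str, chat, encoder,
              entry_ids: list[str], entry_embeddings: np.ndarray) -> np.ndarray:
    """Add one MiniZinc model: generate layered descriptions, build the
    composite entry, embed it, and append it to the stored embedding matrix.
    No retraining and no manual labeling are required."""
    descriptions = generate_descriptions(source_code, chat)
    entry = build_entry(source_code, list(descriptions.values()))
    vec = encoder.encode([entry], normalize_embeddings=True)
    entry_ids.append(model_id)
    return np.vstack([entry_embeddings, vec])
```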
The system is decoupled from particular CP solvers, focusing solely on model retrieval. It provides a foundation for further development toward interactive tutoring systems or automated code synthesis pipelines.
7. Significance and Implications for Constraint Programming Practice
CP-Model-Zoo increases the accessibility of constraint programming by abstracting the details of expert model creation and selection. By relying on expert-validated MiniZinc code and augmenting with natural language descriptions, it democratizes model reuse and understanding across varying skill levels. The avoidance of manual labeling and the robustness of embedding-based retrieval support high scalability.
This suggests that future efforts in constraint programming pedagogy and automated modeling may benefit from such embedding-enhanced knowledge bases, potentially integrating further with code synthesis tools or interactive problem-solving environments. A plausible implication is the emergence of CP-Model-Zoo as the core of intelligent aids for combinatorial optimization in both research and industry.