Canonical Intermediate Representation (CIR)
- Canonical Intermediate Representation (CIR) is a structured semantic layer that standardizes the conversion of natural language descriptions into rigorously defined, solver-ready optimization models.
- It enforces strict grammatical, syntactic, and ordering rules to eliminate ambiguity in problem formulations, ensuring a unique, deterministic mapping to mathematical representations.
- Advances in CIR integrate multi-agent pipelines and paradigm-specific templates for handling complex models like MILP, enhancing mapping accuracy and operational robustness.
The Canonical Intermediate Representation (CIR) is a structured semantic layer that bridges high-level natural language descriptions of optimization problems and their formal mathematical instantiations, most notably in the automated formulation of linear, integer, and mixed-integer programs. CIR acts as a domain-specific language or schema, guaranteeing a unique, unambiguous textual or symbolic representation of problems by enforcing strict grammatical, syntactic, and ordering constraints. Recent research has also extended CIR to encompass a knowledge-driven, multi-agent pipeline capable of handling complex operational semantics across multiple modeling paradigms, thereby decoupling problem intent from its specific solver-friendly realization (Jang, 2022, Lyu et al., 2 Feb 2026).
1. Formal Definition and Canonicality Guarantees
CIR was initially introduced as a minimal domain-specific language for linear programs, defined strictly as a sequence of declarations—one objective, followed by a (possibly empty) list of constraint declarations. The CIR grammar and ordering rules enforce a deterministic mapping from a problem’s semantic structure to a single canonical string, eliminating ambiguity due to commutativity, reordering, or variable renaming. The grammar is:
- Top-level:
CIR ::= ObjectiveDecl ConstraintDecl*
- Objective Declaration:
ObjectiveDecl ::= "maximize" LinearExpr | "minimize" LinearExpr
- Constraint Declaration:
ConstraintDecl ::= LinearExpr CompOp Constant, with CompOp ∈ {“≤”, “≥”} and Constant ∈ ℤ
- Linear Expression Decomposition:
LinearExpr ::= Term ('+' Term)*; Term ::= Coeff Var; Coeff ∈ ℤ, Var ∈ {x, y, z, w, x₁, …}
- Ordering Rules:
- Objective precedes constraints
- Constraints follow a fixed archetype ordering (lower-bound, upper-bound, xy-type, xby-type, sum-type, arbitrary linear, ratio-type)
- Terms within each constraint are lexicographically ordered by variable
- Within identically-shaped constraints, “≤” precedes “≥”
- Variable names are drawn from a fixed vocabulary
This canonicalization ensures that every abstract LP, up to reordering of terms and constraints and renaming of variables, maps to exactly one CIR representation (Jang, 2022).
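The ordering rules can be illustrated with a small canonicalizer. This is a minimal sketch, not the implementation from the cited work: the names `canonicalize` and `archetype_rank` are hypothetical, the archetype ordering is reduced to three buckets (lower bounds, upper bounds, everything else) as a simplifying assumption, and variable renaming is not handled.

```python
from typing import Dict, List, Tuple

# A constraint is (coefficients by variable, comparator, integer constant).
Constraint = Tuple[Dict[str, int], str, int]


def format_expr(coeffs: Dict[str, int]) -> str:
    """Render a linear expression with terms sorted lexicographically by variable."""
    return " + ".join(f"{coeffs[v]} {v}" for v in sorted(coeffs))


def archetype_rank(con: Constraint) -> int:
    """Simplified archetype order (assumption): lower bounds, upper bounds, the rest."""
    coeffs, op, _ = con
    if len(coeffs) == 1:
        return 0 if op == "≥" else 1
    return 2


def canonicalize(sense: str, objective: Dict[str, int],
                 constraints: List[Constraint]) -> str:
    """Return the same string regardless of the input term or constraint order."""
    def key(con: Constraint):
        coeffs, op, const = con
        # Archetype first, then expression shape, then "≤" before "≥", then constant.
        return (archetype_rank(con), format_expr(coeffs), 0 if op == "≤" else 1, const)

    lines = [f"{sense} {format_expr(objective)}"]
    lines += [f"{format_expr(c)} {op} {k}" for c, op, k in sorted(constraints, key=key)]
    return "\n".join(lines)
```

Two inputs that differ only in the order of terms or constraints produce identical output, which is the property the grammar is meant to guarantee.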
2. CIR Representation in Mathematical and LaTeX Notation
While CIR is a textual format for NLP-to-mathematical program pipelines, its semantics are often displayed with conventional mathematical notation:
- Variables: as “x₁, x₂, …, xₙ” or “x, y, z, w”
- Objective:
CIR: maximize 3 x + 4 y; LaTeX: $\max\; 3x + 4y$
- Constraints:
CIR: 1 x + 2 y ≤ 50; LaTeX: $x + 2y \le 50$
In the LaTeX rendering, coefficients of 1 are omitted (“1 x” → “x”), while all other coefficients are written out. This translation guarantees a lossless mapping from CIR to matrix-based solver inputs such as AMPL or JSON (Jang, 2022).
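The mapping from CIR text to a matrix-style input can be sketched as follows. This is an illustrative parser under stated assumptions: one declaration per line, every term carrying an explicit integer coefficient as the grammar requires, and a JSON-friendly dictionary layout (`c`, `A`, `ops`, `b`) that is hypothetical rather than taken from the cited work.

```python
import json
from typing import Dict


def parse_expr(text: str) -> Dict[str, int]:
    """Parse 'c1 v1 + c2 v2 + ...' into {variable: coefficient}."""
    coeffs = {}
    for term in text.split(" + "):
        coeff, var = term.split()
        coeffs[var] = int(coeff)
    return coeffs


def cir_to_matrix(cir: str) -> dict:
    """Convert CIR text (one declaration per line) into a dense matrix description."""
    lines = [ln.strip() for ln in cir.strip().splitlines() if ln.strip()]
    sense, obj_text = lines[0].split(" ", 1)
    objective = parse_expr(obj_text)
    rows = []
    for ln in lines[1:]:
        op = "≤" if "≤" in ln else "≥"
        lhs, rhs = ln.split(op)
        rows.append((parse_expr(lhs.strip()), op, int(rhs)))
    # Fix a column order so that c, A and b line up.
    variables = sorted(set(objective) | {v for r, _, _ in rows for v in r})
    return {
        "sense": sense,
        "variables": variables,
        "c": [objective.get(v, 0) for v in variables],
        "A": [[r.get(v, 0) for v in variables] for r, _, _ in rows],
        "ops": [op for _, op, _ in rows],
        "b": [b for _, _, b in rows],
    }


def term_to_latex(coeff: int, var: str) -> str:
    """Drop a coefficient of 1 and keep every other coefficient."""
    return var if coeff == 1 else f"{coeff}{var}"


model = cir_to_matrix("maximize 3 x + 4 y\n1 x ≥ 0\n1 y ≥ 0\n1 x + 2 y ≤ 50")
print(json.dumps(model, ensure_ascii=False))
```

Because every coefficient, comparator, and constant is carried over into the dictionary, the CIR text can be regenerated from it, which is what lossless means here.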
3. Rule-to-Constraint Schema and Semantic Layering
Advances in CIR have generalized its scope from just linear programs to a flexible structured schema suitable for logic-heavy, mixed-integer, and quadratic programs. In this schema, a problem description is transformed into:
- A set of operational rules, together with the entities and parameters they refer to
- A multiset of instantiated templates, where each template records:
- Core intent (e.g., NonOverlap, Precedence, Capacity)
- Source rule from the input
- Modeling paradigm (time-indexed, continuous-time, event-based, arc-flow, etc.)
The CIR-to-model mapping takes the union, over all instantiated templates, of the paradigm-specific constraint sets: each template contributes the full set of mathematical constraints for its intent under its chosen paradigm. This explicit intermediate abstraction ensures that all feasible solutions of the resulting model satisfy the operational rules by construction (Lyu et al., 2 Feb 2026).
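The union over templates can be sketched in a few lines. In assumed notation, with $\mathcal{T}$ the multiset of templates $(i, r, p)$ and $\Phi(i, p)$ the constraint set for intent $i$ under paradigm $p$, the model's constraints are $\bigcup_{(i, r, p) \in \mathcal{T}} \Phi(i, p)$. The `Template` class, its `params` field, and the `LIBRARY` dictionary below are hypothetical names chosen for illustration, and constraints are emitted as plain strings rather than solver objects.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Template:
    intent: str               # core intent, e.g. "Precedence"
    rule: str                 # source rule text from the input
    paradigm: str             # modeling paradigm, e.g. "continuous-time"
    params: Tuple[str, ...]   # bound entities (hypothetical field)


# Hypothetical library: (intent, paradigm) -> function emitting constraint strings.
LIBRARY: Dict[Tuple[str, str], Callable[..., List[str]]] = {
    ("Precedence", "continuous-time"): lambda i, j: [
        f"s[{j}] >= s[{i}] + p[{i}]",
    ],
    ("NonOverlap", "continuous-time"): lambda i, j: [
        f"s[{i}] + p[{i}] <= s[{j}] + M*(1 - y[{i},{j}])",
        f"s[{j}] + p[{j}] <= s[{i}] + M*y[{i},{j}]",
    ],
}


def build_model(templates: List[Template]) -> FrozenSet[str]:
    """Union of the constraint sets contributed by every instantiated template."""
    constraints = set()
    for t in templates:
        emit = LIBRARY[(t.intent, t.paradigm)]
        constraints.update(emit(*t.params))
    return frozenset(constraints)
```

Because the result is a set, a rule that is extracted twice contributes its constraints only once, while each template still keeps a pointer to the source rule that produced it.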
4. Constraint Archetypes and Modeling Paradigms
CIR organizes operational rules into a compact set of archetypes, each with reference templates in multiple modeling paradigms. Key archetypes include:
| Archetype | Example CIR Expression | LaTeX/Mathematical Formulation |
|---|---|---|
| Assignment | | |
| Precedence | | |
| Capacity | | |
| Non-overlap | Disjunction in start/completion or time-index variables | Various forms |
| Time Windows | | |
| Integer-Multiple | with constraints | |
| Logical/Indicator | | |
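
For orientation, several of these archetypes have familiar textbook forms. The block below lists standard formulations in assumed notation: binary assignment variables $x_{jm}$, start times $s_j$, durations $p_j$, demands $d_j$, capacities $C_m$, window bounds $a_j$ and $b_j$, a quantity $q$ with unit size $u$, a big-M constant $M$, an ordering binary $y_{ij}$, and an indicator binary $z$. They are illustrative and not necessarily the templates used in the cited work.

```latex
\begin{align*}
\text{Assignment:} \quad & \sum_{m} x_{jm} = 1 && \forall j \\
\text{Precedence:} \quad & s_j \ge s_i + p_i && \text{for each pair } i \prec j \\
\text{Capacity:} \quad & \sum_{j} d_j\, x_{jm} \le C_m && \forall m \\
\text{Non-overlap:} \quad & s_i + p_i \le s_j + M(1 - y_{ij}), \;\; s_j + p_j \le s_i + M y_{ij} && y_{ij} \in \{0,1\} \\
\text{Time windows:} \quad & a_j \le s_j \le b_j && \forall j \\
\text{Integer-multiple:} \quad & q = k\,u && k \in \mathbb{Z}_{\ge 0} \\
\text{Indicator:} \quad & a^{\top} x \le b + M(1 - z) && z \in \{0,1\}
\end{align*}
```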
Each archetype is instantiated for several paradigms: time-indexed MILP, continuous-time MILP, event-based, and arc-flow models. The CIR library maps each abstract intent to its paradigm-specific templates.
5. CIR-Driven NLP Pipelines: Tagging, Embedding, and Multi-Agent Architectures
The text-to-CIR mapping employs a two-stage (and, in recent work, multi-agent) approach:
BART with Entity Tags (Jang, 2022):
- Word-piece tokens and entity tags (identifying coefficients, variables, comparators, etc.) are embedded and summed at each position $i$:
$e_i = E_{\text{tok}}(w_i) + \lambda\, E_{\text{tag}}(t_i)$
with tag embeddings weighted by a scaling factor $\lambda$.
- The BiBART encoder processes the summed embeddings to produce CIR outputs, which remain canonical due to grammar constraints.
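The tag-weighted embedding sum can be illustrated without a deep-learning framework. This is a toy sketch, assuming randomly initialized lookup tables in place of trained BART embeddings; `make_table` and `embed` are hypothetical helper names, and the default `lam=5.0` mirrors the λ=5 setting reported in the results table.

```python
import random
from typing import List


def make_table(rows: int, dim: int, seed: int) -> List[List[float]]:
    """Random lookup table standing in for a trained embedding matrix."""
    rnd = random.Random(seed)
    return [[rnd.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def embed(token_ids: List[int], tag_ids: List[int],
          tok_table: List[List[float]], tag_table: List[List[float]],
          lam: float = 5.0) -> List[List[float]]:
    """Per position, return token embedding + lam * entity-tag embedding."""
    if len(token_ids) != len(tag_ids):
        raise ValueError("each token needs exactly one entity tag")
    return [
        [t + lam * g for t, g in zip(tok_table[w], tag_table[k])]
        for w, k in zip(token_ids, tag_ids)
    ]


tok_table = make_table(rows=10, dim=4, seed=0)   # 10 word pieces (toy vocabulary)
tag_table = make_table(rows=3, dim=4, seed=1)    # 3 entity tags (toy tag set)
vectors = embed([1, 2], [0, 2], tok_table, tag_table)
```

A larger `lam` makes the entity tag dominate the summed vector, which is one way to read the reported benefit of high λ values.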
Multi-Agent Rule-to-Constraint (R2C) Pipeline (Lyu et al., 2 Feb 2026):
- Extractor tags and extracts rules and entities from natural language.
- Mapper retrieves CIR intent templates per rule and binds parameters.
- Formalizer composes the full model, emitting both mathematical and solver-executable formulations.
- Checker verifies structural and semantic soundness end-to-end.
Retrieval of CIR templates is based on domain tags and semantic similarity, supported by in-memory and FAISS vector indices.
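The retrieval step (filter by domain tag, then rank by semantic similarity) can be sketched as below. This is a toy stand-in under loud assumptions: bag-of-words vectors replace learned embeddings, a linear scan replaces the FAISS index, and the `TEMPLATES` entries and the `retrieve` function are hypothetical.

```python
import math
from collections import Counter
from typing import List

# Hypothetical template library with domain tags and short descriptions.
TEMPLATES = [
    {"intent": "Precedence", "tags": {"scheduling"},
     "text": "job must finish before another job begins"},
    {"intent": "NonOverlap", "tags": {"scheduling"},
     "text": "at most one job can use a machine at a time"},
    {"intent": "Capacity", "tags": {"scheduling", "routing"},
     "text": "total load must not exceed capacity"},
]


def bow(text: str) -> Counter:
    """Bag-of-words vector (stand-in for a learned sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(rule: str, domain_tag: str, k: int = 1) -> List[str]:
    """Keep templates carrying the domain tag, then rank them by similarity to the rule."""
    query = bow(rule)
    candidates = [t for t in TEMPLATES if domain_tag in t["tags"]]
    ranked = sorted(candidates, key=lambda t: cosine(query, bow(t["text"])), reverse=True)
    return [t["intent"] for t in ranked[:k]]
```

Filtering by tag before ranking keeps the similarity search small and prevents a template from an unrelated domain from winning on surface wording alone.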
6. Illustrative Examples
Linear Programming Canonicalization (Jang, 2022):
- NL statement: "A factory makes A ($3), B ($4); 1/2 machine-hours; ≤50 machine-hours; decide how many to produce; nonnegativity."
- CIR:
maximize 3 x + 4 y
x ≥ 0
y ≥ 0
1 x + 2 y ≤ 50
Rich Rule Extraction and Paradigm Selection (Lyu et al., 2 Feb 2026):
- NL statement: "Each job must finish before any downstream job begins, and at most one job can use Machine 1 at a time. Minimize makespan."
- Extractor maps precedence and non-overlap to intents.
- Mapper chooses continuous-time paradigm, instantiates correct templates.
- Formalizer emits constraints in both math and Gurobi-Python code.
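The continuous-time templates in this example can be made concrete with a small feasibility check. This is a pure-Python sketch rather than the Gurobi-Python code the Formalizer emits: the helper names are hypothetical, the big-M value is an arbitrary assumption, and candidate schedules are checked instead of optimized.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[Dict[str, float], Dict[str, float], Dict[Tuple[str, str], int], float], bool]


def precedence(i: str, j: str) -> Check:
    """Continuous-time precedence: job j starts only after job i completes."""
    return lambda s, p, y, M: s[j] >= s[i] + p[i]


def non_overlap(i: str, j: str) -> Check:
    """Big-M disjunction: y[(i, j)] = 1 means i runs before j, 0 means j before i."""
    def check(s, p, y, M):
        return (s[i] + p[i] <= s[j] + M * (1 - y[(i, j)])
                and s[j] + p[j] <= s[i] + M * y[(i, j)])
    return check


def feasible(constraints: List[Check], s, p, y, M: float = 1000) -> bool:
    """True when start times s and ordering binaries y satisfy every constraint."""
    return all(c(s, p, y, M) for c in constraints)


def makespan(s: Dict[str, float], p: Dict[str, float]) -> float:
    """Completion time of the last job."""
    return max(s[j] + p[j] for j in s)


durations = {"A": 3, "B": 2, "C": 4}
rules = [precedence("A", "B"), non_overlap("A", "C")]  # A before B; A and C share Machine 1
schedule = {"A": 0, "B": 3, "C": 3}
ok = feasible(rules, schedule, durations, {("A", "C"): 1})
```

In a solver-backed version the start times and the ordering binary would be decision variables and the makespan would be minimized; here they are fixed by hand so the constraint logic can be inspected directly.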
7. Empirical Performance and Practical Impact
Reported results on the LPWP and ORCOpt-Bench benchmarks indicate that the CIR layer contributes materially to formulation accuracy:
| System | CIR Layer | Key Accuracy (%) | Benchmark |
|---|---|---|---|
| BART-large, λ=5 | Yes | 88.46 (declaration accuracy) | LPWP validation (Jang, 2022) |
| R2C (7B LLM) | Yes | 47.2 (Accuracy Rate, AR) | ORCOpt-Bench (Lyu et al., 2 Feb 2026) |
| GPT-5 | No | 39.8 (AR) | ORCOpt-Bench (Lyu et al., 2 Feb 2026) |
| R2C ablation (no CIR) | No | 31.6 (AR) | ORCOpt-Bench (Lyu et al., 2 Feb 2026) |
| R2C+reflection | Yes | 54.0 (AR) | ORCOpt-Bench (Lyu et al., 2 Feb 2026) |
Key findings include:
- Tag- and entity-aware embeddings, especially with high λ, notably increase mapping accuracy in classic LP settings (Jang, 2022).
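- Introducing CIR in multi-agent frameworks yields a substantial accuracy gain over both proprietary and open baselines: on ORCOpt-Bench, R2C with a 7B LLM reaches 47.2 AR, versus 39.8 for GPT-5 and 31.6 for the R2C ablation without CIR (Lyu et al., 2 Feb 2026).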
- CIR enables training-agnostic, retrieval-based formulations that remain competitive with tuned LLMs on industry-scale tasks.
- The reflection enhancement in R2C further boosts robustness in complex, compositional problem settings, raising AR from 47.2 to 54.0 on ORCOpt-Bench (Lyu et al., 2 Feb 2026).
In summary, CIR establishes an extensible, verifiable, and semantically sound layer between natural-language problem specification and executable optimization code. By encoding intents as archetypes with paradigm-specific instantiations, CIR enables both symbolic and data-driven models to robustly and transparently translate diverse operational rules into solver-ready programs (Jang, 2022, Lyu et al., 2 Feb 2026).