Canonical Intermediate Representation (CIR)

Updated 9 February 2026

Canonical Intermediate Representation (CIR) is a structured semantic layer that standardizes the conversion of natural language descriptions into rigorously defined, solver-ready optimization models.
It enforces strict grammatical, syntactic, and ordering rules to eliminate ambiguity in problem formulations, ensuring a unique, deterministic mapping to mathematical representations.
Advances in CIR integrate multi-agent pipelines and paradigm-specific templates for handling complex models like MILP, enhancing mapping accuracy and operational robustness.

The Canonical Intermediate Representation (CIR) is a structured semantic layer that bridges high-level natural language descriptions of optimization problems and their formal mathematical instantiations, most notably in the automated formulation of linear, integer, and mixed-integer programs. CIR acts as a domain-specific language or schema: by enforcing strict grammatical, syntactic, and ordering constraints, it gives each problem a unique, unambiguous textual or symbolic representation. Recent research has extended CIR into a knowledge-driven, multi-agent pipeline that handles complex operational semantics across multiple modeling paradigms, thereby decoupling problem intent from its specific solver-friendly realization (Jang, 2022; Lyu et al., 2 Feb 2026).

1. Formal Definition and Canonicality Guarantees

CIR was initially introduced as a minimal domain-specific language for linear programs, defined strictly as a sequence of declarations—one objective, followed by a (possibly empty) list of constraint declarations. The CIR grammar and ordering rules enforce a deterministic mapping from a problem’s semantic structure to a single canonical string, eliminating ambiguity due to commutativity, reordering, or variable renaming. The grammar is:

Top-level:

CIR ::= ObjectiveDecl ConstraintDecl*

Objective Declaration:

ObjectiveDecl ::= "maximize" LinearExpr | "minimize" LinearExpr

Constraint Declaration:

ConstraintDecl ::= LinearExpr CompOp Constant, with CompOp ∈ {“≤”, “≥”} and Constant ∈ ℤ

Linear Expression Decomposition:

LinearExpr ::= Term ('+' Term)*
Term ::= Coeff Var
Coeff ∈ ℤ, Var ∈ {x, y, z, w, x₁, …}

Ordering Rules:

Objective precedes constraints
Constraints follow a fixed archetype ordering (lower-bound, upper-bound, xy-type, xby-type, sum-type, arbitrary linear, ratio-type)
Terms within each constraint are lexicographically ordered by variable
Within identically-shaped constraints, “≤” precedes “≥”
Variable names are drawn from a fixed vocabulary

This canonicalization is intended to ensure that every abstract LP maps to exactly one CIR representation, regardless of how its terms and constraints are ordered or how its variables are named (Jang, 2022).
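The ordering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from Jang (2022): the data layout, the function names, and the three-way `archetype_rank` (lower bound, upper bound, everything else) are assumptions made for the example, and variable renaming is not handled. Coefficients are always written out, following the `Term ::= Coeff Var` production.

```python
from typing import Dict, List, Tuple

# A constraint is (coefficients by variable, comparison operator, integer constant).
Constraint = Tuple[Dict[str, int], str, int]


def render_expr(coeffs: Dict[str, int]) -> str:
    """Render a linear expression with terms sorted lexicographically by variable."""
    return " + ".join(f"{coeffs[v]} {v}" for v in sorted(coeffs))


def archetype_rank(con: Constraint) -> int:
    """Hypothetical three-way stand-in for the full archetype ordering."""
    coeffs, op, _ = con
    if len(coeffs) == 1:
        return 0 if op == "≥" else 1  # lower bounds first, then upper bounds
    return 2  # everything else is treated as "arbitrary linear"


def sort_key(con: Constraint):
    coeffs, op, const = con
    shape = tuple(sorted(coeffs.items()))
    # Identically shaped constraints: "≤" before "≥", then by constant.
    return (archetype_rank(con), shape, 0 if op == "≤" else 1, const)


def canonicalize(sense: str, objective: Dict[str, int],
                 constraints: List[Constraint]) -> List[str]:
    """Return the CIR declarations: objective first, then ordered constraints."""
    lines = [f"{sense} {render_expr(objective)}"]
    for coeffs, op, const in sorted(constraints, key=sort_key):
        lines.append(f"{render_expr(coeffs)} {op} {const}")
    return lines


# Two differently ordered descriptions of the same LP.
a = canonicalize("maximize", {"x": 3, "y": 4},
                 [({"x": 1, "y": 2}, "≤", 50), ({"x": 1}, "≥", 0), ({"y": 1}, "≥", 0)])
b = canonicalize("maximize", {"y": 4, "x": 3},
                 [({"y": 1}, "≥", 0), ({"y": 2, "x": 1}, "≤", 50), ({"x": 1}, "≥", 0)])
print("\n".join(a))
```

Both input orderings yield the same list of declarations, which is the property the ordering rules are meant to provide.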

2. CIR Representation in Mathematical and LaTeX Notation

While CIR is a textual format for NLP-to-mathematical program pipelines, its semantics are often displayed with conventional mathematical notation:

Variables: $x \in \mathbb{R}^n$, written in CIR as “x₁, x₂, …, xₙ” or “x, y, z, w”

Objective:

CIR: maximize 3 x + 4 y
LaTeX: $\max_{x} \; c^\top x$

Constraints:

CIR: 1 x + 2 y ≤ 50
LaTeX: $A x \leq b$, where $A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m$

A coefficient of 1 may be left implicit (“1 x” → “x”, as in the bound x ≥ 0); all other coefficients are written out. The translation from CIR to matrix-based solver inputs such as AMPL or JSON is lossless (Jang, 2022).
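As a rough illustration of the CIR-to-matrix step, the sketch below parses CIR declarations into an objective vector $c$ and a system $A x \leq b$. The function names and the choice to flip every “≥” row into “≤” form are assumptions for this example, not part of the cited work; the parser accepts both explicit and implicit unit coefficients.

```python
from typing import Dict, List


def parse_expr(expr: str) -> Dict[str, int]:
    """Parse 'c1 v1 + c2 v2 + ...' into {variable: coefficient}."""
    coeffs: Dict[str, int] = {}
    for term in expr.split(" + "):
        parts = term.split()
        if len(parts) == 1:          # bare variable, implicit coefficient 1
            coeffs[parts[0]] = 1
        else:
            coeffs[parts[1]] = int(parts[0])
    return coeffs


def cir_to_matrix(lines: List[str]):
    """Return (sense, variables, c, A, b) with every constraint as A x <= b."""
    sense, obj_text = lines[0].split(" ", 1)
    objective = parse_expr(obj_text)
    rows = []
    for line in lines[1:]:
        op = "≤" if "≤" in line else "≥"
        lhs, rhs = line.split(f" {op} ")
        rows.append((parse_expr(lhs), op, int(rhs)))
    variables = sorted(set(objective) | {v for coeffs, _, _ in rows for v in coeffs})
    c = [objective.get(v, 0) for v in variables]
    A, b = [], []
    for coeffs, op, const in rows:
        sign = 1 if op == "≤" else -1   # flip ">=" rows into "<=" form
        A.append([sign * coeffs.get(v, 0) for v in variables])
        b.append(sign * const)
    return sense, variables, c, A, b


sense, variables, c, A, b = cir_to_matrix(
    ["maximize 3 x + 4 y", "x ≥ 0", "y ≥ 0", "1 x + 2 y ≤ 50"])
print(sense, variables, c, A, b)
```

The resulting lists can be serialized to JSON directly or handed to a matrix-based solver interface.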

3. Rule-to-Constraint Schema and Semantic Layering

Advances in CIR have generalized its scope from just linear programs to a flexible structured schema suitable for logic-heavy, mixed-integer, and quadratic programs. In this schema, a problem description $d$ is transformed into:

A set of operational rules $\mathcal{R}(d) = \{r_1,...,r_K\}$ and entities/parameters $\mathcal{E}(d)$
A CIR instance $C(d) = (\mathcal{E}(d), \mathcal{A}(d))$, where $\mathcal{A}(d) = \{a_1, \ldots, a_L\}$ is a multiset of instantiated templates and each $a_\ell = (A_\ell, k_\ell, p_\ell)$ records:
- Core intent $A_\ell$ (e.g., NonOverlap, Precedence, Capacity)
- Source rule $k_\ell$ from the input
- Modeling paradigm $p_\ell$ (time-indexed, continuous-time, event-based, arc-flow, etc.)

The CIR-to-model mapping takes the union across all paradigm-specific constraint templates:

$M(d) = T(C(d)) = \bigcup_{\ell=1}^{L} \mathcal{C}_{A_\ell, p_\ell}$

where $\mathcal{C}_{A_\ell, p_\ell}$ is the full set of mathematical constraints for intent $A_\ell$ under paradigm $p_\ell$. This explicit intermediate abstraction ensures that all feasible solutions to $M(d)$ satisfy the operational rules $\mathcal{R}(d)$ by construction (Lyu et al., 2 Feb 2026).
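The schema can be made concrete with a small sketch. Everything named here (`Instance`, `LIBRARY`, `translate`, and the constraint strings) is hypothetical and chosen for illustration; the two precedence templates use textbook continuous-time and time-indexed forms, not the templates from the cited paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass(frozen=True)
class Instance:
    """One instantiated template a = (A, k, p) plus its bound parameters."""
    intent: str                # A: core intent, e.g. "Precedence"
    rule_id: int               # k: index of the source rule in R(d)
    paradigm: str              # p: modeling paradigm, e.g. "continuous-time"
    params: Tuple[str, ...]    # entities bound from E(d)


# Hypothetical template library: (intent, paradigm) -> constraint generator.
LIBRARY: Dict[Tuple[str, str], Callable[[Tuple[str, ...]], Set[str]]] = {
    ("Precedence", "continuous-time"):
        lambda p: {f"S[{p[1]}] >= S[{p[0]}] + dur[{p[0]}]"},
    ("Precedence", "time-indexed"):
        lambda p: {f"sum(t*x[{p[1]},t]) >= sum(t*x[{p[0]},t]) + dur[{p[0]}]"},
}


def translate(instances: List[Instance]) -> Set[str]:
    """M(d) = T(C(d)): union of the constraint sets of all instances."""
    model: Set[str] = set()
    for a in instances:
        model |= LIBRARY[(a.intent, a.paradigm)](a.params)
    return model


model = translate([
    Instance("Precedence", 0, "continuous-time", ("a", "b")),
    Instance("Precedence", 1, "continuous-time", ("b", "c")),
])
print(sorted(model))
```

Because each instance keeps its `rule_id`, every emitted constraint can be traced back to the rule that produced it, which is what makes the "by construction" argument checkable.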

4. Constraint Archetypes and Modeling Paradigms

CIR organizes operational rules into a compact set of archetypes, each with reference templates in multiple modeling paradigms. Key archetypes include:

Archetype	Example CIR Expression	LaTeX/Mathematical Formulation
Assignment	[formula missing]	[formula missing]
Precedence	[formula missing]	[formula missing]
Capacity	[formula missing]	[formula missing]
Non-overlap	Disjunction in start/completion or time-index variables	Various forms, e.g., [formula missing]
Time Windows	[formula missing]	[formula missing]
Integer-Multiple	[formula missing]	[formula missing] with constraints
Logical/Indicator	[formula missing]	[formula missing]

Each archetype is instantiated for several paradigms: time-indexed MILP, continuous-time MILP, event-based, and arc-flow models. The CIR library maps each abstract intent to paradigm-specific templates:

$(A, p) \mapsto \mathcal{C}_{A, p}$

(Lyu et al., 2 Feb 2026).

5. CIR-Driven NLP Pipelines: Tagging, Embedding, and Multi-Agent Architectures

The text-to-CIR mapping employs a two-stage (and, in recent work, multi-agent) approach:

BART with Entity Tags (Jang, 2022):

Word-piece tokens and entity tags (identifying coefficients, variables, comparators, etc.) are embedded:

$e_i = e^{\text{tok}}_i + \lambda \, e^{\text{tag}}_i$

with tag embeddings weighted by a scaling factor $\lambda$.

The BiBART encoder processes the summed embeddings to produce CIR outputs, which remain canonical due to the grammar constraints.
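The weighted sum of token and tag embeddings can be sketched with plain Python lists. The function name `combine`, the toy 2-dimensional vectors, and the tag assignments are assumptions for illustration; a real model would use learned embedding tables inside the encoder.

```python
from typing import List

Vector = List[float]


def combine(token_vecs: List[Vector], tag_vecs: List[Vector], lam: float) -> List[Vector]:
    """Per-position sum of the token embedding and the lambda-scaled tag embedding."""
    if len(token_vecs) != len(tag_vecs):
        raise ValueError("one tag embedding is needed per token")
    return [[t + lam * g for t, g in zip(tok, tag)]
            for tok, tag in zip(token_vecs, tag_vecs)]


# Toy embeddings for two tokens: "3" (tagged as a coefficient) and "x" (tagged as a variable).
tokens = [[0.5, -1.0], [1.0, 2.0]]
tags = [[1.0, 0.0], [0.0, 1.0]]
summed = combine(tokens, tags, lam=5.0)
print(summed)  # [[5.5, -1.0], [1.0, 7.0]]
```

With a large scaling factor the tag component dominates the sum, so positions with the same entity tag are pushed toward each other regardless of surface token.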

Multi-Agent Rule-to-Constraint (R2C) Pipeline (Lyu et al., 2 Feb 2026):

Extractor tags and extracts rules and entities from natural language.
Mapper retrieves CIR intent templates per rule and binds parameters.
Formalizer composes the full model, emitting both mathematical and solver-executable formulations.
Checker verifies structural and semantic soundness end-to-end.

Retrieval of CIR templates is based on domain tags and semantic similarity, supported by in-memory and FAISS vector indices.
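A minimal in-memory version of tag-filtered, similarity-ranked retrieval might look like the following. The index contents, the 3-dimensional embeddings, and the function names are made up for the example; a FAISS index would replace the linear scan in a larger library.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical index entries: (template name, domain tags, embedding vector).
Entry = Tuple[str, frozenset, Sequence[float]]

INDEX: List[Entry] = [
    ("Precedence", frozenset({"scheduling"}), (0.9, 0.1, 0.0)),
    ("NonOverlap", frozenset({"scheduling"}), (0.1, 0.9, 0.0)),
    ("Capacity", frozenset({"scheduling", "routing"}), (0.0, 0.2, 0.9)),
]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float], domain: str, k: int = 1) -> List[str]:
    """Filter by domain tag, then rank the survivors by cosine similarity."""
    candidates = [e for e in INDEX if domain in e[1]]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e[2]), reverse=True)
    return [name for name, _, _ in ranked[:k]]


print(retrieve((0.2, 0.8, 0.1), "scheduling"))
```

Filtering on the domain tag first keeps templates from unrelated domains out of the ranking, even when their embeddings happen to be close to the query.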

6. Illustrative Examples

Linear Programming Canonicalization (Jang, 2022):

NL statement: "A factory makes A (\$3) and B (\$4); 1 and 2 machine-hours per unit; ≤50 machine-hours; decide how many to produce; nonnegativity."
CIR:

maximize 3 x + 4 y
x ≥ 0
y ≥ 0
1 x + 2 y ≤ 50

Rich Rule Extraction and Paradigm Selection (Lyu et al., 2 Feb 2026):

NL statement: "Each job must finish before any downstream job begins, and at most one job can use Machine 1 at a time. Minimize makespan."
Extractor maps precedence and non-overlap to intents.
Mapper chooses continuous-time paradigm, instantiates correct templates.
Formalizer emits constraints in both math and Gurobi-Python code.
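For reference, one standard continuous-time formulation of this statement is written out here. It is a textbook disjunctive (big-M) model given as an assumption about what the Formalizer's mathematical output could look like, not a reproduction of it. The symbols are introduced for this example only: $J$ is the set of jobs, $P$ the set of precedence pairs, $J_1 \subseteq J$ the jobs that use Machine 1, $\delta_j$ the duration of job $j$, $S_j$ its start time, $y_{ij}$ a binary ordering variable, and $M$ a sufficiently large constant.

```latex
\begin{align}
\min \quad & C_{\max} \\
\text{s.t.} \quad
  & S_j \ge S_i + \delta_i && \forall (i, j) \in P \\
  & S_j \ge S_i + \delta_i - M (1 - y_{ij}) && \forall i, j \in J_1,\ i < j \\
  & S_i \ge S_j + \delta_j - M y_{ij} && \forall i, j \in J_1,\ i < j \\
  & C_{\max} \ge S_j + \delta_j && \forall j \in J \\
  & S_j \ge 0,\quad y_{ij} \in \{0, 1\}
\end{align}
```

The first constraint family realizes the Precedence intent, the next two realize NonOverlap on Machine 1 (with $y_{ij} = 1$ meaning job $i$ runs before job $j$), and the fourth defines the makespan.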

7. Empirical Performance and Practical Impact

Reported results on two benchmarks, LPWP and ORCOpt-Bench, indicate that the CIR layer contributes substantially to formulation accuracy:

System	CIR Layer	Key Accuracy (%)	Benchmark
BART-large, λ=5	Yes	88.46 (declaration accuracy)	LPWP validation (Jang, 2022)
R2C (7B LLM)	Yes	47.2 (Accuracy Rate, AR)	ORCOpt-Bench (Lyu et al., 2 Feb 2026)
GPT-5	No	39.8 (AR)	ORCOpt-Bench (Lyu et al., 2 Feb 2026)
R2C ablation (no CIR)	No	31.6 (AR)	ORCOpt-Bench (Lyu et al., 2 Feb 2026)
R2C+reflection	Yes	54.0 (AR)	ORCOpt-Bench (Lyu et al., 2 Feb 2026)

Key findings include:

Tag- and entity-aware embeddings, especially with high λ, notably increase mapping accuracy in classic LP settings (Jang, 2022).
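Introducing CIR in multi-agent frameworks yields a substantial accuracy gain over both proprietary and open baselines: on ORCOpt-Bench, R2C reaches 47.2% AR, against 39.8% for GPT-5 and 31.6% for the R2C ablation without CIR.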
CIR enables training-agnostic, retrieval-based formulations that remain competitive with tuned LLMs on industry-scale tasks.
The reflection enhancement in R2C raises AR further, from 47.2% to 54.0%, improving robustness in complex, compositional problem settings (Lyu et al., 2 Feb 2026).

In summary, CIR establishes an extensible, verifiable, and semantically sound layer between natural-language problem specification and executable optimization code. By encoding intents as archetypes with paradigm-specific instantiations, CIR lets both symbolic and data-driven models translate diverse operational rules into solver-ready programs in a way that can be traced and checked (Jang, 2022; Lyu et al., 2 Feb 2026).

References (2)

Jang (2022). Tag Embedding and Well-defined Intermediate Representation improve Auto-Formulation of Problem Description.

Lyu et al. (2026). Canonical Intermediate Representation for LLM-based optimization problem formulation and code generation.
