iEcoreGen: Automated Code & Model Generation
- iEcoreGen is a system that automates code and instance model generation using advanced E-generalization and LLM-based techniques.
- It employs on-the-fly weight decomposition with cache-based optimization to efficiently synthesize minimal-weight generalizers.
- The framework integrates symbolic computation with LLM-driven code completion and model synthesis, enhancing both automation and performance.
iEcoreGen encompasses a set of algorithmic frameworks and toolchains for automating code and instance model generation in model-based engineering workflows. The term refers to two distinct but thematically related systems: (1) a high-efficiency algorithm for E-generalization over equational theories, and (2) a suite of LLM-augmented EMF code and instance model generators that leverage structured prompting and strict schema mediation. The core innovations across these incarnations are algorithmic pipeline optimization, prompt-template engineering, and hybrid integration of symbolic and neural code synthesis. The following sections enumerate and analyze the fundamental principles, architecture, empirical performance, and limitations of iEcoreGen in both the equational and model-driven contexts.
1. E-Generalization Algorithmic Foundation
E-generalization generates the set of common generalizations of ground terms with respect to a given equational theory $E$, making it a key primitive in automated reasoning and program synthesis (Burghardt, 2017). For a fixed signature $\Sigma$, $E$ consists of confluent, terminating term-rewrite rules for arithmetic operators, yielding ground terms modulo $E$. The algebra of values is the quotient $\mathcal{T}_\Sigma / =_E$. The E-generalization task computes all terms $t$ with $t\sigma_i =_E t_i$ for given substitutions $\sigma_1, \dots, \sigma_n$ and goal ground terms $t_1, \dots, t_n$.
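As a concrete illustration of this definition, a brute-force checker can verify whether a candidate term is a common E-generalizer of given ground values. The term encoding and helper names below are illustrative only, not the actual implementation's API:

```python
# Toy sketch: terms are nested tuples such as ("mul", "x", "x");
# evaluation reduces a term to its integer normal form modulo the
# arithmetic theory E.

def eval_term(term, subst):
    if isinstance(term, str):               # variable
        return subst[term]
    op, *args = term
    if op == "const":
        return args[0]
    vals = [eval_term(a, subst) for a in args]
    return vals[0] + vals[1] if op == "add" else vals[0] * vals[1]

def is_common_generalizer(term, substs, goals):
    # t generalizes the goals iff t·sigma_i =_E t_i for every i, i.e.
    # evaluating t under each substitution hits the goal's normal form.
    return all(eval_term(term, s) == g for s, g in zip(substs, goals))

# x*x maps x=2 -> 4 and x=3 -> 9, so it generalizes the ground terms 4 and 9.
print(is_common_generalizer(("mul", "x", "x"),
                            [{"x": 2}, {"x": 3}], [4, 9]))  # True
```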
2. From Grammar-Based to Weight-Decomposition Enumeration
The original approach represented equivalence classes with nondeterministic regular tree grammars, performing grammar intersection, lifting, and minimal-weight enumeration via Knuth's algorithm and automaton constructions. However, the combinatorial explosion rendered this infeasible for practical input sizes: the number of alternatives in intersection grammars grows beyond tractable bounds for as few as three input values (Burghardt, 2017).
iEcoreGen supersedes this by eschewing explicit grammar construction. Instead, it generates weight-decomposition lists on the fly (tuples pairing a term-forming operation with argument weights), using a binary heap prioritized by total weight and operator rank. Terms are synthesized directly from decomposition lists and evaluated to normal forms under the input substitutions. A value-pair cache maps each tuple of normal-form values to its minimal-weight generating term; because weights are enumerated in nondecreasing order, the first occurrence of a value tuple is guaranteed to come from a minimal-weight generalizer.
Algorithmic properties:
- No duplication in decomposition lists.
- Nondecreasing weights in enumeration.
- Simulated grammar operations (intersection, minimization, emptiness check, normalization) by direct term-level computation and cache lookups.
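The enumeration loop above can be sketched in a few lines: a heap ordered by total weight pops candidate terms, and a value-pair cache keeps only the first (hence minimal-weight) term per tuple of normal-form values. This is an illustrative reimplementation, not the C code's API; substitutions are modeled simply as values for a single variable, and the operator set is fixed to `add`/`mul`:

```python
import heapq
from itertools import count

def minimal_generalizers(substs, goals, max_weight=5):
    tick = count()                          # tie-breaker for heap ordering
    # atoms: the variable plus a few constants, weight 1 each
    heap = [(1, next(tick), ("var",)), (1, next(tick), ("const", 0)),
            (1, next(tick), ("const", 1)), (1, next(tick), ("const", 2))]
    heapq.heapify(heap)
    cache = {}      # value tuple -> (weight, term): first hit is minimal
    popped = []     # previously popped terms, available for combination

    def eval_(t, s):
        if t[0] == "var":
            return s
        if t[0] == "const":
            return t[1]
        left, right = eval_(t[1], s), eval_(t[2], s)
        return left + right if t[0] == "add" else left * right

    while heap:
        w, _, term = heapq.heappop(heap)
        if w > max_weight:
            break
        vals = tuple(eval_(term, s) for s in substs)
        if vals in cache:
            continue                        # duplicate normal form: prune
        cache[vals] = (w, term)
        # decompose: combine the new term with every earlier term under each
        # binary operator; weights add, so enumeration stays nondecreasing
        for (w2, t2) in popped + [(w, term)]:
            for op in ("add", "mul"):
                heapq.heappush(heap, (w + w2 + 1, next(tick), (op, term, t2)))
                heapq.heappush(heap, (w + w2 + 1, next(tick), (op, t2, term)))
        popped.append((w, term))
    return cache.get(tuple(goals))

# For x=2 -> 4 and x=3 -> 9 the minimal generalizer is x*x, weight 3.
print(minimal_generalizers([2, 3], [4, 9]))  # (3, ('mul', ('var',), ('var',)))
```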
3. Software Architecture and Optimizations
The C implementation comprises tightly modularized components:
| Module Group | Key Files | Purpose |
|---|---|---|
| core | weightTerm.c, heap.c | Heap management, decomposition lists, term table |
| core | contTab.c, bbt.c | Terms-by-weight lists, value-pair map (balanced trees) |
| kernel | opTab.c, valDefTab.c, parser.c, redex.c | Operator signatures, value tables, parsing, redex tabulation |
| user | UserOpArith.c, UserVal.c, UserWgf.c | Operator definitions, value handlers, weight metrics |
Data structures emphasize compactness (array-backed decomposition lists, reference-based subterm sharing), efficient lookup (two-level hash-trees for the value-pair map), and weight-segmented caches.
Performance is further enhanced by:
- AC(I) pruning for associative, commutative, idempotent operations—restricts combinatorial term generation, quickly discards redundant subtrees.
- Sort-based pruning enforcing many-sorted signatures.
- Immediate normalization and subterm sharing on term construction.
- Data-term lattices for operators involving partitions (e.g., "if" or projection), supporting rapid detection of composite solutions.
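AC(I) pruning can be sketched as canonicalization: flattening associative nesting, sorting commutative argument lists, and collapsing idempotent duplicates, so that equivalent candidates are generated and cached only once. The encoding below is illustrative; the C implementation prunes during term construction rather than after the fact:

```python
def ac_canonical(term, assoc_comm=("add", "mul"), idem=("or",)):
    # Terms are nested tuples, e.g. ("add", "y", ("add", "x", "y")).
    if not isinstance(term, tuple):
        return term
    op, *args = term
    args = [ac_canonical(a) for a in args]
    if op in assoc_comm or op in idem:
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] == op:
                flat.extend(a[1:])          # associativity: flatten nesting
            else:
                flat.append(a)
        flat.sort(key=repr)                 # commutativity: canonical order
        if op in idem:
            deduped = []
            for a in flat:                  # idempotence: x op x = x
                if not deduped or deduped[-1] != a:
                    deduped.append(a)
            flat = deduped
            if len(flat) == 1:
                return flat[0]
        return (op, *flat)
    return (op, *args)

# add(y, add(x, y)) and add(add(y, x), y) collapse to one representative,
# so only one of them ever needs to be evaluated.
a = ac_canonical(("add", "y", ("add", "x", "y")))
b = ac_canonical(("add", ("add", "y", "x"), "y"))
print(a == b)  # True
```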
4. LLM-Hybrid Model and Code Generation Workflows
A second usage of iEcoreGen denotes hybrid code-generation toolchains integrating EMF template generation with LLMs (He et al., 5 Dec 2025, Pan et al., 28 Mar 2025). The pipeline orchestrates four main stages:
- Requirement decomposition via LLMs, extracting per-operation specifications in structured docstrings from natural-language and PlantUML-encoded Ecore models.
- EMF code generation—Java Emitter Templates (JET) generate class files with placeholder methods and docstring specifications.
- LLM code completion, enhanced by AST code compression (reducing context size) and extraction of relevant method signatures from related classes.
- LLM-based code fixing, carrying out compilation, error analysis, and iterative correction until successful compilation or retry budget exhaustion.
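The final stage can be sketched as a bounded retry loop; `compile_fn` and `llm_fix_fn` below are injected stand-ins for the Java compiler and the LLM call, assumed interfaces rather than the pipeline's actual toolchain:

```python
def fix_until_compiles(source, compile_fn, llm_fix_fn, max_retries=3):
    """Iteratively compile, analyze diagnostics, and ask the LLM for a fix
    until the code compiles or the retry budget is exhausted."""
    for attempt in range(max_retries + 1):
        ok, diagnostics = compile_fn(source)
        if ok:
            return source, attempt          # compiled: stop early
        if attempt == max_retries:
            break                           # retry budget exhausted
        source = llm_fix_fn(source, diagnostics)
    return None, max_retries                # give up, report failure

# Toy stand-ins: the "code" compiles once the missing semicolon is appended.
compile_fn = lambda src: (src.endswith(";"), "missing ';'")
llm_fix_fn = lambda src, diag: src + ";"
fixed, attempts = fix_until_compiles("int x = 1", compile_fn, llm_fix_fn)
print(fixed, attempts)  # int x = 1; 1
```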
Formal metrics employed for assessment are:
- pass@k: the expected fraction of problems for which at least one of k candidates, sampled from n generated solutions, passes all tests;
- compilation@k: the fraction of problems for which at least one of k candidates compiles successfully.
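The pass@k metric is conventionally computed with the standard unbiased estimator over n samples of which c pass; this is a sketch of that formula, not the papers' exact evaluation harness:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator: 1 - C(n-c, k) / C(n, k), the probability that
    # at least one of k candidates drawn without replacement from n
    # samples (c of which pass) is a passing one.
    if n - c < k:
        return 1.0    # fewer failing samples than draws: a pass is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 6))  # 0.3 -- pass@1 reduces to c/n
```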
In instance model generation (Pan et al., 28 Mar 2025), iEcoreGen employs a two-step method:
- LLM generates a Conceptual Instance Model (CIM): strict schema JSON, listing instance IDs, class types, attributes, compositions, and references.
- Instance compiler (PyEcore) transforms CIM into guaranteed-valid XMI, decoupling semantic extraction from syntactic constraints.
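A minimal sketch of the strict-schema mediation step, assuming an illustrative CIM layout (the field names `id`, `type`, `attributes`, and `references` are stand-ins for the paper's actual schema): a lightweight validator catches malformed instances and dangling references before the PyEcore-based compiler is invoked.

```python
import json

# Example of the kind of strict-schema JSON the LLM is asked to emit.
CIM_EXAMPLE = """
{"instances": [
  {"id": "lib1", "type": "Library", "attributes": {"name": "Main"},
   "references": {"books": ["b1"]}},
  {"id": "b1", "type": "Book", "attributes": {"title": "Ecore 101"},
   "references": {}}
]}
"""

def validate_cim(text):
    cim = json.loads(text)
    ids = {inst["id"] for inst in cim["instances"]}
    for inst in cim["instances"]:
        for ref, targets in inst["references"].items():
            for target in targets:
                if target not in ids:       # dangling reference: invalid CIM
                    return False, f"{inst['id']}.{ref} -> unknown id {target}"
    return True, "ok"

print(validate_cim(CIM_EXAMPLE))  # (True, 'ok')
```

Decoupling this check from XMI serialization is the point of the two-step design: the LLM only has to produce semantically faithful JSON, while syntactic well-formedness of the XMI is guaranteed by the compiler.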
5. Empirical Results and Ablation Analyses
Tabulated results showcase substantial computational improvement in both domains:
E-Generalization
| Problem Type | #Ex | Prolog Time (s) | iEcoreGen Time (s) | Speedup |
|---|---|---|---|---|
| Seq (5-term Fibonacci) | 3 | 2.3 | 0.02 | 115× |
| Seq (6-term IST'70 A1-102) | 4 | 180 (O/M) | 0.12 | >1,500× |
| Rand (depth 6, 4 vars) | 10 | 14 | 0.25 | 56× |
| Rand (depth 8, 6 vars) | 10 | O/M | 1.9 | — |
iEcoreGen handles weight levels up to ~20, yielding millions of candidate terms per second. Bottlenecks persist for high-arity operators and large value tables; future work is directed at parallelization and incremental sharing.
LLM-Enabled Generation
For code generation (He et al., 5 Dec 2025), iEcoreGen achieves a 5%–52% absolute gain in pass@1 (29% avg) and 11%–36% in pass@3 (22% avg), with compilation@k on par with LLM-only pipelines in functional benchmarks. Ablation studies showed that requirement decomposition, code compression, context extraction, and code fixing are each critical to these gains.
For instance model synthesis (Pan et al., 28 Mar 2025), validity rates reached 100% across all tested LLMs and metamodels, far exceeding direct XMI generation, even with repair (≤59%). Semantic accuracy ranged from 60% (Llama 3.1-8B) to 93% (GPT-4o), and the approach was robust across proprietary and open-source models.
6. Limitations, Threats, and Future Directions
Limitations for the E-generalization implementation include memory and branching factor for very large operators and value tables. The LLM-augmented generators currently handle modest benchmarks, with potential challenges for industrial-scale metamodels and richer specifications.
Future work includes:
- Parallel and GPU-accelerated enumeration (E-generalization).
- Improved context gathering and requirement decomposition for LLM pipelines.
- Extension of instance model synthesis to other MOF languages and richer association structures through RAG techniques and more sophisticated compilers.
- Integration as an official extension for EMF-related infrastructure.
A plausible implication is that strict schema mediation plus symbolic post-processing may be generalized beyond model-driven engineering to other program synthesis domains requiring high semantic and grammatical fidelity.
All code, prompts, and benchmarks for the instance model workflow are publicly available at https://github.com/your-repo/iEcoreGen (Pan et al., 28 Mar 2025).