CodeMetaAgent: Dual AI for Simulation & Software
- CodeMetaAgent (CMA) is a dual-mode framework that employs AI-driven logic for simulation model synthesis and metamorphic testing for LLM-generated software.
- It integrates structured scientific knowledge representation (SSKR) with classical planning and metamorphic relations to refine task specifications and generate robust code artifacts.
- The system enhances code quality through traceable, high-coverage pipelines, delivering reproducible, performance-optimized code for both scientific simulation and automated software engineering.
CodeMetaAgent (CMA) refers to two distinct but related agent designs in recent computational science and AI-driven software engineering. In the context of scientific simulation, CMA is an AI-driven logic and code generation agent at the core of the MAGCC (Machine Assisted Generation, Calibration, and Comparison) framework. Separately, in LLM-driven software automation, CMA designates a metamorphic relation-guided agent architecture for refining task specifications and generating code and robust test suites. Both instances operationalize knowledge-centric, end-to-end automation pipelines, but differ fundamentally in their core methodologies, target domains, and integration with knowledge representation or specification mutation constructs (Cockrell et al., 2022, Akhond et al., 23 Nov 2025).
1. Core Definitions and Roles
CMA in the MAGCC framework is an AI planning and code generation agent that bridges structured domain knowledge, represented via the Structured Scientific Knowledge Representation (SSKR), to formal simulation model specifications and subsequently to executable high-performance code. Its “CodeMetaAgent” alias encapsulates its primary purpose: metadata-driven, end-to-end, traceable code synthesis from domain knowledge (Cockrell et al., 2022).
In contrast, CMA in the context of metamorphic relation-guided LLM agents is an LLM-centric software engineering assistant that employs Metamorphic Relations (MRs) to systematically mutate, refine, and validate both human task specifications and test cases. This architecture targets ambiguity reduction and coverage maximization in LLM-generated software artefacts (Akhond et al., 23 Nov 2025).
2. High-Level Architectures
MAGCC CodeMetaAgent
The CMA within MAGCC orchestrates three core modules:
- SSKR-Interface Module: Extracts and normalizes five structured subcomponents of SSKR ([MRM], [MRS], [DDT], [MFM], [MKM]), translating this tabular or graph-formalized knowledge into an internally uniform logical representation.
- Logical Reasoning/Planning Engine: Utilizes the Maude rewriting-logic system, applying domain-to-model and model-to-code mapping rewrite rules via breadth-first state space search to derive one or more valid mathematical model specifications (ODEs, PDEs, ABMs, Petri nets).
- Code Generation Module: Translates finalized model specifications and execution metadata into compilable source code and corresponding build systems, invoking parameterized templates for environments such as CUDA kernels, C++/Python PDE solvers, or BioSwarm agent-based simulation stubs.
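To make the template-driven generation step concrete, the following minimal Python sketch expands a finalized specification through a parameterized template. It is illustrative only: the template text, the field names (provenance_id, rhs_expr, and so on), and the spec dictionary layout are assumptions, not MAGCC's actual templates.

```python
from string import Template

# Hypothetical parameterized template for a C++ ODE-solver stub; MAGCC's real
# templates, field names, and spec layout are not given in the source.
ODE_STEP_TEMPLATE = Template("""\
// Auto-generated; SSKR provenance: ${provenance_id}
void step(double* state, double dt) {
    // rule: ${rule_name}
    state[${target_index}] += dt * (${rhs_expr});
}
""")

def generate_solver_stub(spec: dict) -> str:
    """Expand a finalized model specification into compilable source text,
    carrying the MKM provenance reference into the emitted code."""
    return ODE_STEP_TEMPLATE.substitute(
        provenance_id=spec["mkm_ref"],
        rule_name=spec["rule_name"],
        target_index=spec["target_index"],
        rhs_expr=spec["rhs"],            # e.g. "-k * state[0]" for first-order decay
    )

print(generate_solver_stub({"mkm_ref": "MKM-042", "rule_name": "decay",
                            "target_index": 0, "rhs": "-k * state[0]"}))
```

The same expansion shape applies to the other target environments the module supports, such as CUDA kernels or BioSwarm stubs.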
Metamorphic Specification Mutation Agent
The LLM-driven CMA comprises four cooperative modules, wired together in the sketch that follows this list:
- Mutator: Applies a suite of MRs to task descriptions or test sets, generating semantically constrained variants (via transformations such as paraphrase, negation, translation, and algebraic manipulation).
- Reviewer: Validates mutated specifications or test cases using semantic similarity (SBERT), enforcing a similarity threshold (e.g., 0.8) or behavioral invariants on test executions.
- Generator: Instantiates the target LLM (e.g., GPT-4o, Mistral-Large, GPT-OSS, Qwen3-Coder) to synthesize code, patches, or test suites for each validated variant.
- Evaluator: Measures code artefact correctness (Pass@k metrics) and test-case coverage (via tools such as coverage.py), reporting correctness and branch-coverage gains.
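A minimal Python sketch of how these four modules compose is given below. The SBERT encoder choice and the helper names (mutate, generate, evaluate) are assumptions for illustration; only the SBERT similarity gate and the 0.8 default threshold come from the description above.

```python
from sentence_transformers import SentenceTransformer, util

_sbert = SentenceTransformer("all-MiniLM-L6-v2")  # encoder choice is an assumption

def reviewer_similarity(original: str, mutated: str) -> float:
    """Reviewer gate: SBERT cosine similarity between the original and
    mutated task descriptions."""
    emb = _sbert.encode([original, mutated], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def cma_pipeline(d0, relations, mutate, generate, evaluate, tau=0.8):
    """Mutator -> Reviewer -> Generator -> Evaluator for one task d0.
    `mutate`, `generate`, and `evaluate` stand in for the LLM-backed modules."""
    results = []
    for R in relations:
        d_prime = mutate(d0, R)                      # Mutator: apply one MR
        if reviewer_similarity(d0, d_prime) >= tau:  # Reviewer: similarity >= 0.8
            artefact = generate(d_prime)             # Generator: LLM synthesis
            results.append(evaluate(artefact))       # Evaluator: Pass@k, coverage
    return results
```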
3. Formal Knowledge Representations and Specification Mutation
Structured Scientific Knowledge Representation (SSKR) [MAGCC]
SSKR provides a five-tuple schema (a minimal in-memory sketch follows the list):
- MRM (Model Rule Matrix): Encodes variable-rule relationships as a matrix, with entries indicating coefficients, parameter sets, or forbidden interactions.
- MRS (Model Rule Structure): Associates each rule with functional forms, expressed in MathML, capturing kinetic alternatives or parameterizations.
- DDT (Discretization, Dimensionality, Topology): Specifies computational mesh, dimensions, and boundary/topological constraints.
- MFM (Model Flow Matrix): Defines the scheduling graph for model evaluation—event loops and dependency sequencing.
- MKM (Model Knowledge Matrix): Records full provenance trails for model statements, mapping each component to extracted scientific facts or literature references.
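As a rough illustration of how the five-tuple might be held in memory, consider the Python dataclass below. The field types and the gut-mucus-flavored example values are hypothetical, since the source describes SSKR only at the schema level.

```python
from dataclasses import dataclass

@dataclass
class SSKR:
    """Illustrative in-memory rendering of the five-tuple schema."""
    mrm: dict   # (variable, rule) -> coefficient, parameter set, or forbidden flag
    mrs: dict   # rule -> functional form (MathML string)
    ddt: dict   # mesh, dimensionality, boundary/topological constraints
    mfm: list   # scheduling edges: (rule, depends_on) pairs
    mkm: dict   # model statement -> provenance (extracted fact / reference)

sskr = SSKR(
    mrm={("mucus", "secretion"): 1.0},
    mrs={"secretion": "<apply><times/><ci>k</ci><ci>goblet</ci></apply>"},
    ddt={"dims": 2, "mesh": "regular", "boundary": "periodic"},
    mfm=[("secretion", "init")],
    mkm={"secretion": "Cockrell et al., 2022"},
)
```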
Metamorphic Relations (MRs) [LLM CMA]
MRs are transformations of a specification or test case that preserve semantics: a relation $R$ maps an original description $d_0$ to a variant $d' = R(d_0)$ with $\mathrm{sem}(d') \equiv \mathrm{sem}(d_0)$. MRs span:
- Linguistic: Negation (MR1), Translation (MR2), Step-wise Decomposition (MR3), Paraphrase (MR4)
- Structural/Math: Variable Swapping (MR5), Input Permutation (MR6), Algebraic/Distributive (MR7), Domain Subsets (MR8), Incremental shifts (MR9)
MR application is tightly coupled with reviewer validation to retain only high-fidelity variants ($\mathrm{sim}(d_0, d') \geq \tau$, e.g., $\tau = 0.8$, for descriptions; behavioral invariance for test cases).
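For test cases, behavioral invariance means a mutated test must leave observed behavior unchanged. A minimal sketch of one structural operator, input permutation (MR6), is shown below; the operator code and the max example are illustrative, not the paper's implementation.

```python
import itertools

def mr6_input_permutation(inputs, func):
    """Structural MR (MR6, input permutation): for an order-insensitive
    function, every permutation of the inputs must produce the same output.
    A surviving mutated test therefore exercises new orderings for free."""
    reference = func(*inputs)
    for perm in itertools.permutations(inputs):
        assert func(*perm) == reference, f"MR6 violated for input order {perm}"
    return reference

# max() is order-insensitive, so all permuted variants of the test must agree.
mr6_input_permutation((3, 1, 2), lambda *xs: max(xs))
```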
4. Reasoning, Planning, and Generation Algorithms
MAGCC CMA Planning
The logical engine frames the SSKR-to-specification mapping as a classical planning problem. Let $s_0$ denote the initialized ground atoms for the model, $G$ the goal (a fully assembled target model specification), and $A$ the set of rewriting actions. A generic algorithm (a Python analogue follows this list):
- Initialize a frontier queue $Q$ with $s_0$.
- While $Q$ is nonempty: pop a state $s$; if $s$ satisfies $G$, append its rule sequence to the set of plans; otherwise, for each applicable action $a \in A$, enqueue $a(s)$.
- Rule sequences are fully traceable, mapping each code artefact to SSKR provenance.
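The actual engine runs in Maude's rewriting logic, but the search skeleton can be mirrored in Python. This is a sketch under assumptions: states are hashable, and `actions(state)` enumerates (rule, successor) pairs.

```python
from collections import deque

def bfs_plan(s0, goal_satisfied, actions):
    """Breadth-first search over rewrite states, mirroring the generic
    algorithm above. `actions(state)` yields (rule_name, successor_state)
    pairs; each returned plan is the rule sequence that assembled one valid
    model spec, keeping every artefact traceable to its SSKR provenance."""
    plans = []
    queue = deque([(s0, [])])          # frontier of (state, rule trace) pairs
    seen = {s0}
    while queue:
        state, trace = queue.popleft()
        if goal_satisfied(state):
            plans.append(trace)        # one fully assembled specification
            continue
        for rule, successor in actions(state):
            if successor not in seen:
                seen.add(successor)
                queue.append((successor, trace + [rule]))
    return plans
```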
Metamorphic Specification Mutation (LLM CMA)
The mutation pipeline is formalized as:
```python
def mutate_specification(d0, MRs, K, tau):
    """Generate reviewer-validated mutated variants of task description d0."""
    D_valid = set()
    for R in MRs:                     # each metamorphic relation
        for _ in range(K):            # at most K mutation attempts per MR
            d_candidate = LLM_mutator(buildPrompt(d0, R))
            score = Reviewer.similarity(d0, d_candidate)  # SBERT similarity
            if score >= tau:          # keep only high-fidelity variants
                D_valid.add(d_candidate)
                break
    return D_valid
```
Once $D_{\mathrm{valid}}$ is obtained, each variant is fed through code/test-case generation, followed by evaluator-module execution.
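For reference, Pass@k figures of the kind the Evaluator reports are conventionally computed with the unbiased estimator of Chen et al. (2021); the paper does not spell out its estimator, so this standard form is an assumption.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples, drawn from n generations of which c are correct,
    passes the tests. pass_at_k(n, c, 1) reduces to the Pass@1 rate c/n."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```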
5. Quantitative Performance, Coverage, and Examples
Empirical Results (LLM CMA)
Empirical evaluation over HumanEval-Pro, MBPP-Pro, and SWE-Bench_Lite datasets using GPT-4o, Mistral-Large, GPT-OSS, and Qwen3-Coder demonstrates:
- Code Generation Accuracy: Up to a +17 percentage-point gain in Pass@1 (MBPP-Pro on GPT-OSS and HumanEval-Pro on Mistral-Large), with gains observed across model/dataset pairs.
- Branch Coverage: MR-augmented test generation raises coverage to near saturation, e.g., MBPP-Pro (Oracle: 99.36%, CMA: 99.81% with GPT-4o); a measurement sketch follows this list.
- Test-Case Correctness: >75% average correctness across models, with Qwen3 highest at 93.14% (MBPP-Pro).
- Bug Fixing: On SWE-Bench_Lite (GPT-OSS, agentless), patch resolution increased from 30.7% to 38.7%.
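Branch-coverage numbers like those above are what coverage.py reports when branch measurement is enabled. A minimal sketch, with the helper name assumed:

```python
import coverage

def branch_coverage(run_tests) -> float:
    """Run a test suite under coverage.py with branch measurement enabled and
    return the total coverage percentage, as in the Evaluator module."""
    cov = coverage.Coverage(branch=True)
    cov.start()
    run_tests()                # execute the (possibly MR-mutated) test suite
    cov.stop()
    return cov.report()        # report() returns the total percentage
```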
Illustrative Examples
- Code Generation: MR3-driven step-wise decomposition resolves ambiguities in prompt-to-code translation, correcting failure cases in code synthesis (e.g., octagonal number summation).
- Test-Case Mutation: MR-guided permutations, variable swapping, and incremental modifications of oracle tests closed prior coverage gaps (e.g., complex palindrome generation).
MAGCC Workflow
Example from gut-mucus stratification: natural-language assertions populate the SSKR, from which an ODE model is generated and then compiled to CUDA-based GPU code. The planning trace and code generation are fully reproducible and provenance-preserving.
6. Limitations and Prospective Directions
Limitations
- LLM-driven CMA incurs heavy token and compute costs due to repeated Mutator module calls (Akhond et al., 23 Nov 2025).
- Reviewer subsystems utilize only semantic similarity and behavioral consistency without formal program verification.
- Evaluation to date is constrained to Python-centric benchmarks; cross-language performance is uncharacterized.
- Proprietary LLM evaluations are subject to API versioning and result non-determinism.
Future Directions
- Development of rule-based or symbolic MR operators to reduce LLM overhead.
- Adaptive MR selection for maximizing per-task utility.
- Multi-agent frameworks decoupling MR generation, synthesis, and verification agents.
- Expansion to SE tasks beyond code/test synthesis (e.g., documentation, performance regression).
- Standardization of “MRbench” for comparative evaluation.
7. Contextual Significance and Cross-Domain Relevance
CodeMetaAgent, in both scientific modeling and software engineering, exemplifies traceable, knowledge-centered automation pipelines. In MAGCC, it enables systematic conversion of formalized, ontologically annotated scientific knowledge into high-performance simulation code while preserving full provenance. In LLM-driven specification mutation, it proactively exploits semantic-preserving transformations to confront ambiguity, enhance correctness and coverage, and mitigate LLM brittleness, positioning MRs not as mere test oracles but as first-class operators in program synthesis (Cockrell et al., 2022, Akhond et al., 23 Nov 2025). This dual lineage illustrates convergent trends toward explainable, robust, and provenance-aware automation across scientific knowledge management and AI-augmented software engineering.