Self-Improving Agentic Frameworks
- Self-Improving Agentic Frameworks are systems that autonomously generate and verify code transformations, ensuring semantic preservation and performance gains.
- They employ formal semantics, simulation relations, and proof assistants to rigorously confirm that optimized variants meet established correctness criteria.
- These frameworks implement closed-loop protocols that accept only transformations which maintain or enhance key metrics, preventing regressions in functionality.
Formal compiler correctness is the property that a compiler—or, more generally, a code transformation system—preserves specified semantic properties of the source program throughout translation, as established precisely by mathematical methods. The pursuit of formal compiler correctness seeks not only to guarantee functional equivalence between source and target artifacts but also to provide explicit, machine-verifiable assurances grounded in formal semantics, rigorous proof techniques, and automated toolchains. The field encompasses a range of paradigms, from full end-to-end verified compilation down to local, agentic, or self-improving code optimizers that maintain well-typedness, memory safety, and performance objectives across diverse hardware and domain-specific languages.
1. Foundational Definitions and Objectives
The core objective of formal compiler correctness is to establish that, for any program $P$ in the source language and its translation $C(P)$ via compiler $C$, a given semantic property $\varphi$ is preserved:

$$\varphi\bigl(\llbracket P \rrbracket_S\bigr) \;\Longrightarrow\; \varphi\bigl(\llbracket C(P) \rrbracket_T\bigr),$$

where $\llbracket \cdot \rrbracket_S$ and $\llbracket \cdot \rrbracket_T$ denote the formal semantics of the source and target languages, respectively. The strictest form is semantic preservation (or contextual equivalence), though weaker guarantees such as partial correctness, refinement, or preservation of specific invariants (for instance, affine-type constraints in ML operator generation (Zhang et al., 4 Feb 2025)) are also common.
In self-optimizing or agentic systems, correctness may take the form of monotonic empirical improvement subject to regression-free refinement: every code transformation must either preserve or increase some well-defined utility function $U$ (e.g., test pass rate, throughput), and is accepted only if no correctness properties are lost (Zhang et al., 19 Nov 2025, Zhang et al., 4 Feb 2025, Wu et al., 24 Jul 2025).
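As a minimal sketch of this acceptance rule, the predicate below assumes hypothetical `utility` and `correctness_checks` callables standing in for a framework's test harness and static verifiers; none of these names come from the cited systems.

```python
def accept(variant, best, utility, correctness_checks):
    """Regression-free refinement: admit a variant only if every
    correctness check passes and the utility does not decrease."""
    if not all(check(variant) for check in correctness_checks):
        return False  # correctness is a hard constraint, not a trade-off
    return utility(variant) >= utility(best)
```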
2. Formal Methods and Verification Techniques
Modern approaches to formal compiler correctness draw from:
- Operational and denotational semantics: Precise, machine-checkable definitions of program meaning in both source and target languages.
- Simulation relations: Step-indexed logical relations or bisimulations are used to relate execution traces, supporting proofs of behavioral equivalence across complex translation steps (Zhang et al., 4 Feb 2025).
- Proof assistants: Systems such as Coq, Isabelle/HOL, or Lean are applied for end-to-end, machine-checked verification (though in agentic or neural systems, correctness is frequently ensured by empirical or property-preserving mechanisms in the absence of proof assistants).
- Property-based testing and verification pipelines: In practical, self-improving frameworks, correctness is enforced by automated test harnesses and static verifiers (e.g., syntactic affine-type checking via AST analysis, or functional regression via randomized test suites (Zhang et al., 4 Feb 2025, Zhang et al., 19 Nov 2025)).
A crucial architectural element is a verifier or evaluator module that enforces functional and non-functional requirements as a gate for accepting new variants (Zhang et al., 19 Nov 2025, Zhang et al., 4 Feb 2025, Wu et al., 24 Jul 2025).
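A minimal illustration of such a verifier gate, assuming a candidate kernel and a trusted reference are both plain NumPy callables (hypothetical names; real systems compare against, e.g., PyTorch reference outputs):

```python
import numpy as np

def passes_functional_gate(candidate, reference, trials=100,
                           shape=(64, 64), atol=1e-5):
    """Property-based regression check: the candidate must agree with
    the reference on every randomly sampled input to be admitted."""
    rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility
    for _ in range(trials):
        x = rng.standard_normal(shape).astype(np.float32)
        if not np.allclose(candidate(x), reference(x), atol=atol):
            return False
    return True
```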
3. Agentic and Self-Improving Compiler Frameworks
Recent advances have produced closed-loop, agentic systems that autonomously search for, propose, and verify code transformations, embedding correctness guarantees within their iteration protocols:
- AccelOpt (Zhang et al., 19 Nov 2025): An LLM-driven kernel optimizer for AI accelerators that operates within a strict validation loop: for each code variant, a profiler checks correctness and performance, discarding any transformation that fails functional validation or introduces regressions. Memory curation explicitly labels positive (performance-improving and correctness-preserving) and negative (regression-inducing) rewrites, feeding only validated experience into further search. The selection routine enforces that at each iteration, only the fastest correct variant per plan group is persisted, guaranteeing monotonic improvement without breaking correctness.
- Adaptive Self-Improvement for ML Libraries (Zhang et al., 4 Feb 2025): Proposer→Guardian→Verifier architectures generate architecture-specific tensor kernels, filtering every output through both functional tests (equality to PyTorch reference outputs) and static affine-type verifiers (enforcing dataflow and one-time-use constraints; a toy version of such a check is sketched after this list). Only demonstrably correct outputs are allowed to enter the demonstration memory and influence subsequent model generations, ensuring that the agent never drifts from the intended specification.
- Auto-RCA for Root Cause Analysis (Wu et al., 24 Jul 2025): Iterative code repair in highly structured agent frameworks uses an empirical “regression-prevention” oracle: any proposed patch is accepted only if it maintains or strictly improves the macro-averaged score on an expert-verified benchmark, thus ensuring monotonic semantic progress and precluding regressions in correctness.
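As a toy illustration of a syntactic affine-use check (not the actual STeP verifier), the sketch below uses Python's `ast` module to flag variables read more than once; real verifiers track per-operator dataflow rather than raw name loads:

```python
import ast
from collections import Counter

def affine_violations(source: str) -> list[str]:
    """Flag names whose values are read more than once, violating a
    one-time-use (affine) discipline. Function names called repeatedly
    would also be flagged here; a real checker distinguishes them."""
    loads = Counter(
        node.id
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    )
    return [name for name, count in loads.items() if count > 1]
```

For example, `affine_violations("y = f(x)\nz = g(x)")` returns `['x']`, since `x` is consumed twice.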
4. Algorithmic Structures Ensuring Correctness Preservation
The formal correctness loop in agentic optimizers typically implements:
- Generation: Propose code variants via an LLM (planner, proposer) conditioned on available context and memory.
- Verification/Evaluation: Execute comprehensive correctness checks—test suites, static property verification, performance regression analyses.
- Acceptance: Only accept a code variant $c'$ over the current best $c$ iff
$$\mathrm{correct}(c') \;\wedge\; U(c') \ge U(c),$$
i.e., all correctness checks pass and the utility does not regress.
- Archival/Memory Update: Retain only verified, non-regressive variants and use them as experience for further iterations. Both positive (improved) and negative (failed) rewrites are archived for downstream conditioning, avoiding repeated regressions (Zhang et al., 19 Nov 2025).
This protocol can be formalized as a bandit or empirical optimization process with hard constraints on correctness.
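A minimal sketch of this closed loop, with `propose`, `verify`, and `utility` as hypothetical stand-ins for the planner/proposer, the verifier gate, and the utility metric of the cited frameworks:

```python
def optimize(seed, propose, verify, utility, iterations=50):
    """Generation -> verification -> acceptance -> archival. The incumbent
    is replaced only by verified, non-regressive variants, so the best
    utility is monotone by construction; both positive and negative
    rewrites are archived as conditioning memory for later proposals."""
    best, memory = seed, []
    for _ in range(iterations):
        variant = propose(best, memory)              # Generation
        improved = verify(variant) and utility(variant) >= utility(best)
        memory.append((variant, "positive" if improved else "negative"))
        if improved:
            best = variant                           # Acceptance
    return best, memory
```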
Table: Key Verifier Components in Agentic Optimization Frameworks
| System/Paper | Verification Modules | Rejection Criteria |
|---|---|---|
| AccelOpt (Zhang et al., 19 Nov 2025) | Profiler (correctness + latency), memory curation | Functional/test failures and performance regressions discarded |
| Adaptive ML Lib. (Zhang et al., 4 Feb 2025) | Functional unit tests, affine-type AST check | Variants failing functional or typing tests discarded |
| Auto-RCA (Wu et al., 24 Jul 2025) | End-to-end benchmark scoring, bad-case analysis | Only patches with a non-decreasing benchmark score accepted |
5. Evaluation Criteria and Empirical Guarantees
Quantitative and property-based evaluation is central to both classical and agentic formal compiler correctness:
- Functional coverage: Fraction of benchmark inputs for which compiled variants produce correct outputs relative to a formal or empirical reference.
- Regret: For self-improving or empirical frameworks, the cumulative regret, i.e., the performance or correctness loss relative to the optimal variant accumulated over iterations, is minimized by enforcing that only improving or equivalently correct transformations are retained (Zhang et al., 19 Nov 2025, Wu et al., 24 Jul 2025); a standard formalization appears after this list.
- Property preservation: For domain-specific properties (e.g., affine usage in STeP code (Zhang et al., 4 Feb 2025)), failure to uphold the formal property leads to outright rejection upstream of demonstration memory.
- Monotonicity: Guarantee that no transformation in the optimization loop can degrade correctness metrics, under all observed test cases and static checkers.
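One standard way to write the cumulative-regret objective referenced above, assuming $U$ is the utility function from Section 1, $c_t$ the accepted variant at iteration $t$, and $c^{*}$ the optimal variant (this particular notation is ours, not the cited papers'):

$$R_T \;=\; \sum_{t=1}^{T} \bigl( U(c^{*}) - U(c_t) \bigr), \qquad \text{with } U(c_{t+1}) \ge U(c_t) \text{ for every admitted } c_{t+1}.$$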
Empirically, such frameworks consistently demonstrate strict monotonic improvement curves with plateaus corresponding to hardware or specification bounds (e.g., AccelOpt improves average percentage-of-peak throughput from 49% → 61% on real kernels, always under correctness constraints; failure to meet regression-free criteria results in immediate variant rejection (Zhang et al., 19 Nov 2025)).
6. Domain Generalization and Applicability
While historically formal compiler correctness has required manual proofs for each source–target pair, agentic and self-improving approaches have demonstrated transfer across:
- Domains: ML operator libraries, hardware kernel DSLs, root cause graphs, database query plans (Zhang et al., 4 Feb 2025, Zhang et al., 19 Nov 2025).
- Properties: Arbitrary user-defined constraints, encompassing functional, type-level, and algebraic invariants.
- Optimizers: The same beam-search + memory-curation or bandit acceptance-reject loop is effective in numerical, symbolic, and workflow settings, provided an executable correctness verifier exists.
This universality is contingent not on the agent’s generative power but on the correctness filter’s strength and domain relevance.
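A sketch of the implied interface, assuming the hypothetical `optimize` loop above: any domain (kernels, query plans, root-cause graphs) plugs in by supplying an executable checker of the same shape.

```python
from typing import Callable, Protocol

class Verifier(Protocol):
    """Executable correctness filter; the acceptance loop never
    inspects the artifact itself, only this boolean gate."""
    def __call__(self, artifact: object) -> bool: ...

def make_gate(checks: list[Callable[[object], bool]]) -> Verifier:
    """Conjoin domain-specific checks (tests, type checkers, profilers)
    into a single gate for the optimization loop."""
    return lambda artifact: all(check(artifact) for check in checks)
```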
7. Practical Considerations and Future Directions
- Empirical vs. proof-based correctness: While full formal verification remains the gold standard in safety-critical domains, hybrid approaches that leverage extensive empirical validation alongside selective static verification are dominating rapidly iterating agentic compiler frameworks (Zhang et al., 19 Nov 2025, Zhang et al., 4 Feb 2025).
- Memory and Negative Examples: Storing not only positive but also negative rewrites in agentic optimization memory prevents cyclical errors and accelerates convergence, a mechanism absent from legacy compiler frameworks (Zhang et al., 19 Nov 2025).
- Early stopping and saturation: Once the correctness-constrained optimization loop plateaus due to hardware or semantic bounds (e.g., percentage-of-peak throughput above 80%), further code evolution yields negligible value; a simple saturation test is sketched after this list.
- Modularization: Separation of planning/generation, execution/evaluation, summarization, and memory within agent architecture enables tractable diagnosis and formal reasoning about correctness propagation and failure isolation (Zhang et al., 19 Nov 2025).
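A simple saturation test of the kind alluded to above (hypothetical; thresholds are illustrative), where `history` is the sequence of best utilities per iteration:

```python
def plateaued(history, window=5, eps=1e-3):
    """Stop when the best utility has improved by less than eps over
    the last `window` iterations; sound because the loop is monotone."""
    if len(history) <= window:
        return False
    return history[-1] - history[-window - 1] < eps
```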
In summary, formal compiler correctness in modern, agentic, and self-improving systems is achieved not through a single monolithic verification, but via layered, memory-augmented, closed-loop protocols that guarantee each transformation is rigorously tested and only admitted if correctness constraints hold. This approach supports rapid, robust, and domain-agnostic evolution of high-performance code without sacrificing semantic guarantees (Zhang et al., 19 Nov 2025, Zhang et al., 4 Feb 2025, Wu et al., 24 Jul 2025).