
Behavior-Preserving Code Changes

Updated 9 December 2025
  • Behavior-preserving code changes are program transformations that maintain identical outputs and side-effects, ensuring semantic equivalence across all inputs.
  • They leverage formal methods, static analysis, and dynamic testing to validate refactoring, micro-optimizations, and library substitutions, enhancing software maintenance and performance.
  • Automated frameworks like SecEr and mutation testing exemplify how rigorous equivalence checking and risk diagnosis can foster reliable, efficient code evolution.

Behavior-preserving code changes are program transformations that modify source code while rigorously ensuring that the external behaviors—outputs, side-effects, and observable runtime states—remain unchanged for all inputs. These transformations encompass not only refactorings traditionally supported by software tools but also micro-optimizations, algebraic rewrites, library substitutions, and certain compiler passes. They are critical for code evolution, optimization, and maintenance, enabling developers to restructure or enhance their systems with theoretical and empirical guarantees of semantic equivalence.

1. Formal Definitions and Foundations

The defining characteristic of a behavior-preserving code change is semantic equivalence. Formally, let $P$ and $P'$ be two programs, and let $\sigma$ range over all valid input states. Behavior preservation requires: $\forall\,\sigma~.~\mathrm{exec}(P,\sigma) \equiv \mathrm{exec}(P',\sigma)$, where $\mathrm{exec}(P,\sigma)$ denotes the observable output or final state after executing $P$ from initial state $\sigma$ (AlOmar et al., 2021). This implies a semantic equivalence relation $\cong$ over program texts such that $P \cong P'$ if they realize the same input-output mapping and side-effect set. In practice, total equivalence is usually assessed modulo a sufficiently exhaustive test suite or, in some frameworks, instrumented trace comparison at selected program locations (e.g., “points of interest” or POIs) (Insa et al., 2017).
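Since the universal quantification over all input states is undecidable in general, practical checks approximate it over a finite test suite. A minimal sketch in Python (function and variable names are illustrative, not taken from any cited tool):

```python
def behavior_preserved(exec_p, exec_p_prime, inputs):
    """Finite approximation of: forall sigma, exec(P, sigma) == exec(P', sigma).
    Compares observable outputs of the two versions over a suite of inputs."""
    mismatches = [s for s in inputs if exec_p(s) != exec_p_prime(s)]
    return not mismatches, mismatches

# An algebraic rewrite that should be behavior-preserving:
p = lambda n: sum(range(n + 1))        # P: iterative summation
p_prime = lambda n: n * (n + 1) // 2   # P': closed-form rewrite
ok, diffs = behavior_preserved(p, p_prime, range(1000))
assert ok and diffs == []              # no divergence on the sampled inputs
```

The guarantee is only as strong as the suite: the check establishes equivalence modulo the sampled inputs, not true semantic equivalence.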

For transformation-based approaches, a semantic-preserving transformation (SPT) is a mapping $T: X \to X$ such that $\forall x \in X.\ A(x, T(x)) = 1$, with $A$ a functional equivalence oracle (Hooda et al., 5 Dec 2025). For dynamic regression testing, equivalence is typically defined as identical test outcomes for all tests in a suite, or as trace equivalence at POIs: $\tau_1(i) \equiv \tau_2(i)$ for all test inputs $i$ and evaluation traces $\tau$ collected from the two program versions (Insa et al., 2017).

2. Taxonomy of Approaches

Research on behavior preservation encompasses several distinct methodologies:

  • Formal methods: Employ model-theoretic or operational semantics, precondition calculi, and graph transformations. Behavior-preserving operations are shown correct by induction, rewriting logic, or simulation proofs. An example is the mechanically verified global variable renaming in CompCert C (Cohen, 2016), where correspondence of program behaviors is proven via forward and backward simulation in Coq.
  • Static analyses and refactoring safety tools: Integrated into IDEs (e.g., Eclipse JDT, SafeRefactor, Refactoring Browser) to ensure safe application of refactorings by checking semantic model invariants, name bindings, and type constraints. Differential preservation analyses compare semantic graphs before and after transformation (AlOmar et al., 2021, Brinksma et al., 13 Nov 2024).
  • Dynamic and regression testing: Execute pre-existing or automatically generated test suites on both versions and compare outcomes to enforce behavioral equivalence. Mutation testing can serve as a safety net when refactoring test code itself (Parsai et al., 2015). Tools like SecEr synthesize regression test cases instrumented at arbitrary POIs to detect behavioral divergence at runtime (Insa et al., 2017, Insa et al., 2018).
  • Manual and heuristic approaches: Perform inspection of code diffs, analyze commit messages, or rely on developer expertise to tag and validate semantic-preserving edits. While accessible, these are less rigorous and scale poorly (AlOmar et al., 2021).

3. Notable Automated Frameworks and Empirical Results

Several open-source and experimental tools exemplify automated approaches:

SecEr (Erlang Code Evolution Control):

  • Formalizes behavior equivalence via trace comparison at user-selected POIs. For modules $M_1, M_2$ and input function set $F$, define $\tau_1(i), \tau_2(i)$ as the sequences of POI values for input $i$.
  • Declares behavior preservation if $\forall i \in T.\ \tau_1(i) = \tau_2(i)$ for generated test suite $T$; mismatches are quantified as rates over $T$.
  • Orchestrates TypEr (type inference), CutEr (concolic path enumeration), and PropEr (random/mutation-guided input generation) to cover input domains and program points.
  • Experimental highlights include detecting zero mismatches for correct optimizations, and precisely localizing bugs in semantic refactorings with mismatch rates ranging from 8.8% to 91% for intentionally injected errors (Insa et al., 2017, Insa et al., 2018).
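SecEr itself is an Erlang toolchain, but the underlying criterion (trace equivalence at POIs, with mismatches reported as a rate over the suite) can be sketched generically. In this illustrative Python analogue, each version appends values to a log at its POIs:

```python
def traced_v1(n, log):
    """Original version: each loop iteration logs the accumulator (the POI)."""
    acc = 0
    for i in range(n):
        acc += i
        log.append(acc)      # POI: accumulator value after each step
    return acc

def traced_v2(n, log):
    """Evolved version: restructured as a while loop, same POI instrumented."""
    acc, i = 0, 0
    while i < n:
        acc = acc + i
        log.append(acc)      # same POI in the new version
        i += 1
    return acc

def mismatch_rate(v1, v2, inputs):
    """Fraction of inputs i with tau_1(i) != tau_2(i) over the suite T."""
    bad = 0
    for i in inputs:
        t1, t2 = [], []
        v1(i, t1)
        v2(i, t2)
        bad += t1 != t2
    return bad / len(inputs)

assert mismatch_rate(traced_v1, traced_v2, range(1, 50)) == 0.0
```

A behavior-altering edit inside the loop would surface as a nonzero rate, and the first diverging POI value localizes the fault.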

Mutation Testing for Test Suite Refactoring:

  • Scores test suites $P_1$ (original) and $P_2$ (refactored) by the proportion of mutants killed: $\mu(P_i) = \mathrm{killed}(P_i)\,/\,|\mathcal{M}|$.
  • Behavior preservation is certified if $\mu(P_2) = \mu(P_1)$; any decrease indicates a weakening or unintended alteration (Parsai et al., 2015).
  • Differential killing pinpoints altered assertions; in empirical studies, mutation scores detected behavioral drift not captured by statement or branch coverage.
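A minimal sketch of the scoring scheme, with a hypothetical system under test and hand-written mutants (all names illustrative):

```python
def mutation_score(test_suite, mutants):
    """mu(P) = killed(P) / |M|: fraction of mutants at least one test kills."""
    killed = sum(any(not t(m) for t in test_suite) for m in mutants)
    return killed / len(mutants)

# Hypothetical system under test: absolute value, plus two mutants of it.
mutants = [
    lambda x: x,     # mutant 1: drops the negation branch
    lambda x: -x,    # mutant 2: always negates
]

# Original test suite P1 and a (weakened) refactored suite P2.
suite_p1 = [lambda f: f(-3) == 3, lambda f: f(2) == 2]
suite_p2 = [lambda f: f(-3) == 3]   # the positive-input assertion was lost

mu1 = mutation_score(suite_p1, mutants)   # kills both mutants
mu2 = mutation_score(suite_p2, mutants)   # mutant 2 survives on positives
assert mu1 == 1.0 and mu2 == 0.5
assert mu2 < mu1   # the drop flags a non-behavior-preserving test refactoring
```

Differentially comparing which mutants each suite kills (not just the aggregate score) pinpoints exactly which assertion lost its discriminating power.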

Renaming Global Variables in C (CompCert C):

  • Proves behavior preservation for the renaming transform via simulation relations in Coq, encompassing trace-by-trace correspondence of events, careful handling of shadowing/capture, and validation against external linkage (Cohen, 2016).

Auto-SPT (Semantic Preserving Transformations via LLMs):

  • Utilizes LLMs to propose, synthesize, and compose SPTs. Measures transformation "strength" (diameter) and diversity through adversarial search against code clone detectors.
  • Compositions of diverse SPTs dramatically degrade model performance—e.g., clone-detection confidence drops from 93% to <10% upon strong SPT application—and also facilitate adversarial robustness augmentation for ML models (Hooda et al., 5 Dec 2025).
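Auto-SPT synthesizes its transformations with LLMs; the underlying contract (a transformation $T$ and an equivalence oracle $A$) can be illustrated with a hand-written SPT, parameter renaming over Python ASTs. This sketch assumes no shadowing and requires Python 3.9+ for `ast.unparse`:

```python
import ast

def rename_params(src, suffix="_r"):
    """A hand-written SPT T: consistently rename a function's parameters
    and their uses, which cannot alter the input-output mapping."""
    tree = ast.parse(src)
    params = {a.arg for a in tree.body[0].args.args}

    class Renamer(ast.NodeTransformer):
        def visit_arg(self, node):
            if node.arg in params:
                node.arg += suffix
            return node

        def visit_Name(self, node):
            if node.id in params:
                node.id += suffix
            return node

    Renamer().visit(tree)
    return ast.unparse(tree)

def oracle(src1, src2, inputs):
    """Finite approximation of the equivalence oracle A(x, T(x))."""
    env1, env2 = {}, {}
    exec(src1, env1)
    exec(src2, env2)
    name = ast.parse(src1).body[0].name   # function name is left untouched
    return all(env1[name](i) == env2[name](i) for i in inputs)

src = "def tri(n):\n    return n * (n + 1) // 2"
assert oracle(src, rename_params(src), range(100))
```

Adversarially composing many such rewrites (renamings, algebraic identities, control-flow restructurings) yields surface-divergent but semantically identical programs, which is what degrades clone-detector confidence.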

4. Decomposition, Granularity, and Coverage

Empirical decomposition of behavior-preserving changes into refactorings and primitive operations reveals significant complexity:

  • RefactoringMiner-based studies on 100 OSS Java method pairs show only 33.9% coverage via standard refactoring templates. An extended catalog of 67 atomic operations (algebraic rewrites, conditional transformations, binding refinements, API substitutions) raises coverage to 77.5%, but approximately 22.5% of differences remain unexplained, mainly due to algorithmic or large-structure rewrites (Someya et al., 16 Aug 2025).
  • Most widely covered refactorings are method renaming and push/pull up/down; under-researched operations include control-flow reorganizations, dynamic-feature interactions, and model-level transformations (UML/Alloy) (AlOmar et al., 2021).
Refactoring Detector    Avg. Coverage    Catalog Expansion Effect
RMiner (default)        33.9%
+ 67 operations         77.5%            +128.6%

Among transformation techniques, statement addition yields the highest neutral variant rates (≈52%), followed by deletion (≈22%) and replacement (≈15%), with higher-order targeted transformations (method calls, subtype swaps, loop flipping) reaching up to 73% (Harrand et al., 2019).
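The neutral-variant notion (a transformed program that still passes the entire suite) can be sketched with the statement-deletion operator; the code below is illustrative, not the tooling from the cited study:

```python
import ast
import copy

def deletion_variants(src):
    """Yield source variants with exactly one top-level statement of the
    function body deleted (the 'remove' transformation operator)."""
    tree = ast.parse(src)
    body_len = len(tree.body[0].body)
    for i in range(body_len):
        if body_len == 1:
            break                     # never empty the function body
        variant = copy.deepcopy(tree)
        del variant.body[0].body[i]
        yield ast.unparse(variant)

def is_neutral(src, fn_name, tests):
    """A variant is neutral if it compiles, runs, and passes the whole suite."""
    env = {}
    try:
        exec(src, env)
        return all(t(env[fn_name]) for t in tests)
    except Exception:
        return False

src = '''
def clamp(x):
    x = max(x, 0)
    x = min(x, 10)
    x = x + 0
    return x
'''
tests = [lambda f: f(-5) == 0, lambda f: f(7) == 7, lambda f: f(99) == 10]
neutral = [v for v in deletion_variants(src) if is_neutral(v, "clamp", tests)]
assert len(neutral) == 1   # only deleting the redundant 'x = x + 0' is neutral
```

Regions like the redundant statement above are the "plastic" AST contexts where semantic-preserving change is cheap; deleting either clamp bound, by contrast, is immediately caught by the suite.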

5. Diagnostic Models and Risk Management

Recent advances focus on explicit risk diagnosis and mitigation:

  • ReFD models refactoring as sequences of microsteps $\langle m_1, m_2, \ldots, m_k \rangle$, each associated with a set of potential risks $PR(m)$. Risks are detected by context-sensitive analyzers, and a verdict mechanism filters those neutralized by subsequent microsteps, resulting in actionable warnings for the developer. Representative risks include double definitions, broken subtyping, lost specification, and missing definitions (Brinksma et al., 13 Nov 2024).
  • Danger diagnosis is statically performed using detectors chained on program graphs (augmenting ASTs), localizing hazards to precise program locations before any code is committed, thus preempting silent semantic errors (Brinksma et al., 13 Nov 2024).
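The microstep-and-verdict mechanism can be sketched as follows; the microstep names and risk labels are illustrative stand-ins, not ReFD's actual catalog or analyzers:

```python
from dataclasses import dataclass, field

@dataclass
class Microstep:
    name: str
    risks: set                                      # PR(m): risks this step raises
    neutralizes: set = field(default_factory=set)   # risks this step resolves

def diagnose(steps):
    """Replay the sequence <m1, ..., mk>; a risk becomes a warning only if
    no later microstep neutralizes it (the 'verdict' filtering)."""
    warnings = []
    for i, m in enumerate(steps):
        for risk in m.risks:
            if not any(risk in later.neutralizes for later in steps[i + 1:]):
                warnings.append((m.name, risk))
    return warnings

# Hypothetical decomposition of a 'pull up method' refactoring:
steps = [
    Microstep("copy method to superclass", {"double definition"}),
    Microstep("remove method from subclass", set(),
              neutralizes={"double definition"}),
    Microstep("rebind call sites", {"missing definition"}),
]
assert diagnose(steps) == [("rebind call sites", "missing definition")]
```

The transient "double definition" is silenced because a later microstep resolves it; only the genuinely unresolved hazard surfaces to the developer.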

6. Applications, Limitations, and Hybrid Strategies

Applications span:

  • Source code optimization: Guided search for equivalent mutants that improve non-functional properties (performance), using test-based equivalence as the validity check. Orders-of-magnitude speedups can be achieved that surpass traditional compiler optimizations, provided the test suite is sufficiently exhaustive (López et al., 2018).
  • Software diversity, approximate computing, and genetic improvement: Plastic code regions (AST contexts amenable to semantic-preserving change) can support large pools of neutral program variants for reliability and energy/resource trade-offs (Harrand et al., 2019).
  • ML-based security and refactoring robustness: Data augmentation and adversarial SPT composition improve resistance of ML clone/vulnerability detectors to real-world code changes (Hooda et al., 5 Dec 2025).
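The optimization-by-equivalent-mutants idea in the first bullet can be sketched as a guided search over candidate implementations, using test-based equivalence as the validity filter (illustrative code; equivalence holds only modulo the suite):

```python
import time

def fastest_equivalent(candidates, reference, inputs):
    """Keep candidates that are test-equivalent to the reference on the
    suite, then select the one with the lowest measured runtime."""
    valid = [f for f in candidates
             if all(f(i) == reference(i) for i in inputs)]

    def cost(f):
        start = time.perf_counter()
        for i in inputs:
            f(i)
        return time.perf_counter() - start

    return min(valid, key=cost)

reference = lambda n: sum(range(n + 1))
candidates = [
    reference,
    lambda n: n * (n + 1) // 2,   # equivalent closed form, much faster
    lambda n: n * n,              # not equivalent: rejected by the filter
]
best = fastest_equivalent(candidates, reference, range(200))
assert best(100) == 5050          # winner still computes the same function
```

The non-equivalent candidate is discarded before timing, so the search can only ever return a variant that agrees with the reference on every suite input.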

Limitations include susceptibility to test suite coverage (M-equivalence vs. true semantic equivalence), scalability of proof-based approaches, operator catalog completeness, and, for dynamic techniques, the intractability of exhaustive input enumeration. Hybrid pipelines that combine static analysis, dynamic trace comparison, and formal verification provide the most rigorous coverage, with recommendations for extensible libraries of preconditions and semantic schemas to facilitate cross-language support (AlOmar et al., 2021).

7. Future Directions and Open Challenges

Key challenges for advancing the field include:

  • Automated inference of macro-level behavior-preserving rewrites spanning method/class boundaries;
  • Integration of dynamic invariant mining and symbolic equivalence checks for scalable validation;
  • Lower-barrier DSLs for user-extensible definitions of semantic-preserving operations;
  • Improved model-driven tooling, leveraging semantic graphs and risk diagnosis to provide transparent, actionable feedback in real time (Someya et al., 16 Aug 2025, Brinksma et al., 13 Nov 2024);
  • Development of frameworks for robust ML model training on fully adversarial, yet semantics-preserving code transformations (Hooda et al., 5 Dec 2025).

Behavior-preserving code changes thus sit at the intersection of formal semantics, software engineering automation, dynamic test generation, and machine learning robustness, serving as the foundational principle underlying refactoring safety, optimization fidelity, and resilient software evolution.
