CoCoMagic: Differential Testing for Autonomous Systems

Updated 27 September 2025
  • CoCoMagic is a framework for differential testing that integrates metamorphic testing, cooperative co-evolution, and rule-based interpretability to expose system behavior differences.
  • It employs a dual-population co-evolutionary search that evolves realistic source scenarios and perturbations to uncover critical regressions.
  • The approach yields actionable diagnostics via RuleFit interpretability and detects up to 287% more distinct high-severity behavioral differences than baseline methods.

CoCoMagic refers to a methodological framework for differential testing of autonomous systems undergoing continuous development, emphasizing both the effective exposure of behavioral differences across system versions and the provision of interpretable diagnostics for practitioners. The approach systematically integrates metamorphic testing, differential testing, cooperative co-evolutionary search, and rule-based interpretability, achieving significant advances in scenario realism, detection of system discrepancies, and actionable debugging support (Yousefizadeh et al., 20 Sep 2025).

1. Framework Overview

CoCoMagic formalizes automated test case generation for evolving autonomous systems (exemplified by autonomous driving platforms) as a constrained cooperative co-evolutionary search problem. The method addresses three central challenges: the vast scenario space, the lack of reliable test oracles for complex behaviors, and the need for interpretable diagnostics. It accomplishes these objectives through:

  • Decomposition of test input into two co-evolving populations: “source scenarios” (reflecting realistic operating conditions) and “perturbations” (derived via predefined metamorphic relations).
  • Constraint-driven search processes that filter unrealistic or irrelevant test instances, maintaining scenario plausibility in simulation contexts such as CARLA.
  • A fitness function that quantifies behavioral divergence across system versions according to violations of metamorphic relations.

This methodological synthesis improves the granularity and relevance of the identified behavioral differences while maintaining simulation efficiency.
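To make this decomposition concrete, the following sketch models a composite test case as a (source scenario, perturbation) pair checked against a metamorphic relation. It is a minimal illustration: the field names, the example relation, and the 0.5 m tolerance are assumptions for exposition, not details drawn from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceScenario:
    """A realistic operating condition (illustrative fields only)."""
    weather: str
    num_pedestrians: int
    ego_speed: float  # m/s

@dataclass
class Perturbation:
    """A change derived from a predefined metamorphic relation."""
    weather_delta: Optional[str] = None
    added_pedestrians: int = 0

def apply_perturbation(src: SourceScenario, p: Perturbation) -> SourceScenario:
    """Compose the follow-up scenario from a source scenario and a perturbation."""
    return SourceScenario(
        weather=p.weather_delta or src.weather,
        num_pedestrians=src.num_pedestrians + p.added_pedestrians,
        ego_speed=src.ego_speed,
    )

def mr_violation(source_trace: dict, followup_trace: dict) -> float:
    """Illustrative metamorphic relation: adding pedestrians should not
    increase the ego vehicle's lateral deviation by more than a tolerance.
    Returns the violation extent (0.0 means the relation holds)."""
    TOLERANCE = 0.5  # metres; hypothetical bound, not from the paper
    excess = followup_trace["lateral_dev"] - source_trace["lateral_dev"]
    return max(0.0, excess - TOLERANCE)
```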

2. Cooperative Co-Evolutionary Search

Central to the approach is the dual-population cooperative co-evolutionary algorithm:

  • Source scenarios are evolved to cover a diverse and representative spectrum of real-world states for the autonomous system under test.
  • Perturbations, structurally specified via metamorphic relations, are evolved simultaneously to maximize their potential for exposing inconsistencies or regressions in system behavior.
  • The algorithm iteratively pairs source scenarios with perturbations, forming composite test cases that are applied to both the reference and updated versions of the target system.
  • Fitness is assigned based on the measured difference in metamorphic relation violations across the two versions. Individual-level fitness evaluation promotes diversity and specialization, enabling the emergence of novel scenario-perturbation combinations that maximize behavioral divergence.

This co-evolutionary formulation addresses the combinatorial explosion inherent in high-dimensional input spaces and supports the discovery of edge cases that traditional random or single-population search strategies systematically miss.
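A minimal Python sketch of this loop, assuming tournament selection and an abstract `evaluate(scenario, perturbation)` callback that runs the composite test on both system versions and returns a divergence score; the paper's actual variation operators, population sizes, and pairing strategy are not reproduced here.

```python
import random

def mutate(individual):
    """Domain-specific mutation operator; identity placeholder here."""
    return individual

def evolve_step(population, fitness):
    """Tournament selection plus mutation (simplified placeholder operators)."""
    new_pop = []
    for _ in population:
        a, b = random.sample(range(len(population)), 2)
        winner = population[a] if fitness[a] >= fitness[b] else population[b]
        new_pop.append(mutate(winner))
    return new_pop

def coevolve(scenarios, perturbations, evaluate, generations=50):
    """Dual-population cooperative co-evolution (simplified sketch).

    Each individual is credited with its best pairing from the opposite
    population, matching the individual-level fitness described above.
    """
    for _ in range(generations):
        s_fit = [max(evaluate(s, p) for p in perturbations) for s in scenarios]
        p_fit = [max(evaluate(s, p) for s in scenarios) for p in perturbations]
        scenarios = evolve_step(scenarios, s_fit)
        perturbations = evolve_step(perturbations, p_fit)
    return scenarios, perturbations
```

Scoring every cross-population pairing is quadratic in population size; a practical implementation would sample a subset of collaborators, but the best-pairing credit assignment above mirrors the description in this section.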

3. Constraints, Initialization, and Realism

The realism and relevance of test scenarios are enforced via several key mechanisms:

  • Population initialization leverages real execution traces and representative scenario data from the target system (e.g., CARLA simulation logs), reducing the risk of generating implausible or irrelevant edge cases.
  • Constraints penalize test cases that deviate excessively from realistic operational conditions, balancing exploration at the edges of system capabilities against practical applicability for downstream debugging.
  • Each candidate test case undergoes evaluation against stored “execution scenarios” using a dissimilarity metric, ensuring that finalized tests represent feasible states in the deployment domain.

This focus on realism is critical for transferability of behavioral insights from simulation to real-world system operation.
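As an illustration of the dissimilarity check, the sketch below accepts a candidate only if it lies near at least one stored execution scenario. The Euclidean metric, the feature encoding, and the threshold are illustrative assumptions; the paper's actual dissimilarity measure may differ.

```python
import numpy as np

def is_realistic(candidate_features, stored_scenarios, max_dissimilarity=1.0):
    """Accept a candidate test case only if its distance to the nearest
    stored execution scenario is within the allowed dissimilarity."""
    candidate = np.asarray(candidate_features, dtype=float)
    distances = np.linalg.norm(stored_scenarios - candidate, axis=1)
    return distances.min() <= max_dissimilarity

# Stored scenario features as rows: (ego speed, pedestrians, rain level).
stored = np.array([[8.0, 2.0, 0.1], [12.0, 0.0, 0.6]])
print(is_realistic([8.5, 2.0, 0.2], stored))  # True: close to the first trace
```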

4. Fitness Function Design

CoCoMagic’s fitness function is explicitly defined with respect to metamorphic relation (MR) violations:

  • For each test case, the extent of MR violation in both the reference and updated system versions is computed.
  • The primary fitness signal for the search algorithm is the absolute difference in MR violations, identifying cases where new system versions diverge significantly from established behavioral invariants.
  • Individual scenario or perturbation contributions are weighted by their best pairings from the opposite population, biasing search pressure toward diverse, high-impact test cases.

By directly quantifying behavioral regression as a function of MR violations, the fitness function ensures that discovered discrepancies are both actionable and systematic.
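In code, the core signal reduces to an absolute difference of violation extents. This is a minimal sketch; `run` is a hypothetical helper standing in for a CARLA execution of the composite test on a given system version.

```python
def differential_fitness(scenario, perturbation,
                         ref_system, new_system, run, mr_violation):
    """Fitness = |MR violation on the updated version
                  - MR violation on the reference version|.

    `run(system, scenario, perturbation)` executes the composite test and
    returns (source_trace, followup_trace); `mr_violation` scores the pair.
    """
    v_ref = mr_violation(*run(ref_system, scenario, perturbation))
    v_new = mr_violation(*run(new_system, scenario, perturbation))
    return abs(v_new - v_ref)
```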

5. Interpretability via Rule Extraction

CoCoMagic incorporates a dedicated interpretability module based on rule-based machine learning (RuleFit):

  • After scenario generation, all test cases and observed outcomes are fed to RuleFit, which extracts a compact set of “if–then” rules elucidating the conditions most predictive of behavioral divergence between system versions.
  • These rules are constructed to be human-readable and lend themselves directly to debugging workflows. For instance, a rule might specify that increased violation risk arises when the ego vehicle maintains its original lateral position under specified weather conditions and obstacle profiles.
  • The interpretability approach yields approximately 30 high-coverage rules per experiment, guiding developers toward the root causes of system regressions with actionable specificity.

This feature is vital for system maintainers in domains lacking formal test oracles, providing transparency and diagnostic leverage as autonomy models grow in complexity.
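A sketch of the rule-extraction step, assuming the open-source `rulefit` package (the paper does not say which RuleFit implementation it uses); the feature names and synthetic data below are placeholders for illustration.

```python
import numpy as np
from rulefit import RuleFit  # pip install rulefit

# X: one row of scenario/perturbation features per executed test case;
# y: the observed divergence score for that test case.
feature_names = ["rain_intensity", "ego_lateral_pos", "num_obstacles"]
X = np.random.rand(200, 3)            # placeholder data
y = X[:, 0] * 0.8 + (X[:, 1] > 0.5)   # synthetic divergence signal

rf = RuleFit(max_rules=200)
rf.fit(X, y, feature_names=feature_names)

# Keep the if-then rules the sparse linear model actually selected,
# ranked by importance.
rules = rf.get_rules()
rules = rules[(rules["coef"] != 0) & (rules["type"] == "rule")]
print(rules.sort_values("importance", ascending=False)
           .head(10)[["rule", "coef", "support"]])
```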

6. Empirical Performance and Comparative Evaluation

Experimental evaluation of CoCoMagic on InterFuser (a state-of-the-art autonomous driving system) in the CARLA simulation environment demonstrates:

  • Up to 287% more distinct high-severity behavioral differences are identified compared to baseline approaches such as single-population genetic algorithms or random search.
  • Higher per-case fitness scores and coverage ratios, indicating improved sensitivity and specificity in differential testing.
  • Superior scenario realism, with test cases maintaining practical relevance for engineers and system verifiers.
  • Efficient use of simulation budget and execution resources, suggesting scalability as system complexity increases.

These empirical results validate the efficacy of cooperative co-evolution and dedicated interpretability modules for system-level regression testing.

7. Practical Implications and Domain Impact

CoCoMagic offers strategic capabilities for ongoing safety assurance in autonomous systems:

  • Automated generation of interpretable test cases that catch subtle behavioral regressions, supporting rigorous differential testing as systems evolve.
  • Provision of actionable, rule-based diagnostics that assist developers in targeting and correcting spurious changes or safety hazards.
  • Balancing the need for exploratory stress testing with the practical imperative of scenario realism and system-domain applicability.
  • Enhanced assurance in environments characterized by frequent model updates without reliable oracles, particularly where deep learning components may introduce unpredictable behavioral shifts.

A plausible implication is that the techniques underlying CoCoMagic could be adapted for other complex AI-driven domains facing similar regression and interpretability challenges.

Summary Table: Core Features of CoCoMagic

| Feature | Description | Domain Significance |
|---|---|---|
| Dual-population co-evolution | Evolves scenarios and perturbations jointly for maximal divergence | High sensitivity in testing |
| Constraint-driven realism | Filters test cases to ensure scenario plausibility | Transferable diagnostics |
| Metamorphic relation fitness | Quantifies regression via MR violation differentials | Systematic discrepancy search |
| RuleFit interpretability | Extracts human-readable rules explaining observed differences | Actionable debugging support |
| Simulation efficiency | High discovery rate with modest computational resources | Scalable to complex systems |

CoCoMagic establishes a rigorous and interpretable paradigm for differential testing, tailored to the evolving landscape of autonomous and artificial intelligence-driven systems. By integrating metamorphic test construction, cooperative co-evolutionary search, and dedicated interpretability mechanisms, it facilitates robust assurance and targeted diagnostics, contributing directly to safer, more reliable system engineering (Yousefizadeh et al., 20 Sep 2025).
