
Self-Injection & Proactive Fault Injection

Updated 19 March 2026
  • Self-injection and proactive fault injection are techniques that intentionally introduce faults to assess and enhance system resilience.
  • They employ internal instrumentation and external orchestration to trigger runtime faults, validating recovery and self-healing mechanisms.
  • These methodologies reveal vulnerabilities in embedded and IoT systems and inform design improvements that strengthen fault tolerance.

Self-injection and proactive fault injection are methodological paradigms for evaluating and enhancing the dependability, fault-tolerance, and real-time correctness of both deeply embedded systems and distributed IoT ecosystems. These approaches employ intentionally triggered perturbations—originating either internally (“self-injection”) or by controlled external orchestration (“proactive fault injection”)—to rigorously assess system resilience and expose latent vulnerabilities in self-healing or recovery mechanisms under operational workloads (Magliano et al., 2024, Duarte et al., 2022).

1. Definitions and Conceptual Foundation

Self-injection refers to the direct instrumentation of a digital system—often at the application or firmware level—so that it autonomously introduces faults into its own execution context during runtime. In the context of embedded or IoT systems, this encompasses the embedding of fault-injection probes, schedulers, and in-place self-healing policies. The self-injecting node or module monitors for anomalies, injects predetermined or randomized faults according to configured rules, and locally executes recovery strategies such as majority voting, range-checking, or state compensation. Self-injection thus enables continuous, in-situ validation of fault-tolerance routines, without reliance on external test rigs or orchestrators (Duarte et al., 2022).
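The loop of "inject according to configured rules, then recover locally" can be illustrated with a small simulation. This is a minimal sketch under stated assumptions. The class name SelfInjectingSensor, the two fault kinds (an out-of-range spike and an in-range offset), the triplicated reading, and the valid range are all hypothetical and are not taken from either cited framework. Range-checking removes spikes, and a plurality vote over the surviving replicas usually outvotes a single offset fault.

```python
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SelfInjectingSensor:
    """Hypothetical module that injects faults into its own triplicated readings."""
    p_inj: float                       # per-replica injection probability
    valid_range: tuple = (0.0, 100.0)  # plausible range used by the range check
    seed: int = 0                      # fixed seed -> repeatable fault sequence
    log: list = field(default_factory=list)

    def __post_init__(self):
        self.rng = random.Random(self.seed)

    def _maybe_inject(self, value, replica):
        # Self-injection rule: with probability p_inj, corrupt this replica.
        if self.rng.random() >= self.p_inj:
            return value
        kind = self.rng.choice(["spike", "offset"])
        # A spike (+1000) leaves the plausible range; an offset (+5) stays inside it.
        faulty = value + (1000.0 if kind == "spike" else 5.0)
        self.log.append({"replica": replica, "kind": kind, "value": faulty})
        return faulty

    def read(self, true_value):
        replicas = [self._maybe_inject(true_value, i) for i in range(3)]
        lo, hi = self.valid_range
        # Recovery 1: range check discards implausible replicas (catches spikes).
        plausible = [v for v in replicas if lo <= v <= hi]
        if not plausible:
            raise RuntimeError("all replicas implausible; escalate recovery")
        # Recovery 2: plurality vote over the survivors (ties resolved in replica order).
        return Counter(plausible).most_common(1)[0][0]


sensor = SelfInjectingSensor(p_inj=0.1, seed=7)
outputs = []
for _ in range(200):
    try:
        outputs.append(sensor.read(42.0))
    except RuntimeError:
        pass  # all three replicas were spiked in this round
print(f"{len(sensor.log)} faults injected, "
      f"{sum(v == 42.0 for v in outputs)}/{len(outputs)} correct outputs")
```

Because the injector is seeded, the same fault sequence is replayed on every run. That is the property that makes in-situ self-injection results comparable across firmware revisions.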

Proactive fault injection entails the deliberate and systematic introduction of faults into a system before the manifestation of latent defects or environmental disturbances. Rather than waiting for naturally occurring faults, the tester (external or internalized) triggers representative faults to exercise and validate the system’s resilience and self-repair logic. In distributed scenarios (e.g., IoT), this is exemplified by instrumented middleware (such as MQTT brokers) that inject faults (drops, corruptions, reorders) into network flows to evaluate the downstream system’s continuous dependability (Duarte et al., 2022). In deeply embedded real-time systems, proactive injections can target micro-architectural events, allowing for precise temporal correlation between fault occurrence and critical sections, deadlines, or safety margins (Magliano et al., 2024).

2. Reference Architectures and Implementation Techniques

2.1 Embedded Real-Time Fault-Injection Frameworks

The debug-based architecture described by Magliano et al. combines JTAG debugging with the ARM performance monitoring unit (PMU) to perform controlled, repeatable bit-level injections into registers, RAM, or the program counter. The architecture features:

  • Host-driven scripting (Python/Pexpect, XSCT Tcl) interfaced with JTAG (IEEE 1149.1) for breakpoint-triggered injection.
  • The ability to halt the core at a breakpoint, flip a programmed bit in a targeted structure (register or memory), and resume execution—all under cycle-level control.
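  • A bit-flipping routine, where xsct is an assumed host-side Python wrapper around an XSCT session:
    def inject_bit_flip(bp_address, target_struct, bitpos):
        # xsct: assumed wrapper (e.g. Pexpect-driven) exposing cmd/read/write helpers
        bp_id = xsct.cmd(f"bpadd -addr {bp_address:#x}")  # arm a breakpoint
        xsct.wait_until_halted()                          # core halts at the breakpoint
        word = xsct.read(target_struct.address)           # read the 32-bit target word
        word ^= 1 << bitpos                               # flip the selected bit
        xsct.write(target_struct.address, word & 0xFFFFFFFF)
        xsct.cmd(f"bpremove {bp_id}")                     # avoid re-triggering
        xsct.cmd("con")                                   # resume (XSCT "con")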
  • Target platforms include ARM Cortex-A9–based SoCs under FreeRTOS, running established benchmarks such as MiBench QSort, SHA, and Dijkstra (Magliano et al., 2024).
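Repeatability in such a campaign typically comes from expanding a single seed into the complete injection plan before any run. The sketch below assumes a hypothetical InjectionSpec record, a hypothetical plan_campaign helper, and illustrative address pools. On ARMv7, R15 is the program counter, so it serves as the "pc" target here. On real hardware, each spec would be handed to a routine like inject_bit_flip.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionSpec:
    fault_id: int
    bp_address: int  # breakpoint that triggers the injection
    target: str      # "reg", "mem" or "pc"
    address: int     # register index or memory address (illustrative)
    bitpos: int      # bit to flip in a 32-bit word


def plan_campaign(seed, n_faults, bp_addresses, targets):
    """Deterministically expand a seed into an ordered list of injections."""
    rng = random.Random(seed)
    plan = []
    for fault_id in range(n_faults):
        kind = rng.choice(sorted(targets))
        plan.append(InjectionSpec(
            fault_id=fault_id,
            bp_address=rng.choice(bp_addresses),
            target=kind,
            address=rng.choice(targets[kind]),
            bitpos=rng.randrange(32),
        ))
    return plan


# Hypothetical address pools; real values would come from the benchmark's map file.
targets = {"reg": list(range(13)), "mem": [0x0010_0000, 0x0010_0004], "pc": [15]}
bps = [0x0010_2A00, 0x0010_2B40]
a = plan_campaign(seed=2024, n_faults=5, bp_addresses=bps, targets=targets)
b = plan_campaign(seed=2024, n_faults=5, bp_addresses=bps, targets=targets)
assert a == b  # same seed, same campaign
for spec in a:
    print(spec)
```

Generating the plan up front also separates what is injected from how it is delivered (JTAG breakpoint or PMU interrupt), so the same plan can be replayed through either mechanism.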

2.2 Proactive Self-Injection Extensions

A proactive self-injection extension, as outlined by Magliano et al. (2024), uses the PMU's counter-overflow IRQ to trigger fault injections autonomously at defined microarchitectural event intervals (e.g., every $10^6$ retired instructions). In this paradigm, the faultInjectorTask configures the PMU, installs the ISR for overflows, and implements injection probability control:

  • Faults are injected within the ISR, indexed for precise logging and repeatability (tagged with fault_id, PMU count, timestamp, random seed).
  • Programmable density is achieved using stochastic thresholds:
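    /* In the PMU-overflow ISR; rand() returns 0..RAND_MAX, so normalize to [0, 1). */
    if ((double)rand() / ((double)RAND_MAX + 1.0) < P_inj) {
        raise_software_breakpoint();  /* platform-specific injection hook */
    }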
  • Injection schedules can be tuned to avoid interfering with hard real-time constraints by choosing a large inter-injection interval $T_{\mathrm{int}}$ and handling the overflow IRQ at low priority (Magliano et al., 2024).
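Before deploying on hardware, the overflow-driven scheme can be prototyped on the host. In this sketch, a software counter stands in for the PMU, on_overflow plays the role of the ISR, and the class name SimulatedPMUInjector is hypothetical. The log fields mirror those listed above (fault_id, PMU count, timestamp, random seed). The sketch only records injection events rather than corrupting state, and it is not the firmware from the cited work.

```python
import random
import time
from dataclasses import dataclass, field


@dataclass
class SimulatedPMUInjector:
    """Host-side stand-in for a PMU overflow IRQ that triggers injections."""
    period: int   # overflow every `period` retired instructions (e.g. 10**6)
    p_inj: float  # probability that an overflow actually injects
    seed: int
    retired: int = 0
    next_fault_id: int = 0
    overflows: int = 0
    log: list = field(default_factory=list)

    def __post_init__(self):
        self.rng = random.Random(self.seed)

    def retire(self, n_instructions):
        """Advance the counter, firing the 'ISR' once per crossed overflow boundary."""
        start = self.retired
        self.retired += n_instructions
        for k in range(start // self.period + 1, self.retired // self.period + 1):
            self.on_overflow(k * self.period)

    def on_overflow(self, pmu_count):
        self.overflows += 1
        # Stochastic density control: only a fraction p_inj of overflows inject.
        if self.rng.random() < self.p_inj:
            # A real ISR would corrupt state here; the sketch only logs the event.
            self.log.append({
                "fault_id": self.next_fault_id,
                "pmu_count": pmu_count,
                "timestamp": time.monotonic(),
                "seed": self.seed,
            })
            self.next_fault_id += 1


inj = SimulatedPMUInjector(period=10**6, p_inj=0.25, seed=42)
for _ in range(100):
    inj.retire(250_000)  # 25 million instructions in total
print(f"{len(inj.log)} injections after {inj.overflows} overflows")
```

Since the random draws occur once per overflow in a fixed order, replaying the same instruction count with the same seed reproduces the same set of injection points, regardless of how execution is chunked.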

2.3 Fault-Injection in IoT Middleware

In broker-based IoT settings, fault-injection add-ons intercept MQTT PUBLISH flows and apply operator pipelines to induce specific network or data anomalies:

  • Operator types include map(f) for payload transformation, randomDelay for added latency, buffer/reorder for out-of-order delivery, and randomDrop for probabilistic message omission.
  • Rules are configured dynamically (topic filters, activation windows) and executed before messages are delivered to subscribing applications, e.g., Node-RED flows augmented with SHEN nodes for self-healing.
  • The framework supports both static windowed and adaptive (monitor-triggered) injection schedules (Duarte et al., 2022).
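A minimal sketch of such an operator pipeline is shown below. It is applied to an in-memory message list instead of a live broker. The operator names follow the text, but the Message type, the inject helper, the prefix-based topic filter, and all signatures are assumptions; real MQTT topic filters use the + and # wildcards. Buffer/reorder is not modeled separately here, because random delays followed by delivery-time ordering already reorder messages.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    topic: str
    payload: float
    t: float  # delivery time in seconds


def map_op(f):
    """Payload transformation, e.g. a stuck-at value or a spike."""
    return lambda m, rng: replace(m, payload=f(m.payload))


def random_delay(max_delay):
    """Extra latency drawn uniformly from [0, max_delay] seconds."""
    return lambda m, rng: replace(m, t=m.t + rng.uniform(0.0, max_delay))


def random_drop(p):
    """Probabilistic omission: returns None for a dropped message."""
    return lambda m, rng: None if rng.random() < p else m


def inject(msgs, topic_prefix, window, ops, seed=0):
    """Apply `ops` to messages matching the topic filter inside the activation window."""
    rng = random.Random(seed)
    start, end = window
    delivered = []
    for m in msgs:
        if m.topic.startswith(topic_prefix) and start <= m.t < end:
            for op in ops:
                m = op(m, rng)
                if m is None:  # dropped: skip the remaining operators
                    break
        if m is not None:
            delivered.append(m)
    # Subscribers see messages in delivery-time order, so delays can reorder them.
    return sorted(delivered, key=lambda msg: msg.t)


stream = [Message("plant/temp", 20.0 + i, float(i)) for i in range(10)]
stuck_at_zero = map_op(lambda _payload: 0.0)
out = inject(stream, "plant/", (3.0, 6.0), [stuck_at_zero, random_drop(0.5)], seed=1)
for m in out:
    print(m)
```

Here a stuck-at fault followed by a 50% drop is active only for messages published in [3, 6) s. Traffic outside the window passes through unchanged, mirroring a static windowed schedule.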

3. Metrics and Evaluation Methodologies

3.1 Embedded and Real-Time Contexts

  • Timing Overhead Modeling: The change in WCET under fault injection, $\Delta C = C_{\mathrm{fault}} - C_{\mathrm{golden}}$, is modeled as Gaussian, $\Delta C \sim \mathcal{N}(\mu_{\Delta}, \sigma_{\Delta}^2)$.
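  • Real-time Schedulability: Safety margins are padded to accommodate the observed injection-induced overhead, requiring

$$C_{\mathrm{golden}} + \mu_{\Delta} + k\,\sigma_{\Delta} \leq D$$

for deadline $D$ and tolerance factor $k$ (e.g., $k = 3$, the three-sigma level usually quoted as 99.7% two-sided coverage; under the Gaussian model, this one-sided bound holds with probability ≈ 99.87%).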

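  • High-percentile WCET Estimate: An empirical 99.9th-percentile bound is taken from the order statistics of the execution times observed over $n$ runs,

$$\hat{C}_{0.999} = C_{(\lceil 0.999\, n\rceil)},$$

where $C_{(j)}$ denotes the $j$-th smallest observed execution time (Magliano et al., 2024).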
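The three timing quantities can be computed from per-run execution times, as sketched below. The helper names and sample values are hypothetical. The sample mean and standard deviation serve as estimates of $\mu_{\Delta}$ and $\sigma_{\Delta}$, and the formulas follow this subsection.

```python
import math
import statistics


def overhead_stats(c_fault, c_golden):
    """Sample mean and standard deviation of Delta C over paired runs."""
    deltas = [f - g for f, g in zip(c_fault, c_golden)]
    return statistics.mean(deltas), statistics.stdev(deltas)


def schedulable(c_golden, mu, sigma, deadline, k=3.0):
    """Padded deadline test: C_golden + mu + k * sigma <= D."""
    return c_golden + mu + k * sigma <= deadline


def wcet_quantile(times, q=0.999):
    """Order statistic C_(ceil(q * n)), with 1-based ranks as in the formula."""
    ordered = sorted(times)
    rank = math.ceil(q * len(ordered))
    return ordered[rank - 1]


golden = [10.0, 10.2, 9.9, 10.1]   # ms, hypothetical fault-free runs
faulty = [10.4, 10.9, 10.3, 10.6]  # ms, same inputs with injection enabled
mu, sigma = overhead_stats(faulty, golden)
print(f"mu = {mu:.3f} ms, sigma = {sigma:.3f} ms")
print("schedulable at D = 12 ms:", schedulable(max(golden), mu, sigma, deadline=12.0))
```

Note that $\lceil 0.999\,n \rceil = n$ whenever $n < 1000$. The estimate therefore degenerates to the observed maximum unless at least a thousand runs are available.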

3.2 IoT Fault Injection and Self-Healing

  • Fault Occurrence Rate: $\lambda = N_{\mathrm{injected}} / T_{\mathrm{total}}$
  • Fault Coverage: $C_f = N_{\mathrm{recovered}} / N_{\mathrm{injected}}$
  • Mean Time to Recovery (MTR):

$$\mathrm{MTR} = \frac{1}{N_{\mathrm{recovered}}} \sum_{i=1}^{N_{\mathrm{recovered}}} T_{\mathrm{recovery},i}$$

  • Reliability ($R(t)$): Probability that correct alarm levels persist up to time $t$, $R(t) \approx \Pr\{\text{no unhandled fault in } [0, t]\}$
  • Availability ($A$): $A = \frac{1}{T}\int_0^T U(\tau)\,d\tau$, where $U(\tau) = 1$ while the service is correct and $0$ otherwise.
  • Overlap Metric ($O$): Fraction of output states matching the baseline, $O = N_{\mathrm{matching}} / N_{\mathrm{total}}$ (Duarte et al., 2022).
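Given a per-fault log and an evenly sampled correctness trace, these metrics reduce to a few lines. The FaultRecord layout, the helper name iot_metrics, and the sample data are assumptions. Availability is approximated by the fraction of correct samples, which approximates the integral above when $U(\tau)$ is sampled at uniform intervals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultRecord:
    injected_at: float             # seconds since the start of the run
    recovered_at: Optional[float]  # None if the fault was never recovered


def iot_metrics(faults, t_total, correct_samples, outputs, baseline):
    recovered = [f for f in faults if f.recovered_at is not None]
    n_inj, n_rec = len(faults), len(recovered)
    nan = float("nan")
    return {
        "lambda": n_inj / t_total,                    # faults per second
        "coverage": n_rec / n_inj if n_inj else nan,  # C_f
        "mtr": (sum(f.recovered_at - f.injected_at for f in recovered) / n_rec
                if n_rec else nan),
        # Uniformly spaced 0/1 samples of U(tau) approximate (1/T) * integral of U.
        "availability": sum(correct_samples) / len(correct_samples),
        # Assumes outputs and baseline are aligned, equal-length state sequences.
        "overlap": sum(o == b for o, b in zip(outputs, baseline)) / len(baseline),
    }


faults = [FaultRecord(10.0, 11.0), FaultRecord(30.0, 31.5), FaultRecord(50.0, None)]
m = iot_metrics(faults, t_total=60.0,
                correct_samples=[1] * 57 + [0] * 3,
                outputs=["ok", "ok", "high", "ok"],
                baseline=["ok", "ok", "ok", "ok"])
print(m)
```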

4. Experimental Results and Comparative Analysis

4.1 Embedded Fault Injection

Empirical fault injection on an ARM Cortex-A9 at 650 MHz (PYNQ-Z2 board) produced markedly different outcome distributions depending on the injection target:

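Injection site  | Injected faults | Avg. run time | Benign (%) | SDC (%) | Crash (%)
----------------|-----------------|---------------|------------|---------|----------
Reg (Dijkstra)  | 5,180           | 261 ms        | ≈95        | 4–11    | ≈1
PC (Dijkstra)   | 155             | 30.3 s        | ≈10        | ≈5      | ≈85
Mem (Dijkstra)  | 3,330           | 260 ms        | ≈97        | ≈1      | ≈2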

These results show that PC injections are highly disruptive, whereas register and memory faults are predominantly benign, with a minority causing silent data corruption (SDC) and only rare crashes or hangs. The JTAG+PMU approach provides precise single-bit spatial and temporal control, full repeatability, and minimal perturbation of the code under test (Magliano et al., 2024).
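Outcome breakdowns such as the one in the table are obtained by comparing each faulty run against a fault-free golden run. A minimal classifier might look like the sketch below. The RunResult fields, the hang timeout of ten times the golden run time, and the counting of hangs as crashes are illustrative assumptions, not the exact criteria of the cited study.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    output: Optional[bytes]  # None if the run produced no output
    run_time: float          # seconds
    exception: bool = False  # e.g. data abort or undefined instruction


def classify(run, golden, timeout_factor=10.0):
    """Return 'Crash', 'SDC' or 'Benign' relative to the golden run."""
    hung = run.run_time > timeout_factor * golden.run_time
    if run.exception or hung or run.output is None:
        return "Crash"   # assumption: hangs are counted together with crashes
    if run.output != golden.output:
        return "SDC"     # completed normally but produced a wrong result
    return "Benign"


def outcome_percentages(runs, golden):
    counts = Counter(classify(r, golden) for r in runs)
    return {k: 100.0 * counts[k] / len(runs) for k in ("Benign", "SDC", "Crash")}


golden = RunResult(output=b"sorted:1,2,3", run_time=0.26)
runs = [RunResult(b"sorted:1,2,3", 0.26), RunResult(b"sorted:1,3,2", 0.27),
        RunResult(None, 30.0), RunResult(b"sorted:1,2,3", 0.25)]
print(outcome_percentages(runs, golden))
```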

4.2 IoT Fault Injection and Self-Healing

Experiments leveraging a fault-injection–enhanced Aedes MQTT broker and Node-RED SHEN nodes quantified alarm recovery and coverage:

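Experiment      | System | Transitions (Δ) | Overlap vs. BL (%)
----------------|--------|-----------------|-------------------
S1E2: stuck-at  | FI     | 148             | 40.0
                | FI+SH  | 27              | 98.1
S1E3: spikes    | FI     | 51              | 76.3
                | FI+SH  | 25              | 97.4

FI: fault injection only; FI+SH: fault injection with self-healing; BL: baseline.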

Availability under the stuck-at fault was $A \approx 98\%$ with self-healing and $A \approx 40\%$ without, and the mean time to recovery in the testbed was $\mathrm{MTR} \approx 1.2$ s. This indicates that proactive fault injection exposes deficiencies in naïve systems and demonstrates the effect of self-healing logic (range filtering, voting, compensation) in restoring correct output states (Duarte et al., 2022).

5. Practical Guidelines and Observations

  • Instrumenting systems with pluggable, programmable operator pipelines (in embedded firmware or in middleware brokers) is key to broad, automated coverage of fault scenarios.
  • Parameterize injections using well-defined activation counts, probabilities, and event-driven triggers, ensuring coverage of both static and dynamically emerging failure modes.
  • Logging each injection and outcome with full trace (ID, event count, timestamp, random seed) is critical for reproducibility and post-mortem analysis.
  • In embedded systems, injection routines must be scheduled carefully to prevent priority inversion or deadline violations; PMU-overflow ISRs should complete quickly (well under 100 ns) and run at a priority below that of critical tasks (Magliano et al., 2024).
  • In distributed IoT, a combination of timing, range, voting, and compensation filters should be embedded both at the data ingress and just before alarm actuation to cover a range of plausible anomalies.
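A guard stage that chains timing, range, voting, and compensation checks could be placed both at ingress and just before actuation. The sketch below is one possible form. The class name GuardChain, the thresholds, median-based voting, and hold-last-good compensation are assumptions for illustration, not the SHEN node implementation.

```python
import statistics
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class GuardChain:
    """Hypothetical guard stage: timing, range, voting, then compensation."""
    max_age: float                     # seconds; older readings count as missing
    lo: float                          # lower plausible bound
    hi: float                          # upper plausible bound
    last_good: Optional[float] = None

    def __call__(self, readings: Sequence[Tuple[float, float]], now: float) -> Optional[float]:
        # Timing filter: discard stale (value, timestamp) pairs, e.g. delayed messages.
        fresh = [v for v, t in readings if now - t <= self.max_age]
        # Range filter: discard implausible values such as spikes.
        plausible = [v for v in fresh if self.lo <= v <= self.hi]
        if plausible:
            # Voting: the median of redundant sources tolerates a faulty minority.
            self.last_good = statistics.median(plausible)
        # Compensation: otherwise hold the last good value (None before any).
        return self.last_good


guard = GuardChain(max_age=2.0, lo=0.0, hi=100.0)
first = guard([(20.0, 9.5), (22.0, 9.8), (950.0, 9.9)], now=10.0)  # spike removed
held = guard([(20.0, 1.0), (22.0, 1.5)], now=10.0)                 # all readings stale
print(first, held)
```

Running the same chain at both points means an anomaly that slips past ingress (for example, one introduced by downstream processing) still has to pass the checks before it can drive an alarm.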

6. Significance and Applications

Self-injection and proactive fault injection are central methodologies for validating and stress-testing both safety-critical embedded systems (automotive, aerospace) and mission-critical distributed IoT deployments. Combining JTAG/PMU-driven, single-bit-precise injection with in-node or broker-level self-injection lets practitioners profile both correctness and timing resilience (jitter, deadline misses) while retaining bit-perfect replay and minimizing test intrusion. By exposing the interplay between fault model, injection scenario, and dependability metrics (availability, reliability, coverage), these approaches reveal critical vulnerabilities and inform the hardening of in-situ self-healing architectures (Magliano et al., 2024; Duarte et al., 2022).

References (2)
