
Automatic Problem Fingerprinting

Updated 7 January 2026
  • The paper introduces an algorithmic method to synthesize unique EM fingerprints from firmware execution paths, eliminating laborious manual trace collection.
  • It constructs a synthetic library from instruction pair transitions; training on synthetic traces lowers AUC by 1.3% to 4.2% and ACC by 4.4% to 9.8% relative to training on real traces.
  • The approach scales to complex firmware updates and can extend to other side channels, though it requires careful handling of environmental noise.

Automatic problem fingerprinting, in the context of electromagnetic (EM)-based anomaly detection for embedded devices, refers to the process of generating unique, path-specific EM signal “fingerprints” corresponding to different execution paths in device firmware or software. Traditional fingerprinting requires the manual collection of EM traces for each code branch, a task that is labor-intensive and non-scalable due to the combinatorial explosion of possible execution paths even in simple programs. The method of automatic problem fingerprinting, as introduced by Vedros et al., synthesizes these EM signal traces algorithmically from the machine code, eliminating the need for exhaustive on-device measurement and substantially increasing scalability for downstream security mechanisms (Vedros et al., 2023).

1. Mathematical Framework for Synthetic EM Generation

Let $C = \{c_1, \dots, c_n\}$ denote the complete set of assembly instructions in the processor's instruction set architecture (ISA). Each instruction pair $(c_{i-1}, c_i)$ defines a transition whose EM emissions are characterized by a discrete sample window $S_{i,i-1} = \{s_1, \dots, s_m\}$, where $m$ reflects the temporal granularity imposed by the EM sampling rate and instruction timing.

A library $\mathcal{M}$ of EM “building blocks” is constructed:

$$\mathcal{M} = \{ ((c_{i-1}|c_i), S_{i,i-1}) \mid c_i, c_{i-1} \in C \}$$

An execution path $I = (c_1, c_2, \dots, c_L)$ can thus be mapped to a synthetic EM trace via

$$S' = f(I) = [S_{2,1} \| S_{3,2} \| \dots \| S_{L,L-1}]$$
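As a worked instance of this mapping, consider a made-up four-instruction path. The sketch below assumes that every window in the library holds exactly $m$ samples, which the source does not state explicitly:

```latex
I = (c_1, c_2, c_3, c_4)
\;\Longrightarrow\;
S' = f(I) = [\, S_{2,1} \,\|\, S_{3,2} \,\|\, S_{4,3} \,],
\qquad |S'| = 3m
```

Under the same assumption, a path of $L$ instructions has $L-1$ transitions and therefore yields a synthetic trace of $(L-1)\,m$ samples.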

The fidelity between synthetic and real EM signals is quantified through the normalized Euclidean distance

$$\mathrm{NED}(A,B) = 0.5 \, \frac{\mathrm{Var}(A-B)}{\mathrm{Var}(A) + \mathrm{Var}(B)}$$

with the goal that $\mathrm{NED}(S_{\mathrm{real}}(I), f(I)) \approx 0$.
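The distance can be computed directly from its definition. The following is a minimal sketch, not the authors' code: the function name `ned`, the toy traces, and the choice of population variance are assumptions. The ratio is unchanged if sample variance is used throughout, since the normalizing factor cancels.

```python
from statistics import pvariance

def ned(a, b):
    """Normalized Euclidean distance: 0.5 * Var(a - b) / (Var(a) + Var(b))."""
    if len(a) != len(b):
        raise ValueError("traces must have equal length")
    diff = [x - y for x, y in zip(a, b)]
    denom = pvariance(a) + pvariance(b)
    if denom == 0:
        raise ValueError("both traces are constant; NED is undefined")
    return 0.5 * pvariance(diff) / denom

# Toy trace (made-up values, not measured EM data).
real = [0.0, 1.0, 0.0, -1.0]
flipped = [-x for x in real]

print(ned(real, real))     # identical traces -> 0.0
print(ned(real, flipped))  # sign-flipped trace -> 1.0
```

A value near 0 indicates that the synthetic trace tracks the real one closely, while perfectly anti-correlated traces of equal variance reach 1.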

2. Synthetic Fingerprint Construction Pipeline

The construction pipeline is divided into offline and online phases:

  • Offline: Block Library Compilation
    • Micro-benchmarks are compiled to exercise the sequence $\dots, c_{i-1}, c_i, \dots$, padding with nop to enforce alignment.
    • $N$ instances of each sequence’s EM traces are measured, windowed, and stored in $\mathcal{M}$.
  • Online: Path-Specific Synthesis
    • Given a binary’s control-flow graph (CFG) breakdown into an instruction path $(c_1,\dots,c_L)$, the system reconstructs a synthetic EM trace $S'$ by concatenating randomly sampled EM blocks from the respective $(c_{k-1}|c_k)$ entries in $\mathcal{M}$ for each transition in the CFG path.

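The online step as a Python sketch, where M maps each (prev, curr) instruction pair to its list of recorded sample windows:

import random

def synthesize(path, M):
    # path: instruction sequence (c_1, ..., c_L); M: {(prev, curr): [window, ...]}
    S_prime = []
    for prev, curr in zip(path, path[1:]):        # transitions k = 2..L
        S_block = random.choice(M[(prev, curr)])  # one recorded window for this pair
        S_prime.extend(S_block)                   # concatenate onto the trace
    return S_prime

Each path through the CFG is thus “played back” at the EM-signal level, entirely from a library of pairwise instruction blocks.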
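To make the two phases concrete, here is a toy end-to-end run. It is a sketch under stated assumptions: the function names `build_library` and `synthesize_trace`, the instruction mnemonics, and all sample values are made up for illustration, and real windows would come from oscilloscope captures rather than hand-written lists.

```python
import random
from collections import defaultdict

def build_library(labeled_windows):
    """Offline phase: group measured sample windows by instruction pair."""
    library = defaultdict(list)
    for prev, curr, window in labeled_windows:
        library[(prev, curr)].append(list(window))
    return dict(library)

def synthesize_trace(path, library, rng=random):
    """Online phase: concatenate one random window per transition in the path."""
    trace = []
    for prev, curr in zip(path, path[1:]):
        trace.extend(rng.choice(library[(prev, curr)]))
    return trace

# Toy measurements (made-up numbers): (previous instruction, current instruction, samples)
measurements = [
    ("nop", "add", [0.1, 0.3]),
    ("nop", "add", [0.2, 0.3]),
    ("add", "st",  [0.7, 0.5]),
    ("st",  "nop", [0.4, 0.1]),
]

library = build_library(measurements)
trace = synthesize_trace(["nop", "add", "st", "nop"], library, random.Random(0))
print(len(trace))  # 3 transitions x 2 samples each = 6
```

Because the ("nop", "add") pair has two recorded windows, repeated synthesis of the same path produces slightly different traces, which is what gives the synthetic reference set its spread.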

3. Machine Learning Architecture and Detection Protocol

Detection is formulated as a semi-supervised anomaly detection task due to the open-endedness of possible malicious behaviors.

  • Feature Extraction: Each EM trace, synthetic or real, is downsampled or peak-aligned to a fixed-length feature vector $x \in \mathbb{R}^T$, using amplitude peaks at instruction-cycle boundaries to mitigate minor clock drift.
  • Anomaly Detector: A $k$-nearest-neighbors (kNN) strangeness-score system. No parametric model is trained; the method builds a non-parametric distribution of benign “normal” strangeness scores for each CFG path.
  • Strangeness Score:

$$\mathrm{strange}(q;X) = \sum_{x \in N_k(q;X)} \| q - x \|_2$$

where $N_k(q; X)$ is the set of $k$ nearest neighbors of $q$ in the benign reference set $X$.

  • Fingerprinting Baseline: For each benign path $i$, a reference set $X_i$ is built from synthetic traces, and the multiset $B_i$ of strangeness scores is computed:

$$B_i = \{ \mathrm{strange}(x;X_i) \mid x \in X_i \}$$

  • Anomaly Decision: For an observed trace $q$, compute its strangeness $\alpha = \mathrm{strange}(q; X_i)$ relative to path $i$, then its empirical $p$-value:

$$p_i(\alpha) = \frac{1 + |\{ \beta \in B_i \mid \beta \geq \alpha \}|}{1 + |B_i|}$$

If $p_i(\alpha) \le \tau$, $q$ is flagged as anomalous for path $i$. If $p_i(\alpha) > \tau$ for any path $i$, $q$ is labeled normal.

  • Threshold Selection: The rejection threshold $\tau$ is tuned via cross-validation. The ROC curve and AUC, along with accuracy (ACC) and F1-score, are reported by varying $\tau$.
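The detection protocol above can be sketched in a few functions. This is an illustrative reading of the formulas rather than the authors' implementation: all function names, the toy two-dimensional feature vectors, and the values of `K` and `TAU` are assumptions. One further assumption is labeled in the code: when scoring a reference trace against its own set, the trace itself is excluded from its neighbors, since otherwise its nearest neighbor would always be itself at distance 0. The source does not say how this case is handled.

```python
import math

def strangeness(q, reference, k, skip_index=None):
    """Sum of Euclidean distances from q to its k nearest neighbors in reference."""
    dists = sorted(
        math.dist(q, x) for i, x in enumerate(reference) if i != skip_index
    )
    return sum(dists[:k])

def baseline_scores(reference, k):
    """Strangeness of each reference trace against the rest of its own set.
    Assumption: the trace itself is left out of its neighbor search."""
    return [strangeness(x, reference, k, skip_index=i) for i, x in enumerate(reference)]

def p_value(alpha, baseline):
    """Empirical p-value: (1 + #{beta >= alpha}) / (1 + |B|)."""
    return (1 + sum(beta >= alpha for beta in baseline)) / (1 + len(baseline))

def is_anomalous(q, references_by_path, k, tau):
    """Normal if any path gives p > tau; anomalous only if every path rejects q."""
    for reference in references_by_path.values():
        alpha = strangeness(q, reference, k)
        if p_value(alpha, baseline_scores(reference, k)) > tau:
            return False
    return True

# Toy reference set for one benign path (made-up 2-D feature vectors).
refs = {"path_B": [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]]}
K, TAU = 2, 0.2

print(is_anomalous([0.05, 0.04], refs, K, TAU))  # close to the references -> False
print(is_anomalous([3.0, 3.0], refs, K, TAU))    # far from the references -> True
```

With only five baseline scores, the smallest attainable p-value is 1/6, so `TAU` must exceed that for anything to be flagged; a realistic reference set would be far larger.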

4. Experimental Protocol and Empirical Outcomes

The method was evaluated using the following experimental protocol:

  • Hardware/Platform: Arduino Mega (ATmega2560), near-field EM antenna, oscilloscope at 500 MS/s.
  • Software: Program A (17-instruction loop), Program B (benign update, 17 different instructions), two malicious variants (“B-easy”: 4 injected instructions; “B-hard”: 2 injected instructions).
  • Datasets: 1000 real traces for each program; 1000 synthetic B traces.
  • Evaluation: 10-fold cross-validation, balanced benign/anomaly split, $k=10$.
  • Metrics: Area Under Curve (AUC), accuracy (ACC), F1-score.

Summary of key results:

| Training Data | Test Variant | AUC | ACC | F1 | Relative Δ (vs real-only) |
|---|---|---|---|---|---|
| real A & real B | easy | 0.993 | 99.8% | 99.9% | -- |
| real A & real B | hard | 0.993 | 99.9% | 99.5% | -- |
| real A & synthetic B | easy | 0.980 | 95.4% | 95.5% | –1.3% AUC, –4.4% ACC |
| real A & synthetic B | hard | 0.951 | 90.1% | 90.6% | –4.2% AUC, –9.8% ACC |

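With synthetic training data for Program B, AUC falls by 1.3% on the easy variant (4 injected instructions) and by 4.2% on the hard variant (2 injected instructions) relative to real-signal training. Accuracy falls by 4.4% and 9.8%, and F1-score by 4.4% and 8.9%, respectively. All three metrics remain above 90% even for the smallest injection, which supports the method’s suitability in high-integrity contexts (Vedros et al., 2023).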
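The differences between real-only and synthetic training can be recomputed from the reported figures. The short sketch below does only that arithmetic; the dictionary layout and variable names are illustrative, and the numbers are copied from the results above.

```python
# (training data for B, test variant) -> (AUC, ACC %, F1 %), as reported above
results = {
    ("real", "easy"): (0.993, 99.8, 99.9),
    ("real", "hard"): (0.993, 99.9, 99.5),
    ("synthetic", "easy"): (0.980, 95.4, 95.5),
    ("synthetic", "hard"): (0.951, 90.1, 90.6),
}

deltas = {}
for variant in ("easy", "hard"):
    real_auc, real_acc, real_f1 = results[("real", variant)]
    syn_auc, syn_acc, syn_f1 = results[("synthetic", variant)]
    deltas[variant] = (
        round((syn_auc - real_auc) * 100, 1),  # AUC change, in points of 0.01
        round(syn_acc - real_acc, 1),          # ACC change, in percentage points
        round(syn_f1 - real_f1, 1),            # F1 change, in percentage points
    )
    print(variant, deltas[variant])

# easy (-1.3, -4.4, -4.4)
# hard (-4.2, -9.8, -8.9)
```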

5. Scalability, Limitations, and Required Conditions

  • Scalability: The method eliminates human labor in EM trace collection for each possible code path. Once a library $\mathcal{M}$ of EM blocks is constructed, it can be reused across firmware builds and even crowdsourced for given architectures. For firmware updates, only code parsing and lookup in $\mathcal{M}$ are necessary; no new EM measurements are required.
  • Library Size: $\mathcal{M}$ grows as $\mathcal{O}(|C|^2)$, requiring storage for every instruction pair. For large ISAs, this is a substantial but not necessarily prohibitive requirement.
  • Modeling Limitations: Real hardware may exhibit higher-order dependencies (for example, the effect of $c_{i-2}$ or of simultaneous memory accesses on EM signatures) that are not captured when only instruction pairs are modeled. The approach assumes that synthetic signals, generated by concatenating pairwise blocks, suffice for anomaly detection performance.
  • Environmental Factors: Noise from concurrent tasks or environmental factors will not be captured in synthetic traces. Deployed systems must address these confounders accordingly.
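To give a feel for the quadratic growth of the library, the sketch below estimates its size for a few ISA sizes. Every number here is hypothetical: the instruction counts, the number of stored traces per pair, the window length, and the sample width are placeholders chosen for illustration, not values from the paper.

```python
def library_entries(num_instructions):
    """Number of ordered instruction pairs (c_prev, c_curr), including repeats."""
    return num_instructions ** 2

def library_bytes(num_instructions, traces_per_pair, samples_per_window, bytes_per_sample):
    """Raw storage if every pair keeps traces_per_pair windows of fixed length."""
    return (library_entries(num_instructions)
            * traces_per_pair * samples_per_window * bytes_per_sample)

# Hypothetical parameters: 100 windows per pair, 64 samples each, 2 bytes per sample.
for n in (50, 100, 200):
    size = library_bytes(n, traces_per_pair=100, samples_per_window=64, bytes_per_sample=2)
    print(n, library_entries(n), f"{size / 1e6:.0f} MB")

# 50 2500 32 MB
# 100 10000 128 MB
# 200 40000 512 MB
```

Doubling the instruction count quadruples both the number of pairs and the measurement effort needed to fill them, which is the cost the library-size bullet refers to.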

6. Prospects for Generalization and Future Approaches

Generalization is possible through several routes:

  • ISA Reduction: Cluster similar instructions (e.g., grouping arithmetic operations) to reduce the number of unique pairs in $\mathcal{M}$.
  • Parametric Encoding: Replace the discrete library by learning a parametric kernel, e.g., using generative models such as GANs, to synthesize EM signals directly from code, bypassing strict library size growth.
  • Side-Channel Extension: The methodology generalizes to other side-channel modalities (power, acoustic, cache-access, etc.) by building minimal building-block libraries and reconstructing signals via concatenation.
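The ISA-reduction idea can be illustrated by keying the library on instruction classes rather than individual instructions. This is a sketch only: the grouping of AVR mnemonics below is a hypothetical example, and whether instructions in one class really share similar EM signatures would have to be verified by measurement.

```python
# Hypothetical grouping of a few AVR mnemonics into coarse classes.
INSTRUCTION_CLASS = {
    "add": "alu", "sub": "alu", "and": "alu", "or": "alu",
    "ld": "mem", "st": "mem",
    "rjmp": "branch", "breq": "branch",
    "nop": "nop",
}

def reduced_key(prev, curr, classes=INSTRUCTION_CLASS):
    """Library key after clustering: a pair of classes instead of instructions."""
    return (classes[prev], classes[curr])

full_pairs = len(INSTRUCTION_CLASS) ** 2                    # 9 instructions
reduced_pairs = len(set(INSTRUCTION_CLASS.values())) ** 2   # 4 classes

print(full_pairs, reduced_pairs)   # 81 16
print(reduced_key("add", "st"))    # ('alu', 'mem')
print(reduced_key("sub", "ld"))    # ('alu', 'mem'), same library entry as above
```

The trade-off is that any EM difference between instructions within a class is averaged away, so coarser classes shrink the library at the possible cost of fingerprint fidelity.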

A plausible implication is that algorithmic synthesis, by enabling automated, large-scale, non-intrusive fingerprinting, substantially broadens the applicability of EM-based anomaly detection in embedded and IoT systems and removes the bottleneck of manual trace acquisition and tuning (Vedros et al., 2023).

References (1)
