SIFT-SNN Framework for Anomaly Detection
- The SIFT-SNN framework integrates explicit SIFT feature extraction with latency-driven spike encoding, achieving sub-10 ms inference for real-time structural anomaly detection.
- It employs a multilayer LIF spiking neural network to preserve spatial feature relationships, delivering high classification accuracy with low power consumption.
- The framework demonstrates robust edge deployment performance by combining interpretable feature visualization with validated neuromorphic processing under real-world conditions.
The SIFT-SNN framework is a hybrid neuromorphic signal-processing pipeline designed for real-time structural anomaly detection in transport infrastructure, with a focus on movable concrete barrier systems. By integrating Scale-Invariant Feature Transform (SIFT) keypoint descriptors with a latency-driven spike encoding scheme and a multilayer Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN), the framework achieves sub-10 ms inference latency and supports low-power, interpretable edge deployment. Unlike conventional CNN-based approaches, SIFT-SNN explicitly preserves the spatial relationships of input features, offering improved transparency and spatial grounding in decision-making (Rathee et al., 26 Nov 2025).
1. End-to-End Pipeline Architecture
The SIFT-SNN architecture proceeds in three sequential stages:
(a) Pre-processing & SIFT Keypoint Detection:
Each input is a high-resolution region of interest (ROI) around a safety pin. The image is converted to grayscale, histogram-equalised, and normalised to zero mean and unit variance. SIFT feature extraction is performed via scale-space extrema detection, using the Difference of Gaussians (DoG) formulation

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y),

where G is a Gaussian kernel, I is the input image, and k separates adjacent scales. Keypoints are located at local extrema of D(x, y, σ) across space and scale. At each keypoint, gradient magnitude and orientation are calculated from the Gaussian-smoothed image L at the keypoint's scale:

m(x, y) = √[(L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²],
θ(x, y) = atan2(L(x, y+1) − L(x, y−1), L(x+1, y) − L(x−1, y)).
Around each keypoint, a 16×16 patch is divided into 4×4 subregions, each forming an 8-bin orientation histogram weighted by gradient magnitude. The resulting 128-dimensional descriptors are L2-normalised and contrast-thresholded. The top 100 keypoints (ranked by DoG response) are retained, or zero-padded if fewer are found, forming a 12,800-element vector per frame (100 keypoints × 128 dimensions).
(b) Latency-Driven Spike Conversion:
The 12,800-D descriptor is normalised to [0, 1], then a one-spike-per-channel latency code is applied, mapping each channel value x_i to a spike time t_i = T(1 − x_i) within the T = 100 ms window, so that high-salience features (large x_i) result in earlier spikes. Only one spike per channel is used, yielding an average spike activity of 8.1%. The original spatial ordering of channels is preserved, ensuring spatial neighbourhoods in the image correspond to proximal spike channels.
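The latency code above can be sketched in a few lines of plain Python. The linear mapping and the 100 ms / 1 ms grid follow the text; treating zero-valued channels as silent is an assumption made for illustration:

```python
T_WINDOW_MS = 100  # fixed observation window (100 ms, 1 ms resolution)

def latency_encode(descriptor):
    """Map normalised descriptor values in [0, 1] to one spike time per channel.

    Large values fire early: t_i = T * (1 - x_i), rounded to the 1 ms grid.
    Zero-valued channels are treated as silent (no spike) -- an assumption.
    """
    spike_times = []
    for x in descriptor:
        if x <= 0.0:
            spike_times.append(None)                      # silent channel
        else:
            spike_times.append(round(T_WINDOW_MS * (1.0 - x)))
    return spike_times
```

For example, `latency_encode([1.0, 0.5, 0.0])` yields `[0, 50, None]`: the most salient channel fires immediately, the mid-valued one at 50 ms, and the empty one not at all.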
(c) LIF Spiking Neural Network Classification:
The resulting spike trains are processed by a feed-forward SNN with the following architecture:
- Input: 12,800 spike channels
- Hidden layer 1: 512 LIF neurons
- Hidden layer 2: 128 LIF neurons
- Output: 2 LIF neurons (“Pin_OK” vs. “Pin_OUT”)
Each LIF neuron is governed by the membrane equation

τ_m dV/dt = −(V − V_rest) + R·I(t),

with standard spike-reset and refractory-period behaviour: when V reaches the threshold V_th, the neuron emits a spike, V is reset, and integration is suspended for the absolute refractory period. Final classification is performed by comparing the spike counts of the two output neurons within the stimulus window.
2. Spike Encoding and Neuron Model
The single-spike, latency-based code maps the continuous [0,1] descriptor values linearly onto spike times within a 100 ms observation window, with 1 ms simulation resolution. This code achieves significant temporal sparsity (mean 8.1% activity rate) and is designed to propagate the spatial structure of input features through the network by maintaining channel ordering.
Neuron dynamics follow the LIF model with parameters:
- Membrane time constant τ_m (ms),
- Rest/reset potential V_rest,
- Firing threshold V_th,
- Membrane resistance R,
- Absolute refractory period: 2 ms.
These settings deliver compute efficiency suitable for edge environments while enabling the network to exploit timing precision for classification.
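As a concrete illustration of these dynamics, a minimal Euler-stepped LIF neuron can be written in plain Python. The parameter values below are placeholders chosen for the sketch, not the paper's reported settings; only the 2 ms refractory period and 1 ms resolution come from the text:

```python
TAU_M = 10.0      # membrane time constant (ms) -- assumed value
V_REST = 0.0      # rest/reset potential -- assumed value
V_TH = 1.0        # firing threshold -- assumed value
R_M = 1.0         # membrane resistance -- assumed value
T_REF = 2.0       # absolute refractory period (ms), as in the text
DT = 1.0          # simulation resolution (ms), as in the text

def simulate_lif(input_current, n_steps):
    """Simulate one LIF neuron; input_current maps step index -> I(t)."""
    v, refractory_left, spikes = V_REST, 0.0, []
    for step in range(n_steps):
        if refractory_left > 0:
            refractory_left -= DT        # sit out the refractory period
            continue
        # Euler update of: tau_m dV/dt = -(V - V_rest) + R * I(t)
        v += (DT / TAU_M) * (-(v - V_REST) + R_M * input_current(step))
        if v >= V_TH:
            spikes.append(step)          # record spike time (ms)
            v = V_REST                   # reset membrane potential
            refractory_left = T_REF      # enter refractory period
    return spikes
```

With a constant suprathreshold current the neuron fires periodically; with no input it stays silent, mirroring the sparse regime the latency code induces.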
3. Network Training Regimen
Training is performed in PyTorch with snnTorch and cross-verified in Brian2. All-to-all connectivity is used between layers. Loss minimisation is via surrogate-gradient descent using Adam, with an initial learning rate of 0.001 decayed by a factor of 0.95 per epoch over 50 epochs (converging by epoch 45). Batch size is 64. Weights are initialised with He (Kaiming) uniform initialisation, the PyTorch default. No explicit L1/L2 regularisation is applied, as the latency code already enforces sparse activity. Binary cross-entropy loss is used for the two-class problem.
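Surrogate-gradient descent works around the non-differentiable spike: the forward pass uses the hard Heaviside threshold, while the backward pass substitutes a smooth surrogate derivative. A minimal sketch using the fast-sigmoid surrogate (a common choice in snnTorch; the paper does not state which surrogate it uses):

```python
def spike_forward(v, v_th=1.0):
    """Forward pass: hard, non-differentiable Heaviside spike function."""
    return 1.0 if v >= v_th else 0.0

def spike_surrogate_grad(v, v_th=1.0, slope=25.0):
    """Backward pass: fast-sigmoid surrogate derivative.

    Approximates d(spike)/dv by 1 / (1 + slope * |v - v_th|)^2, which
    peaks at the threshold and decays smoothly away from it.
    The slope value is an illustrative choice.
    """
    return 1.0 / (1.0 + slope * abs(v - v_th)) ** 2
```

During training, spikes are generated with `spike_forward` while gradients flow through `spike_surrogate_grad`, so weight updates are largest for neurons whose membrane potential sits near threshold.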
4. Empirical Performance and Resource Usage
On a held-out test cohort (900 frames, 15% of the complete Auckland Harbour Bridge dataset), SIFT-SNN achieves:
- Classification accuracy: ,
- F1 score: 91.0%,
- Precision (for unsafe “Pin_OUT”): 86.0%,
- Recall (for unsafe “Pin_OUT”): 88.0%,
- Mean inference latency: 9.5 ms/frame (GPU), 26 ms/frame (CPU),
- Mean spike activity per frame: 8.1%.
System-level power draw during GPU inference is ~35 W, and ~20 W in CPU-only mode. The framework thus supports sub-10 ms per-frame latency on GPU and sub-30 ms on CPU, suitable for embedded deployment.
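As a quick sanity check on these latency figures, the implied per-device frame rates follow directly:

```python
def frames_per_second(latency_ms):
    """Convert a mean per-frame inference latency (ms) to throughput (fps)."""
    return 1000.0 / latency_ms

gpu_fps = frames_per_second(9.5)   # ~105 fps on GPU
cpu_fps = frames_per_second(26.0)  # ~38 fps on CPU
```

Both comfortably exceed typical camera frame rates, which is what makes real-time per-frame classification feasible on either device.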
5. Implementation Context and Deployment Characteristics
All experiments were executed on a consumer-grade platform (Intel Core i7-13650HX, 64 GB DDR5, NVIDIA RTX 4060) using Python 3.10, OpenCV 4.x, PyTorch 2.0, snnTorch, and Brian2. Edge deployment is demonstrated on a CPU-only laptop, consistently delivering <30 ms per-frame latency. Key system optimisations include:
- Use of a fixed, 100 ms temporal window per inference for deterministic timing,
- One-spike latency coding for channel activity minimisation,
- Top-100 SIFT keypoint restriction to bound input dimensionality.
6. Interpretability, Augmentation, and Generalisation Considerations
By leveraging explicit SIFT descriptors as the basis for the spike code, the SIFT-SNN approach allows for visualisation and direct spatial attribution of salient features. Spike-latency heatmaps can be overlaid on the original ROI, correlating early (high-value) spikes with decisive geometric cues such as pin edges and corners.
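The spike-latency attribution described above can be sketched as follows: each keypoint's 128 channels are scanned for their earliest spike, and that latency is converted to a salience score at the keypoint's image location. The function name, descriptor layout, and scoring rule are illustrative assumptions, not the authors' exact procedure:

```python
def latency_heatmap(keypoint_xy, spike_times, t_window=100.0):
    """Convert per-channel spike times into per-keypoint salience scores.

    keypoint_xy: list of (x, y) keypoint locations, one per keypoint
    spike_times: flat list of spike times (ms), 128 channels per keypoint;
                 None marks a silent channel
    Returns (x, y, salience) triples; earlier spikes => higher salience.
    """
    scores = []
    for k, (x, y) in enumerate(keypoint_xy):
        channels = spike_times[k * 128:(k + 1) * 128]
        fired = [t for t in channels if t is not None]
        if not fired:
            scores.append((x, y, 0.0))       # keypoint never fired
        else:
            # Earliest spike across the keypoint's 128 channels
            scores.append((x, y, 1.0 - min(fired) / t_window))
    return scores
```

The resulting (x, y, salience) triples can then be rendered as a heatmap over the ROI, so that early-firing keypoints such as pin edges and corners stand out.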
Dataset augmentation compensates for the scarcity of unsafe examples. Techniques include perspective warping, illumination shifts, occlusion overlays, ±10° rotation, positional jitter, and morphological distortions, achieving an approximate 3:1 safe-to-unsafe class ratio. While this prevents overfitting to limited real unsafe data and improves robustness to these perturbations, generalisation to genuinely novel field conditions (e.g., unprecedented weather or occlusions) has not been fully established and is highlighted as an area for future validation (Rathee et al., 26 Nov 2025). Boundary conditions related to population drift or hardware-specific timing accuracy similarly remain as open questions.
In sum, the SIFT-SNN framework constitutes a neuromorphic, interpretable, and computationally efficient pipeline for anomaly detection in transport infrastructure, with specificity for safety-critical, spatially localised classification under real-world constraints.