
SpikeViMFormer: SNN Geo-Localization

Updated 29 December 2025
  • The paper presents SpikeViMFormer, a spike-driven transformer framework that achieves state-of-the-art drone-view geo-localization with significantly lower energy consumption.
  • It integrates spike-driven selective attention and hybrid state-space blocks to recover critical features and capture both local details and long-range dependencies.
  • Empirical evaluations show that SpikeViMFormer-S attains competitive accuracy compared to heavy ANN models while using 8.4× fewer parameters and 13.24× less energy.

SpikeViMFormer is a spike-driven transformer framework specifically designed for the high-efficiency, high-accuracy demands of drone-view geo-localization (DVGL) using Spiking Neural Networks (SNNs). This architecture directly addresses the challenges of information loss and limited long-range dependency modeling endemic to SNNs under sparse activation regimes. It achieves state-of-the-art (SOTA) performance among SNN systems and competitive results relative to advanced Artificial Neural Networks (ANNs), at a several-fold lower computational and energy cost owing to spike-based computation (Chen et al., 22 Dec 2025).

1. Principal Architecture and Computational Flow

SpikeViMFormer comprises a lightweight spike-driven transformer backbone, referred to as $\mathcal{B}_\theta$. The model processes paired images (drone and satellite, $I_d^i, I_s^j \in \mathbb{R}^{384\times384}$), extracting representations via a sequence of SNN-ified modules:

  • Backbone Structure: The backbone contains two convolutional SNN blocks and two transformer SNN blocks, with interleaved downsampling layers. During inference, only $\mathcal{B}_\theta$ is retained.
  • Neuron Model and SNN Operations: SpikeViMFormer uses Leaky Integrate-and-Fire (LIF) neurons, in both the classical "soft-reset" form and the more stable Normalized Integer LIF (NI-LIF) variant. For NI-LIF, integer spikes are normalized according to $U[t]=H[t-1]+X[t]$, $S[t]=\mathrm{Clip}(\mathrm{round}(U[t]),0,D)/D$, $H[t]=\beta\bigl(U[t]-S[t]\times D\bigr)$ (a minimal code sketch of this update follows this list).
  • Spike-Driven Modules: All convolutions, attention blocks, and MLPs are replaced with spike-driven versions (e.g. $\mathrm{Conv} \to \mathrm{SN}+\mathrm{DWConv}$, with $\mathrm{DWConv}$ denoting depthwise convolution).
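A minimal, framework-agnostic sketch of the NI-LIF update above; shapes and surrogate-gradient handling are omitted, and the values of `D` and `beta` here are illustrative assumptions, not the paper's settings:

```python
import torch

def ni_lif_step(x_t, h_prev, D=4, beta=0.5):
    """One NI-LIF timestep following U[t]=H[t-1]+X[t],
    S[t]=Clip(round(U[t]),0,D)/D, H[t]=beta*(U[t]-S[t]*D)."""
    u_t = h_prev + x_t                              # membrane integration
    s_t = torch.clamp(torch.round(u_t), 0, D) / D   # normalized integer spike output
    h_t = beta * (u_t - s_t * D)                    # leaky soft reset of the residual potential
    return s_t, h_t
```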

The training phase augments the backbone with two supplementary modules—Spike-driven Selective Attention (SSA) and Spike-driven Hybrid State Space (SHS)—both of which are pruned for fast inference, while a Hierarchical Re-ranking Alignment Learning (HRAL) regime ensures backbone robustness in the absence of these aids at test time (Chen et al., 22 Dec 2025).

2. Spike-driven Selective Attention (SSA) Block

The SSA block addresses information loss from spiking quantization and enhances feature discriminability through spike-driven gating:

  • Dual Gating: Two spike-driven gates are used—$A$ (global, attention to salient patches) and $G$ (local, patchwise importance)—each computed with spike-compatible operations and learnable parameters. SSA applies gating in both local and global paths, combining effects through elementwise multiplication and summation.
  • Pipeline:
  1. Convolutional Positional Encoding (CPE): $F_d^{i+} = F_d^i + \mathrm{Fla}(\mathrm{DWConv}(\mathrm{Res}(F_d^i, P, P)))$ (sketched in code at the end of this section)
  2. Spike-processed gating: $\widetilde{F}_d^{i+}$ via SNN layers; $A$ and $G$ from network projections and spike normalizations.
  3. Residual Fusion and MLP Refinement: Multi-level skip connections and spike-driven MLP refinement.
  • Loss Function: Cosine-embedding loss is used to maximize similarity between paired sub-features across views for top-$n$ active regions:

$$\mathcal L_1 = \frac{1}{N}\sum_{i,j} \left(1 - \cos(\widehat{F}_d^i, \widehat{F}_s^j)\right).$$
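A minimal sketch of this cosine-embedding objective, assuming paired drone/satellite sub-features of shape (N, C); the top-$n$ region selection is omitted:

```python
import torch.nn.functional as F

def cosine_alignment_loss(drone_feats, sat_feats):
    """Mean (1 - cosine similarity) over paired sub-features, matching L_1 above."""
    return (1.0 - F.cosine_similarity(drone_feats, sat_feats, dim=-1)).mean()
```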

This dual-path gating with multi-level residuals recovers critical information potentially lost to spiking sparsity (Chen et al., 22 Dec 2025).
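For concreteness, a minimal sketch of the CPE step from the pipeline above, assuming $\mathrm{Res}$ reshapes the token sequence to a $P \times P$ grid and $\mathrm{Fla}$ flattens it back; the spiking wrappers around the depthwise convolution and the exact kernel size are not specified in this summary:

```python
import torch
import torch.nn as nn

class ConvPosEnc(nn.Module):
    """Convolutional positional encoding: F+ = F + Fla(DWConv(Res(F, P, P)))."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)  # depthwise conv

    def forward(self, tokens, P):
        B, N, C = tokens.shape                               # N == P * P
        grid = tokens.transpose(1, 2).reshape(B, C, P, P)    # Res(F, P, P)
        pos = self.dwconv(grid).flatten(2).transpose(1, 2)   # Fla(DWConv(.))
        return tokens + pos                                  # residual add
```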

3. Spike-driven Hybrid State Space (SHS) Block

To capture non-local dependencies that escape purely convolutional or locally focused attention structures, SHS alternates between 2D convolutions and linear-complexity global sequence modeling:

  • Spatial Prior Injection: Each feature $\widehat{F}_d^i$ is spatially reshaped and locally convolved: $F_d^{\eta i} = \mathrm{Res}(\widehat{F}_d^i, H, W)$; $F_d^{\widetilde{\eta i}} = F_d^{\eta i} + \mathrm{DWConv}(\mathrm{SN}(F_d^{\eta i}))$ (see the sketch at the end of this section).
  • Hybrid State-Space Modeling: Flattened tokens undergo a linear state-space transformation using HSM-SSD, restoring the spatial context post-operation.
  • Loss Function: A patch-wise cross-entropy classification loss is imposed:

$$\mathcal L_2\ \text{on pairs}\ (\widehat{F}_{d_k}^{\eta i},\ \widehat{F}_{s_k}^{\eta j}).$$

Alternation of 2D and 1D operations enables robust modeling of both local details and long-range correlations, critical for visual disambiguation in complex geo-localization conditions.
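A minimal sketch of the spatial prior injection step above, with a placeholder threshold standing in for the paper's spiking neuron $\mathrm{SN}(\cdot)$; channel count and kernel size are assumptions:

```python
import torch
import torch.nn as nn

class SpatialPriorInjection(nn.Module):
    """Sketch of F~ = F + DWConv(SN(F)) on an H x W feature map."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)  # depthwise conv

    def forward(self, tokens, H, W):
        B, N, C = tokens.shape                             # N == H * W
        x = tokens.transpose(1, 2).reshape(B, C, H, W)     # Res(., H, W)
        spikes = (x > 0).float()                           # placeholder for SN(.)
        out = x + self.dwconv(spikes)                      # residual spatial prior
        return out.flatten(2).transpose(1, 2)              # back to tokens for the state-space step
```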

4. Hierarchical Re-ranking Alignment Learning (HRAL)

HRAL replaces traditional test-time feature re-ranking with a supervisory strategy imposed during training, aligning raw backbone features with their refined (re-ranked) counterparts:

  • Batch and History Consistency: The method computes top-$k$ reciprocal sets for affinity construction, query-expansion smoothing, and residual diffusion, followed by feature $\ell_2$ normalization (a query-expansion sketch follows this list).
  • Contrastive Alignment: Consistency across batches (current and historical) and between cross-view pairs is enforced through a combination of cosine, KL-divergence, and InfoNCE losses (a generic InfoNCE sketch appears at the end of this section).
  • Total Loss:

$$\mathcal L_3 = \mathcal L_{\mathrm{current}} + \mathcal L_{\mathrm{historical}} + \mathcal L_{\mathrm{InfoNCE}}$$
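As referenced in the first item above, a generic query-expansion smoothing sketch; the reciprocal-set filtering and residual diffusion used by HRAL are omitted, and `k=15` simply mirrors the configuration reported in Section 6:

```python
import torch
import torch.nn.functional as F

def query_expansion_smooth(feats, k=15):
    """Replace each feature by the mean of its top-k cosine neighbours, then l2-normalize."""
    f = F.normalize(feats, dim=-1)          # (N, C)
    sim = f @ f.t()                         # pairwise cosine similarity
    idx = sim.topk(k, dim=-1).indices       # top-k neighbour indices per feature
    smoothed = f[idx].mean(dim=1)           # average neighbour features: (N, k, C) -> (N, C)
    return F.normalize(smoothed, dim=-1)
```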

This procedure induces the spike-driven backbone to learn features that inherently possess the desirable re-ranking properties, permitting all auxiliary modules to be pruned during inference without degradation in matching performance (Chen et al., 22 Dec 2025).
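The InfoNCE term is not spelled out in this summary; below is a generic symmetric cross-view formulation for illustration, where the temperature and the in-batch pairing scheme are assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(drone_feats, sat_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired cross-view features."""
    d = F.normalize(drone_feats, dim=-1)
    s = F.normalize(sat_feats, dim=-1)
    logits = d @ s.t() / temperature                     # (B, B) cross-view similarities
    targets = torch.arange(d.size(0), device=d.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```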

5. Empirical Performance

SpikeViMFormer demonstrates robust improvements on DVGL tasks, significantly outperforming previous SNN transformers and maintaining strong parity with large ANN baselines in accuracy at a fraction of the computational and memory cost.

| Model | Params (M) | Power (mJ) | U1652 R@1 (d→s / s→d) | U1652 AP | SUES-200 R@1 (150–300 m) |
|---|---|---|---|---|---|
| SDPL (ANN) | 42.6 | 320.7 | 90.16 / 93.58 | 91.64 | 82.95–97.83 |
| Meta-SpikeFormer | 55.4 | 308.3 | 78.94 / 88.59 | 82.07 | – |
| SpikeViMFormer-T | 9.78 | 32.5 | 86.10 / 91.58 | 88.32 | – |
| SpikeViMFormer-S | 18.63 | 53.9 | 88.03 / 92.72 | 89.98 | 83.48–93.90 |

Relative to heavy ANN baselines, SpikeViMFormer-S obtains similar R@1 and AP with 8.4× fewer parameters and 13.24× less energy consumption on the University-1652 dataset. On SUES-200, it outperforms most prior SNNs and several ANNs at each tested altitude for the drone→satellite retrieval challenge (Chen et al., 22 Dec 2025).

6. Training Configuration and Implementation

SpikeViMFormer utilizes NI-LIF neurons with four spike timesteps ($T=4$), trains with AdamW (learning rate $1\times10^{-4}$, batch size 64), and employs the NI-LIF integer normalization level $D$ described in Section 1. Datasets include University-1652 and SUES-200; evaluation metrics are Recall@K, mean AP, and estimated energy per inference, calculated from the number of spike-driven operations and standard SNN energy models. All convolutions, attention, and MLPs in the network are implemented as spike-driven analogs. Hyperparameters for loss balancing are set to $\lambda_1=0.6$ and $\lambda_2=0.54$, with query expansion neighborhood $k=15$ (Chen et al., 22 Dec 2025).
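A sketch of the standard SNN energy accounting alluded to above, using the 45 nm per-operation energies commonly adopted in the SNN literature (0.9 pJ per accumulate, 4.6 pJ per multiply-accumulate); the paper's exact accounting may differ:

```python
def estimate_snn_energy_mJ(sops, macs=0, e_ac_pJ=0.9, e_mac_pJ=4.6):
    """Energy = accumulate energy for spike-driven synaptic operations (SOPs)
    plus MAC energy for any remaining dense layers, returned in millijoules."""
    total_pJ = sops * e_ac_pJ + macs * e_mac_pJ
    return total_pJ * 1e-9  # pJ -> mJ

# SOPs are typically estimated as firing_rate * timesteps * FLOPs of each layer.
```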

7. Significance and Design Rationale

SpikeViMFormer’s architecture integrates:

  • SSA: Gating at both macro and micro levels to recover information lost to spike sparsity, keeping focus on discriminative image regions. Multi-level residuals further prevent features from vanishing.
  • SHS: Interleaving local convolution and state-space sequence modeling for joint local-global token context, enabling robust modeling of long-range cross-view correspondences essential for geo-localization.
  • HRAL: Replacing computationally intensive test-time re-ranking with training-time contrastive alignment, ensuring efficiency and resilience across batch and domain variations.

Collectively, these strategies position SpikeViMFormer as a scalable, energy-efficient solution for geo-localization in power-constrained applications, closing the performance gap to heavy ANN models while leveraging the intrinsic sparsity of neuromorphic computation (Chen et al., 22 Dec 2025).
