Multi-Scale Gaussian KAN (MSGKAN)
- MSGKAN is a nonlinear feature transform module that augments spatial features with multi-scale Gaussian RBF embeddings for effective scale-adaptive detection.
- It integrates a concise local convolutional encoder to fuse complementary spatial and frequency-domain features within the SFFR architecture.
- Empirical results on the SeaDronesSee dataset show improved mAP, confirming its efficacy in handling variable UAV altitudes and multi-scale challenges.
The Multi-Scale Gaussian Kolmogorov–Arnold Network (MSGKAN) is a nonlinear feature transform module introduced within the SFFR (Spatial-Frequency Feature Reconstruction) architecture for multispectral aerial object detection. MSGKAN augments intermediate spatial-domain features with multi-scale Gaussian Radial Basis Function (RBF) embeddings inspired by the Kolmogorov–Arnold decomposition, enhancing the model's adaptability to variable object scales and improving detector robustness under changing UAV flight altitudes. By parameterizing learnable Gaussian centers and incorporating a concise local convolutional encoder, MSGKAN achieves effective nonlinear feature modeling tailored to both fine- and coarse-scale image structures germane to remote sensing and UAV scenarios.
1. MSGKAN in SFFR: Role and Architectural Integration
MSGKAN operates as the spatial-domain feature reconstruction module in the dual-branch KANFusion block of SFFR, targeting intermediate feature maps from individual sensor modalities (e.g., RGB or IR). For an input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$, MSGKAN transforms each local feature vector $x_{i,j} \in \mathbb{R}^{C}$ at spatial position $(i,j)$ through a nonlinear expansion in a multi-scale Gaussian space, followed by a small convolutional encoder. MSGKAN's output $F_{\text{MSGKAN}}$ is then combined with the complementary frequency-domain feature $F_{\text{FCEKAN}}$ from the FCEKAN module using learnable weights $\alpha$ and $\beta$:

$$F = \alpha \, F_{\text{MSGKAN}} + \beta \, F_{\text{FCEKAN}}.$$

An analogous path exists for the other branch, ensuring joint enrichment of features in both spatial and frequency domains prior to cross-modal fusion.
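As a rough illustration of this weighted combination, the following PyTorch sketch treats the MSGKAN and FCEKAN branches as opaque modules and mixes their outputs with learnable scalar weights; the class and parameter names are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialFrequencyBranch(nn.Module):
    """Combine spatial (MSGKAN) and frequency (FCEKAN) features with learnable weights.

    `msgkan` and `fcekan` are assumed to map (B, C, H, W) -> (B, C, H, W).
    """
    def __init__(self, msgkan: nn.Module, fcekan: nn.Module):
        super().__init__()
        self.msgkan = msgkan
        self.fcekan = fcekan
        # Learnable combination weights (scalar per branch; a per-channel
        # parameterization would also be possible).
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_spatial = self.msgkan(x)   # spatial-domain reconstruction
        f_freq = self.fcekan(x)      # frequency-domain reconstruction
        return self.alpha * f_spatial + self.beta * f_freq
```

Substituting `nn.Identity()` for both branches is enough to sanity-check shapes and the gradient path through the two weights.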
2. Multi-Scale Gaussian Basis Design
MSGKAN constructs a nonlinear embedding for each feature vector $x_{i,j} \in \mathbb{R}^{C}$ using a set of $K$ learnable RBF centers $\{c_k\}_{k=1}^{K}$ and a fixed bank of $S$ Gaussian bandwidths $\{\sigma_s\}_{s=1}^{S}$. The Gaussian basis functions are defined as

$$\phi_{k,s}(x_{i,j}) = \exp\!\left(-\frac{\lVert x_{i,j} - c_k \rVert^{2}}{2\sigma_s^{2}}\right),$$

where the scale parameters $\sigma_s$ control the width of the radial response and thereby dictate the receptive field of each basis. Each basis essentially measures feature similarity at a particular scale and with respect to a particular learned centroid.
3. Nonlinear Mapping and KAN Style Embedding
MSGKAN adopts the Kolmogorov–Arnold paradigm by expressing nonlinear transformations as sums over univariate mappings and their compositions. For each spatial position $(i,j)$, the input feature $x_{i,j}$ is projected into the multi-scale Gaussian basis and linearly combined using a set of learnable weights $w_{k,s}$ (one per center–scale pair), typically realized via a pointwise convolution:

$$y_{i,j} = \mathrm{Conv}\!\left(\sum_{k=1}^{K}\sum_{s=1}^{S} w_{k,s}\,\phi_{k,s}(x_{i,j})\right).$$
Here, the Conv operator is a compact local convolution that restores spatial context after the per-position RBF embedding.
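A minimal PyTorch sketch of this embedding is given below, with the weights $w_{k,s}$ realized as a 1x1 convolution over the stacked basis responses; the choice of eight centers and the width bank (0.5, 1.0, 2.0) are illustrative defaults, not values from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleGaussianEmbedding(nn.Module):
    """phi_{k,s}(x) = exp(-||x - c_k||^2 / (2 * sigma_s^2)), recombined by a 1x1 conv."""
    def __init__(self, channels: int, num_centers: int = 8, sigmas=(0.5, 1.0, 2.0)):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, channels))  # learnable c_k
        self.register_buffer("sigmas", torch.tensor(sigmas))             # fixed sigma_s bank
        # Learnable weights w_{k,s}: one coefficient per center-scale pair and
        # output channel, realized as a pointwise (1x1) convolution.
        self.project = nn.Conv2d(num_centers * len(sigmas), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        feat = x.permute(0, 2, 3, 1).reshape(B, H * W, 1, C)             # (B, HW, 1, C)
        dist2 = ((feat - self.centers.view(1, 1, -1, C)) ** 2).sum(-1)   # (B, HW, K)
        # Expand each distance over all scales: (B, HW, K, S)
        phi = torch.exp(-dist2.unsqueeze(-1) / (2 * self.sigmas.view(1, 1, 1, -1) ** 2))
        phi = phi.reshape(B, H, W, -1).permute(0, 3, 1, 2)               # (B, K*S, H, W)
        return self.project(phi)                                          # back to (B, C, H, W)
```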
4. Algorithmic Steps for Scale-Adaptive Feature Modeling
MSGKAN's procedure for encoding scale variance at each spatial location can be summarized as follows (a minimal sketch follows the list):
- (a) Layer Normalization: Normalize $x_{i,j}$ across the $C$ channels to zero mean and unit variance.
- (b) Distance Computation: For each spatial location $(i,j)$, compute the Euclidean distances $\lVert x_{i,j} - c_k \rVert$ to the $K$ centers.
- (c) Multi-Scale RBF Expansion: Compute $\phi_{k,s}(x_{i,j})$ for each scale $\sigma_s$. Smaller $\sigma_s$ are sensitive to fine details; larger $\sigma_s$ accommodate broader, large-scale structures.
- (d) Weighted Summation: Multiply each $\phi_{k,s}(x_{i,j})$ by its corresponding learnable weight $w_{k,s}$ and sum across all center–scale pairs, enabling the model to emphasize particular scales for specific scenes.
- (e) Local Convolution: Project the resulting embedding back to output features of dimensionality $C$ using a local convolution, allowing spatial mixing and channel reweighting.
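Putting the five steps together, one plausible realization is sketched below. The placement of LayerNorm, the 3x3 local convolution, and the decision to sum the weighted responses over scales (keeping one channel per center before the convolution) are assumptions made for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MSGKANBlock(nn.Module):
    """Steps (a)-(e): LayerNorm -> distances -> multi-scale RBF -> weighted sum -> local conv."""
    def __init__(self, channels: int, num_centers: int = 8,
                 sigmas=(0.5, 1.0, 2.0), kernel_size: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(channels)                                # (a) channel-wise LN
        self.centers = nn.Parameter(torch.randn(num_centers, channels))   # learnable centers c_k
        self.register_buffer("sigmas", torch.tensor(sigmas))              # fixed width bank sigma_s
        self.weights = nn.Parameter(torch.ones(num_centers, len(sigmas))) # (d) w_{k,s}
        self.local_conv = nn.Conv2d(num_centers, channels, kernel_size,
                                    padding=kernel_size // 2)             # (e) spatial mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        feat = self.norm(x.permute(0, 2, 3, 1))                           # (a): (B, H, W, C)
        diff = feat.unsqueeze(3) - self.centers.view(1, 1, 1, -1, C)      # (B, H, W, K, C)
        dist2 = diff.pow(2).sum(-1)                                       # (b): (B, H, W, K)
        phi = torch.exp(-dist2.unsqueeze(-1)
                        / (2 * self.sigmas.view(1, 1, 1, 1, -1) ** 2))    # (c): (B, H, W, K, S)
        mixed = (phi * self.weights.view(1, 1, 1, *self.weights.shape)).sum(-1)  # (d): (B, H, W, K)
        return self.local_conv(mixed.permute(0, 3, 1, 2))                 # (e): (B, C, H, W)
```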
Because real-world scale variations in aerial imagery are linked to UAV altitude, a fixed bank of Gaussian widths provides adaptability to object size changes without downstream modifications to backbone feature map resolution.
5. Training Paradigm and Loss Integration
MSGKAN does not receive a separate, module-specific loss function but is optimized indirectly via losses imposed by the full SFFR detection architecture. Training employs standard multi-task detection objectives:
- Varifocal Loss for classification confidence (see the sketch after this list), in its standard form
$$\mathrm{VFL}(p, q) = \begin{cases} -\,q\left(q\log p + (1-q)\log(1-p)\right), & q > 0, \\ -\,\alpha\, p^{\gamma}\log(1-p), & q = 0, \end{cases}$$
where $p$ is the predicted classification score and $q$ the IoU-aware target (the IoU with the matched ground truth for positives, 0 for negatives), which adaptively emphasizes challenging positive detections.
- Box regression via a coordinate-distance term combined with IoU-style penalization.
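For reference, a common PyTorch formulation of the Varifocal Loss is sketched below; the IoU-aware target convention and the default hyperparameters (alpha = 0.75, gamma = 2.0) follow the standard formulation rather than values reported for SFFR.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits: torch.Tensor, target_scores: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """Varifocal Loss: IoU-aware BCE that emphasizes high-quality positives.

    pred_logits: raw classification logits.
    target_scores: IoU with the matched ground truth for positives, 0 for negatives.
    """
    pred = pred_logits.sigmoid()
    positive = (target_scores > 0).float()
    # Positives are weighted by their target (IoU) score; negatives are
    # down-weighted by alpha * p^gamma, as in focal-style losses.
    weight = positive * target_scores + (1.0 - positive) * alpha * pred.detach().pow(gamma)
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_scores, reduction="none")
    return (weight * bce).sum()
```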
The joint gradient flow from the classification and regression objectives drives the adaptation of the RBF centers $c_k$ and the mixing weights $w_{k,s}$, leading to learned specialization on dataset-specific scale statistics.
6. Empirical Performance and Scale Robustness
Comprehensive experimental validation on the SeaDronesSee dataset demonstrates that the inclusion of MSGKAN yields measurable improvements in object detection performance:
- Baseline (without MSGKAN): mAP of 61.1% and 31.4% under the two reported evaluation criteria.
- With MSGKAN: mAP of 62.2% (+1.1) and 32.2% (+0.8), respectively.
- A systematic sweep of the scale parameters $\sigma_s$ (Table V) revealed that an intermediate bank of scale widths achieves the best results, with mAP of 66.0% and 32.5%, outperforming both coarser and denser settings.
These findings substantiate the critical contribution of multi-scale Gaussian embeddings for dynamic scale adaptation and indicate the presence of a sweet-spot in the bank of scales for robust aerial object detection performance.
7. Operational Principles and Significance
MSGKAN exemplifies a KAN-inspired submodule that lifts each feature vector into a high-dimensional Gaussian RBF manifold, applies end-to-end-learned scale-emphasizing weights, and reprojects these representations back to a task-compatible feature space with a lightweight local convolution. This design confers robust, data-driven adaptability to object size changes caused by varying UAV altitudes, obviating the need for explicit multi-scale image resizing or architectural modifications in the backbone. Empirical evidence confirms that this approach improves accuracy and scale robustness on multispectral object detection benchmarks, rendering it directly applicable to real-world UAV perception pipelines for heterogeneous environments.