MKSNet: Multi-Kernel Selection for Small Object Detection
- The paper introduces a novel multi-kernel selection mechanism that dynamically aggregates multi-scale context for improved small object detection.
- It integrates dual attention modules (channel and spatial) to refine features and suppress background clutter effectively.
- Empirical evaluations on DOTA and HRSC2016 benchmarks show state-of-the-art performance, particularly for densely packed small objects.
The Multi-Kernel Selection Network (MKSNet) is a convolutional neural network architecture designed for advanced small object detection in high-resolution remote sensing imagery. It introduces a novel multi-kernel selection mechanism for dynamic context aggregation and integrates dual (channel and spatial) attention to enhance feature relevance and suppress background clutter. Empirical evaluations on prominent benchmarks establish MKSNet as state-of-the-art for this task, with particular strength on densely packed, small object categories (Zhang et al., 3 Dec 2025).
1. Architectural Composition and Inference Pipeline
MKSNet processes high-resolution input images (e.g., 6000×6000 pixels) through a structured pipeline that emphasizes multi-scale context extraction and attention-guided feature refinement. The pipeline sequence can be described as:
- Input Patch Embedding: Initial 3×3 stride-2 convolution with batch normalization and ReLU, halving image resolution and projecting to the network's initial channel dimension.
- Backbone (MKS Blocks): Four successive stages, analogous to ResNet's C2–C5, each comprising repeated multi-kernel selection (MKS) blocks. Stage output channels increase in the sequence {256, 512, 1024, 2048}, with stride-2 convolutional downsampling between stages.
- Feature Pyramid Network (FPN): Lateral 1×1 convolutions reduce each stage's output to 256 channels, enabling top-down cross-scale fusion and forming FPN pyramid levels P2–P5.
- Detection Head (Oriented-RCNN): Oriented region proposal network (RPN) operates over the FPN levels, producing rotated proposals. Rotated ROIAlign pools features for each proposal, with two-branch heads for classification (cross-entropy) and 5D oriented box regression (smooth L1 loss).
- Output: A set of oriented bounding boxes with categorical scores.
Pipeline (editor's term):
Input → patch-conv → [MKS block × n₁] → downsample → [MKS block × n₂] → ... → [MKS block × n₄] → FPN → Oriented RPN → Rotated ROIs → classification/regression heads → final detections.
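The following PyTorch sketch illustrates the stage ordering above (stem, four stages of blocks with stride-2 downsampling, FPN laterals with top-down fusion). It is a minimal skeleton under stated assumptions: `MKSBlock` is stubbed with a plain conv block, the stage depths and stem width are placeholders, and the Oriented-RCNN head is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MKSBackbone(nn.Module):
    """Sketch of the MKSNet stem + four MKS stages + FPN laterals (head omitted)."""

    def __init__(self, stage_channels=(256, 512, 1024, 2048), stage_depths=(2, 2, 2, 2)):
        super().__init__()
        # Input patch embedding: 3x3 stride-2 conv + BN + ReLU (halves resolution).
        self.stem = nn.Sequential(
            nn.Conv2d(3, stage_channels[0], 3, stride=2, padding=1),
            nn.BatchNorm2d(stage_channels[0]),
            nn.ReLU(inplace=True),
        )
        stages, in_ch = [], stage_channels[0]
        for out_ch, depth in zip(stage_channels, stage_depths):
            blocks = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)]  # stride-2 downsample
            blocks += [nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                     nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
                       for _ in range(depth)]  # placeholder standing in for repeated MKS blocks
            stages.append(nn.Sequential(*blocks))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)
        # FPN laterals: 1x1 convs projecting every stage output to 256 channels.
        self.laterals = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in stage_channels)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Top-down FPN fusion producing P2..P5.
        pyramid = [self.laterals[-1](feats[-1])]
        for f, lat in zip(reversed(feats[:-1]), reversed(self.laterals[:-1])):
            up = F.interpolate(pyramid[-1], size=f.shape[-2:], mode="nearest")
            pyramid.append(lat(f) + up)
        return list(reversed(pyramid))  # [P2, P3, P4, P5]

if __name__ == "__main__":
    p2, p3, p4, p5 = MKSBackbone()(torch.randn(1, 3, 512, 512))
```

In the full detector, the returned pyramid levels would feed the oriented RPN and rotated ROIAlign heads described above.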
2. Multi-Kernel Selection Block
The MKS block is the architectural core, delivering spatially adaptive multi-scale context aggregation via dynamic kernel selection and weighting.
2.1 Multi-Scale Convolutional Branch Construction
Given an input feature tensor $X \in \mathbb{R}^{C \times H \times W}$, MKSNet runs $N$ convolutional branches in parallel. For branch $i \in \{1, \dots, N\}$, the convolutional parameters are:
- Kernel size: $k_i$
- Dilation: $d_i$
- Padding: $p_i = d_i(k_i - 1)/2$, so that every branch preserves the spatial resolution of $X$ (for odd $k_i$ and unit stride).
Each branch computes
$$F_i = \phi\big(\mathrm{Conv}_{k_i \times k_i,\, d_i}(g(X))\big),$$
where $g(\cdot)$ is a channel-reducing 1×1 convolution and $\phi(\cdot)$ is a pointwise nonlinearity.
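A minimal PyTorch sketch of the parallel branch construction follows. The kernel-size/dilation pairs in `branch_cfg` and the ReLU nonlinearity are illustrative placeholders, since the paper's exact values are not reproduced above.

```python
import torch.nn as nn

class MultiKernelBranches(nn.Module):
    """Parallel dilated-conv branches over a channel-reduced input (sketch)."""

    def __init__(self, in_ch, mid_ch, branch_cfg=((3, 1), (3, 2), (5, 2))):
        super().__init__()
        # g(.): channel-reducing 1x1 convolution shared by all branches.
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.branches = nn.ModuleList()
        for k, d in branch_cfg:  # (kernel size k_i, dilation d_i)
            pad = d * (k - 1) // 2  # "same" padding so all branches keep H x W
            self.branches.append(nn.Conv2d(mid_ch, mid_ch, k, padding=pad, dilation=d))
        self.act = nn.ReLU(inplace=True)  # phi(.): pointwise nonlinearity

    def forward(self, x):
        x = self.reduce(x)
        # Returns [F_1, ..., F_N], all with identical spatial shape.
        return [self.act(branch(x)) for branch in self.branches]
```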
2.2 Adaptive Spatial Gating and Fusion
Branch outputs are concatenated to form $F = [F_1; \dots; F_N]$. Two summary spatial maps, $S_{\mathrm{avg}}$ and $S_{\max}$ (channel-pooled descriptors of $F$), are derived. A small convolution (e.g., 7×7 kernel) over their concatenation produces per-branch gating maps via sigmoid activation:
$$A_i = \sigma\big(\mathrm{Conv}_{7 \times 7}([S_{\mathrm{avg}}; S_{\max}])\big)_i.$$
Spatial branch-gated fusion yields
$$Y = \sum_{i=1}^{N} A_i \odot F_i.$$
Element-wise weighted summation lets each spatial location favor the kernel scale that best preserves local detail or captures the needed context. Gating is independent and sigmoidal per branch ($A_i \in (0,1)$), with no sum-to-one constraint across branches.
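A sketch of the gating and fusion step is given below. Using channel-wise average and max pooling as the two summary maps is an assumption consistent with the 7×7-conv-plus-sigmoid design, not a confirmed detail.

```python
import torch
import torch.nn as nn

class SpatialBranchGate(nn.Module):
    """Per-branch sigmoid gating maps from pooled spatial summaries (sketch)."""

    def __init__(self, num_branches, gate_kernel=7):
        super().__init__()
        # 2 input channels: the average and max summary maps.
        self.gate_conv = nn.Conv2d(2, num_branches, gate_kernel, padding=gate_kernel // 2)

    def forward(self, branch_feats):            # list of N tensors, each (B, C, H, W)
        f = torch.cat(branch_feats, dim=1)      # concatenated branch outputs
        s_avg = f.mean(dim=1, keepdim=True)     # assumed channel-wise average summary
        s_max = f.max(dim=1, keepdim=True).values  # assumed channel-wise max summary
        gates = torch.sigmoid(self.gate_conv(torch.cat([s_avg, s_max], dim=1)))
        # Independent sigmoid gates A_i in (0,1); no sum-to-one constraint.
        return sum(g * bf for g, bf in zip(gates.split(1, dim=1), branch_feats))
```

Composing this module with the `MultiKernelBranches` sketch above yields the body of one MKS block.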
2.3 Regularization and Detail Preservation
- All convolutional and gating weights use $\ell_2$ regularization (AdamW weight decay).
- Large kernels/dilations capture broad context, which mitigates clutter-based false positives.
- Gating weights upweight small kernels at fine-detailed locations, adaptively preserving spatial sharpness.
3. Dual Attention Integration
After each MKS block, dual attention is applied to the block's output: channel attention (CA) followed by spatial attention (SA).
3.1 Channel Attention (SE-Style)
Given a feature map $X \in \mathbb{R}^{C \times H \times W}$, global average pooling produces a channel descriptor $z \in \mathbb{R}^{C}$:
$$z_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} X_{c,h,w}.$$
Two fully connected (FC) layers with reduction ratio $r$ then produce per-channel weights that rescale $X$:
$$s = \sigma\big(W_2\, \delta(W_1 z)\big), \qquad X' = s \odot X,$$
with $W_1 \in \mathbb{R}^{(C/r) \times C}$, $W_2 \in \mathbb{R}^{C \times (C/r)}$, $\delta$ a ReLU, and $\sigma$ the sigmoid.
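A standard SE-style channel attention module matching this description is sketched below; the reduction ratio of 16 is a common default, not a confirmed value.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention (sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * s                                  # per-channel rescaling
```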
3.2 Spatial Attention (CBAM-Style)
On the channel-refined features $X'$, channel-wise average- and max-pooled maps are concatenated and passed through a 7×7 convolution with sigmoid activation to form a spatial mask:
$$M_s = \sigma\big(\mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}_c(X'); \mathrm{MaxPool}_c(X')])\big), \qquad X'' = M_s \odot X'.$$
Empirically, the CA→SA ordering was found optimal in MKSNet, producing better accuracy than the reverse sequence.
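A CBAM-style spatial attention sketch and the CA→SA composition are shown below; the 7×7 gate kernel is a standard default rather than a confirmed value, and `ChannelAttention` refers to the sketch in Section 3.1.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (sketch)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        s_avg = x.mean(dim=1, keepdim=True)            # channel-wise average map
        s_max = x.max(dim=1, keepdim=True).values      # channel-wise max map
        mask = torch.sigmoid(self.conv(torch.cat([s_avg, s_max], dim=1)))
        return x * mask

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention (the CA -> SA ordering)."""

    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)   # from the Section 3.1 sketch
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```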
4. Training Regimen and Implementation Protocol
MKSNet's efficacy was established on DOTA-v1.0 (15 categories, high-resolution) and HRSC2016 (maritime ship detection, ~3000 objects).
- Framework: PyTorch, Oriented-RCNN backbone.
- Optimizer: AdamW, learning rate , betas=(0.9, 0.999), weight_decay=0.05.
- Batch: 2 images/GPU × 3 GPUs = 6 images.
- Schedule: 300 epochs per dataset, cosine decay learning rate, linear warmup (5 epochs).
- Preprocessing: Random flips (horizontal/vertical), rotations, and color jitter.
- Losses: Cross-entropy for classification; smooth L1 for oriented box regression; no special small-object loss weighting (a minimal optimizer/schedule sketch follows this list).
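The sketch below assembles the reported optimization setup (AdamW with the listed betas and weight decay, cosine decay with a 5-epoch linear warmup). The model and the base learning rate are placeholders, as the reported rate is not reproduced above.

```python
import math
import torch

def build_optimizer_and_schedule(model, base_lr, epochs=300, warmup_epochs=5):
    """AdamW + linear warmup + cosine decay, stepped once per epoch (sketch)."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=base_lr, betas=(0.9, 0.999), weight_decay=0.05
    )

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                       # linear warmup
            return (epoch + 1) / warmup_epochs
        t = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * t))      # cosine decay toward zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```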
5. Empirical Evaluation and Ablation Analysis
Significant improvements were observed on standard small object detection benchmarks.
| Dataset | Model | mAP (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| DOTA-v1.0 | MKSNet | 78.77 | 40.7 | 181 |
| DOTA-v1.0 | O-RCNN (R50) | 76.12 | ā | ā |
Improvements for small object classes: small vehicle (SV, +4.8 pts), roundabout (RA, +2.9 pts), and basketball court (BC, +2.4 pts).
| Dataset | Epochs | MKSNet (%) | O-RCNN (%) |
|---|---|---|---|
| HRSC2016 | 150 | 71.95 | 71.33 |
| HRSC2016 | 300 | 84.31 | 83.89 |
Ablation Study on DOTA (mAP %):
| Base | +SA | +CA | Full (SA+CA) |
|---|---|---|---|
| 62.7 | 66.4 | 64.3 | 69.1 |
The inclusion of both spatial and channel attention (full block) yields the largest, synergistic gain (+6.4 mAP points over the base).
6. Strengths, Limitations, and Prospects
Strengths:
- Multi-kernel selection enables local spatial adaptation, balancing context and fine detail.
- Dual attention modules (CA+SA) suppress background clutter more effectively than either module alone.
- Relatively lightweight in parameter count and FLOPs versus other high-performing detectors.
- Demonstrated robustness for densely packed, small targets.
Limitations:
- Sigmoid gating does not enforce mutual exclusion across kernel branches (the per-branch gates $A_i$ are not constrained to sum to one); introducing softmax gating may induce sparser, more interpretable branch selection.
- Large kernels and dilation still incur computational overhead; efficient kernel sparsification is an open avenue.
Future Directions:
- Adopt learned softmax gating for convex combinations across branches (see the sketch after this list).
- Embed transformer-style self-attention within each branch.
- Explore dynamic inference-time scheduling of kernel sizes.
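To make the contrast with MKSNet's independent sigmoid gates concrete, the following hypothetical sketch normalizes branch logits with a softmax so each spatial location forms a convex combination of branches; it is an illustration of the suggested direction, not part of the published model.

```python
import torch
import torch.nn as nn

class SoftmaxBranchGate(nn.Module):
    """Hypothetical softmax gating: per-location convex combination of branches."""

    def __init__(self, num_branches, gate_kernel=7):
        super().__init__()
        self.gate_conv = nn.Conv2d(2, num_branches, gate_kernel, padding=gate_kernel // 2)

    def forward(self, branch_feats):                      # list of N tensors (B, C, H, W)
        f = torch.cat(branch_feats, dim=1)
        summary = torch.cat([f.mean(1, keepdim=True),
                             f.max(1, keepdim=True).values], dim=1)
        logits = self.gate_conv(summary)                  # (B, N, H, W)
        weights = torch.softmax(logits, dim=1)            # sums to one over branches
        return sum(w * bf for w, bf in zip(weights.split(1, dim=1), branch_feats))
```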
MKSNet integrates adaptive multi-scale kernel selection and dual attention in a modular backbone, establishing an effective paradigm for small object detection in complex remote-sensing imagery. The provided architectural description and equations suffice for implementation within contemporary frameworks, and empirical evidence attests to its performance advantages on challenging benchmarks (Zhang et al., 3 Dec 2025).