
MKSNet: Multi-Kernel Selection for Small Object Detection

  • The paper introduces a novel multi-kernel selection mechanism that dynamically aggregates multi-scale context for improved small object detection.
  • It integrates dual attention modules—channel and spatial—to refine features and suppress background clutter effectively.
  • Empirical evaluations on DOTA and HRSC2016 benchmarks show state-of-the-art performance, particularly for densely packed small objects.

The Multi-Kernel Selection Network (MKSNet) is a convolutional neural network architecture designed for advanced small object detection in high-resolution remote sensing imagery. It introduces a novel multi-kernel selection mechanism for dynamic context aggregation and integrates dual (channel and spatial) attention to enhance feature relevance and suppress background clutter. Empirical evaluations on prominent benchmarks establish MKSNet as state-of-the-art for this task, with particular strength on densely packed, small object categories (Zhang et al., 3 Dec 2025).

1. Architectural Composition and Inference Pipeline

MKSNet processes high-resolution input images (e.g., 6000Ɨ6000 pixels) through a structured pipeline that emphasizes multi-scale context extraction and attention-guided feature refinement. The pipeline sequence can be described as:

  1. Input Patch Embedding: Initial 3Ɨ3 stride-2 convolution with batch normalization and ReLU, halving image resolution and projecting to initial channel dimension $C_0$.
  2. Backbone (MKS Blocks): Four successive stages, analogous to ResNet's C2–C5, each comprising repeated multi-kernel selection (MKS) blocks. Stage output channels increase in the sequence {256, 512, 1024, 2048}, with stride-2 convolutional downsampling between stages.
  3. Feature Pyramid Network (FPN): Lateral 1Ɨ1 convolutions reduce each stage's output to 256 channels, enabling top-down cross-scale fusion, forming FPN pyramid levels P2–P5.
  4. Detection Head (Oriented-RCNN): Oriented region proposal network (RPN) operates over the FPN levels, producing rotated proposals. Rotated ROIAlign pools features for each proposal, with two-branch heads for classification (cross-entropy) and 5D oriented box regression (smooth L₁ loss).
  5. Output: A set of oriented bounding boxes with categorical scores.

Pipeline (editor’s term):

Input → patch-conv → [MKS block Ɨ n₁] → downsample → [MKS block Ɨ nā‚‚] → ... → [MKS block Ɨ nā‚„] → FPN → Oriented RPN → Rotated ROIs → classification/regression heads → final detections.
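
For orientation, below is a minimal PyTorch sketch of the stem and stage layout described above. The MKS block is stubbed out (it is detailed in Section 2), the FPN and Oriented-RCNN head are omitted, and the stem width, block counts per stage, and all module names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Step 1: 3x3 stride-2 convolution + BN + ReLU, halving resolution."""
    def __init__(self, in_ch=3, out_ch=64):  # out_ch plays the role of C_0 (assumed value)
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.proj(x)

class MKSBlockStub(nn.Module):
    """Placeholder for the multi-kernel selection block (see Section 2)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.body(x)

class MKSBackbone(nn.Module):
    """Four stages (C2-C5) of repeated MKS blocks with stride-2 downsampling between stages."""
    def __init__(self, stem_ch=64, stage_channels=(256, 512, 1024, 2048),
                 blocks_per_stage=(2, 2, 2, 2)):  # block counts n_1..n_4 are illustrative
        super().__init__()
        self.stem = PatchEmbed(3, stem_ch)
        stages, in_ch = [], stem_ch
        for out_ch, n in zip(stage_channels, blocks_per_stage):
            layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)]  # downsample + widen
            layers += [MKSBlockStub(out_ch) for _ in range(n)]
            stages.append(nn.Sequential(*layers))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # C2..C5 feature maps to be fed into the FPN
        return feats

feats = MKSBackbone()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])      # channel widths: 256, 512, 1024, 2048
```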

2. Multi-Kernel Selection Block

The MKS block is the architectural core, delivering spatially adaptive multi-scale context aggregation via dynamic kernel selection and weighting.

2.1 Multi-Scale Convolutional Branch Construction

Given an input feature tensor $F_{\mathrm{in}} \in \mathbb{R}^{C\times H\times W}$, MKSNet parallelizes $S$ convolutional branches. For branch $i$, the convolutional parameters are:

  • Kernel size: $k_i = \min(5 + 2i,\, k_{\max})$
  • Dilation: $d_i = i + 1$
  • Padding: $p_i = \frac{(k_i - 1)\, d_i}{2}$, for $i = 1, \ldots, S$

Each branch computes:

$$B_i = \mathrm{BN}\!\left(F_{\mathrm{in}} * K_{k_i \times k_i,\, d_i;\, p_i}\right), \quad T_i = \sigma\!\left(W_i^{1\times 1} * B_i\right) \in \mathbb{R}^{C/S \times H \times W}$$

where $W_i^{1\times 1}$ is a channel-reducing 1Ɨ1 convolution and $\sigma$ is a pointwise nonlinearity.
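
A minimal PyTorch sketch of this branch construction follows, assuming $S = 4$ branches, $k_{\max} = 11$, and ReLU as the pointwise nonlinearity $\sigma$; these settings and the module name are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Parallel dilated-convolution branches (Section 2.1).

    Branch i (1-indexed) uses kernel k_i = min(5 + 2i, k_max), dilation
    d_i = i + 1, and "same" padding p_i = (k_i - 1) * d_i // 2, followed by
    BN and a 1x1 channel-reducing convolution to C/S channels.
    """
    def __init__(self, channels, num_branches=4, k_max=11):
        super().__init__()
        assert channels % num_branches == 0, "C must be divisible by S"
        self.branches = nn.ModuleList()
        for i in range(1, num_branches + 1):
            k = min(5 + 2 * i, k_max)
            d = i + 1
            p = (k - 1) * d // 2
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=p, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.Conv2d(channels, channels // num_branches, kernel_size=1),
                nn.ReLU(inplace=True),   # pointwise nonlinearity sigma (assumed ReLU)
            ))

    def forward(self, x):
        # Returns the list {T_i}, each of shape (B, C/S, H, W).
        return [branch(x) for branch in self.branches]

ts = MultiScaleBranches(channels=64, num_branches=4)(torch.randn(1, 64, 32, 32))
print([t.shape for t in ts])   # four tensors of shape (1, 16, 32, 32)
```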

2.2 Adaptive Spatial Gating and Fusion

Branch outputs $\{T_i\}_{i=1}^{S}$ are concatenated to form $T \in \mathbb{R}^{C\times H\times W}$. Two summary spatial maps are derived:

$$M_{\mathrm{avg}}(x,y) = \frac{1}{C}\sum_{c=1}^{C} T_{c,x,y}, \quad M_{\mathrm{max}}(x,y) = \max_{c=1,\ldots,C} T_{c,x,y}$$

$$M = \mathrm{Concat}(M_{\mathrm{avg}}, M_{\mathrm{max}}) \in \mathbb{R}^{2\times H\times W}$$

A small convolution $f^{2\to S}$ (e.g., 7Ɨ7 kernel) produces gating maps via sigmoid activation:

$$\alpha = \sigma\!\left(f^{2\to S}(M)\right) \in \mathbb{R}^{S\times H\times W}$$

Spatial branch-gated fusion yields:

$$P = \sum_{i=1}^{S} \alpha_i \odot T_i, \quad F_{\mathrm{out}} = F_{\mathrm{in}} \odot \left(W^{1\times 1} * P\right)$$

Element-wise weighted summation lets each spatial location favor the kernel scale that best preserves local detail or captures surrounding context. Independent sigmoidal gating is applied per branch ($\alpha_i(x,y) \in (0,1)$), with no sum-to-one constraint across branches.
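
The gating and fusion step can be sketched as follows, assuming branch outputs of $C/S$ channels each (as produced by the sketch above) and a 7Ɨ7 gating convolution for $f^{2\to S}$; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SpatialKernelGate(nn.Module):
    """Adaptive spatial gating and fusion (Section 2.2).

    Channel-wise average and max maps of the concatenated branch outputs are
    fed to a small conv f^{2->S}; a sigmoid gives one gate per branch and per
    pixel, and the gated branches are fused and modulated back onto F_in.
    """
    def __init__(self, channels, num_branches, gate_kernel=7):
        super().__init__()
        self.s = num_branches
        self.gate_conv = nn.Conv2d(2, num_branches, gate_kernel,
                                   padding=gate_kernel // 2)          # f^{2->S}
        self.fuse = nn.Conv2d(channels // num_branches, channels, 1)  # W^{1x1}

    def forward(self, f_in, branch_outputs):
        t = torch.cat(branch_outputs, dim=1)                 # (B, C, H, W)
        m_avg = t.mean(dim=1, keepdim=True)                  # M_avg
        m_max = t.max(dim=1, keepdim=True).values            # M_max
        alpha = torch.sigmoid(self.gate_conv(torch.cat([m_avg, m_max], dim=1)))
        # Per-pixel, per-branch sigmoid gates; no sum-to-one constraint.
        p = sum(alpha[:, i:i + 1] * branch_outputs[i] for i in range(self.s))
        return f_in * self.fuse(p)                           # F_out
```

Composing MultiScaleBranches with SpatialKernelGate in this way gives the core of a single MKS block under the stated assumptions.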

2.3 Regularization and Detail Preservation

  • All convolutional and gating weights use ā„“ā‚‚ regularization (AdamW weight decay).
  • Large kernels and dilations capture broad context, mitigating clutter-based false positives.
  • Gating weights upweight small kernels at fine-detailed locations, adaptively preserving spatial sharpness.

3. Dual Attention Integration

After each MKS block output, dual attention is applied: channel attention (CA) followed by spatial attention (SA).

3.1 Channel Attention (SE-Style)

Given $F \in \mathbb{R}^{B\times C\times H\times W}$:

$$A = \mathrm{AvgPool}(F) \in \mathbb{R}^{B\times C}, \quad M = \mathrm{MaxPool}(F) \in \mathbb{R}^{B\times C}$$

Two fully connected (FC) layers with reduction ratio $r$:

$$\tilde{A} = \delta(W_1 A), \quad \tilde{M} = \delta(W_1 M); \quad \tilde{O} = W_2\!\left(\frac{\tilde{A} + \tilde{M}}{2}\right)$$

$$M_c = \sigma(\tilde{O}), \quad F' = F \odot M_c$$

where $M_c$ is reshaped to $B\times C\times 1\times 1$ and broadcast over the spatial dimensions.
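
A minimal sketch of this channel-attention module, assuming $\delta$ is ReLU, a shared $W_1$ for both pooled vectors (as the equations suggest), and a reduction ratio $r = 16$; these defaults are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention (Section 3.1) with reduction ratio r."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // r)    # shared squeeze FC (W_1)
        self.w2 = nn.Linear(channels // r, channels)    # excitation FC (W_2)
        self.delta = nn.ReLU(inplace=True)              # delta, assumed ReLU

    def forward(self, f):
        b, c, _, _ = f.shape
        a = f.mean(dim=(2, 3))                          # AvgPool -> (B, C)
        m = f.amax(dim=(2, 3))                          # MaxPool -> (B, C)
        a_tilde = self.delta(self.w1(a))
        m_tilde = self.delta(self.w1(m))
        o_tilde = self.w2((a_tilde + m_tilde) / 2)      # average, then W_2
        m_c = torch.sigmoid(o_tilde).view(b, c, 1, 1)   # per-channel gate M_c
        return f * m_c                                  # F' = F ⊙ M_c
```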

3.2 Spatial Attention (CBAM-Style)

On the channel-refined features $F'$:

$$M_s = \sigma\!\left(f^{7\times 7}\!\left([\mathrm{AvgPool}(F'),\, \mathrm{MaxPool}(F')]\right)\right) \in \mathbb{R}^{1\times H\times W}$$

$$F'' = F' \odot M_s$$

Empirically, the CA→SA ordering was found optimal in MKSNet, producing better accuracy than the reverse sequence.
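
A matching sketch of the spatial-attention module and the CA→SA composition follows; the 7Ɨ7 kernel comes from the equation above, the wrapper name is illustrative, and the channel-attention module is assumed to be supplied (e.g., the ChannelAttention sketch in Section 3.1).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (Section 3.2) with a 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg_map = f.mean(dim=1, keepdim=True)           # channel-wise average
        max_map = f.amax(dim=1, keepdim=True)           # channel-wise max
        m_s = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return f * m_s                                   # F'' = F' ⊙ M_s

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention (the CA -> SA order
    reported as optimal for MKSNet)."""
    def __init__(self, channel_attention, kernel_size=7):
        super().__init__()
        self.ca = channel_attention                      # e.g. ChannelAttention(channels)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, f):
        return self.sa(self.ca(f))
```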

4. Training Regimen and Implementation Protocol

MKSNet's efficacy was established on DOTA-v1.0 (15 categories, high-resolution) and HRSC2016 (maritime ship detection, ~3000 objects).

  • Framework: PyTorch, Oriented-RCNN backbone.
  • Optimizer: AdamW, learning rate $4 \times 10^{-4}$, betas = (0.9, 0.999), weight_decay = 0.05 (see the configuration sketch after this list).
  • Batch: 2 images/GPU Ɨ 3 GPUs = 6 images.
  • Schedule: 300 epochs per dataset, cosine decay learning rate, linear warmup (5 epochs).
  • Preprocessing: Random flips (horizontal/vertical), rotations of $\{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}$, color jitter.
  • Losses: Cross-entropy for classification; smooth L₁ for oriented box regression; no special small-object loss weighting.
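
The optimizer and schedule above can be reproduced roughly as follows; the warmup start factor and the per-epoch scheduler stepping granularity are assumptions not specified in the source.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer_and_schedule(model, epochs=300, warmup_epochs=5):
    """AdamW + 5-epoch linear warmup + cosine decay, per the reported settings."""
    optimizer = AdamW(model.parameters(), lr=4e-4,
                      betas=(0.9, 0.999), weight_decay=0.05)
    warmup = LinearLR(optimizer, start_factor=0.01,        # start factor assumed
                      total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    schedule = SequentialLR(optimizer, [warmup, cosine],
                            milestones=[warmup_epochs])
    return optimizer, schedule

# Usage: step the scheduler once per epoch, training on 6-image global batches
# (2 images/GPU across 3 GPUs).
```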

5. Empirical Evaluation and Ablation Analysis

Significant improvements were observed on standard small object detection benchmarks.

Dataset      Model          mAP (%)   Params (M)   FLOPs (G)
DOTA-v1.0    MKSNet         78.77     40.7         181
DOTA-v1.0    O-RCNN (R50)   76.12     —            —

Improvements for small object classes: small vehicle (SV, +4.8 pts), roundabout (RA, +2.9 pts), and basketball court (BC, +2.4 pts).

Dataset      Epochs   MKSNet (%)   O-RCNN (%)
HRSC2016     150      71.95        71.33
HRSC2016     300      84.31        83.89

Ablation Study on DOTA (mAP %):

Base    +SA     +CA     Full (SA+CA)
62.7    66.4    64.3    69.1

The inclusion of both spatial and channel attention (the full block) accounts for the largest, synergistic gain: +6.4 mAP points over the base, exceeding the sum of the individual +SA and +CA gains.

6. Strengths, Limitations, and Prospects

Strengths:

  • Multi-kernel selection enables local spatial adaptation, balancing context and fine detail.
  • Dual attention modules (CA+SA) suppress background clutter more effectively than either attention module alone (cf. the ablation in Section 5).
  • Relatively lightweight in parameter count and FLOPs versus other high-performing detectors.
  • Demonstrated robustness for densely packed, small targets.

Limitations:

  • Sigmoid gating does not enforce mutual exclusion across kernel branches ($\sum_i \alpha_i$ is unconstrained); introducing softmax gating may induce sparser, more interpretable branch selection (see the sketch after this list).
  • Large kernels and dilation still incur computational overhead; efficient kernel sparsification is an open avenue.
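
As a point of comparison for the gating limitation noted above, a softmax across the $S$ branch logits at each pixel would yield convex (sum-to-one) combination weights; the snippet below is a hypothetical alternative, not part of MKSNet.

```python
import torch

def softmax_gated_fusion(gate_logits, branch_outputs):
    """Alternative gating: softmax over the S branch logits at every pixel
    gives convex combination weights, unlike MKSNet's independent sigmoid
    gates. Illustrative only.

    gate_logits:    (B, S, H, W) raw outputs of f^{2->S} (no sigmoid applied)
    branch_outputs: list of S tensors, each (B, C/S, H, W)
    """
    alpha = torch.softmax(gate_logits, dim=1)             # sums to 1 over branches
    return sum(alpha[:, i:i + 1] * t for i, t in enumerate(branch_outputs))
```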

Future Directions:

  • Adopt learned softmax gating for convex combinations.
  • Embed transformer-style self-attention within each branch.
  • Explore dynamic inference-time scheduling of kernel sizes.

MKSNet integrates adaptive multi-scale kernel selection and dual attention in a modular backbone, establishing an effective paradigm for small object detection in complex remote-sensing imagery. The provided architectural description and equations suffice for implementation within contemporary frameworks, and empirical evidence attests to its performance advantages on challenging benchmarks (Zhang et al., 3 Dec 2025).
