
MKSNet: Multi-Kernel Selection for Small Object Detection

  • The paper introduces a novel multi-kernel selection mechanism that dynamically aggregates multi-scale context for improved small object detection.
  • It integrates dual attention modules—channel and spatial—to refine features and suppress background clutter effectively.
  • Empirical evaluations on DOTA and HRSC2016 benchmarks show state-of-the-art performance, particularly for densely packed small objects.

The Multi-Kernel Selection Network (MKSNet) is a convolutional neural network architecture designed for advanced small object detection in high-resolution remote sensing imagery. It introduces a novel multi-kernel selection mechanism for dynamic context aggregation and integrates dual (channel and spatial) attention to enhance feature relevance and suppress background clutter. Empirical evaluations on prominent benchmarks establish MKSNet as state-of-the-art for this task, with particular strength on densely packed, small object categories (Zhang et al., 3 Dec 2025).

1. Architectural Composition and Inference Pipeline

MKSNet processes high-resolution input images (e.g., 6000Ɨ6000 pixels) through a structured pipeline that emphasizes multi-scale context extraction and attention-guided feature refinement. The pipeline sequence can be described as:

  1. Input Patch Embedding: Initial 3Ɨ3 stride-2 convolution with batch normalization and ReLU, halving image resolution and projecting to initial channel dimension $C_0$.
  2. Backbone (MKS Blocks): Four successive stages, analogous to ResNet's C2–C5, each comprising repeated multi-kernel selection (MKS) blocks. Stage output channels increase in the sequence {256, 512, 1024, 2048}, with stride-2 convolutional downsampling between stages.
  3. Feature Pyramid Network (FPN): Lateral 1Ɨ1 convolutions reduce each stage's output to 256 channels, enabling top-down cross-scale fusion, forming FPN pyramid levels P2–P5.
  4. Detection Head (Oriented-RCNN): Oriented region proposal network (RPN) operates over the FPN levels, producing rotated proposals. Rotated ROIAlign pools features for each proposal, with two-branch heads for classification (cross-entropy) and 5D oriented box regression (smooth L₁ loss).
  5. Output: A set of oriented bounding boxes with categorical scores.

Pipeline (editor’s term):

Input → patch-conv → [MKS block Ɨ n₁] → downsample → [MKS block Ɨ nā‚‚] → ... → [MKS block Ɨ nā‚„] → FPN → Oriented RPN → Rotated ROIs → classification/regression heads → final detections.
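
For orientation, below is a minimal PyTorch sketch of the stem and stage layout described above. The MKS block is stubbed out (it is detailed in Section 2), the FPN and Oriented-RCNN head are omitted, and the stem width, block counts per stage, and all module names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Step 1: 3x3 stride-2 convolution + BN + ReLU, halving resolution."""
    def __init__(self, in_ch=3, out_ch=64):  # out_ch plays the role of C_0 (assumed value)
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.proj(x)

class MKSBlockStub(nn.Module):
    """Placeholder for the multi-kernel selection block (see Section 2)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.body(x)

class MKSBackbone(nn.Module):
    """Four stages (C2-C5) of repeated MKS blocks with stride-2 downsampling between stages."""
    def __init__(self, stem_ch=64, stage_channels=(256, 512, 1024, 2048),
                 blocks_per_stage=(2, 2, 2, 2)):  # block counts n_1..n_4 are illustrative
        super().__init__()
        self.stem = PatchEmbed(3, stem_ch)
        stages, in_ch = [], stem_ch
        for out_ch, n in zip(stage_channels, blocks_per_stage):
            layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)]  # downsample + widen
            layers += [MKSBlockStub(out_ch) for _ in range(n)]
            stages.append(nn.Sequential(*layers))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # C2..C5 feature maps to be fed into the FPN
        return feats

feats = MKSBackbone()(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])      # channel widths: 256, 512, 1024, 2048
```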

2. Multi-Kernel Selection Block

The MKS block is the architectural core, delivering spatially adaptive multi-scale context aggregation via dynamic kernel selection and weighting.

2.1 Multi-Scale Convolutional Branch Construction

Given an input feature tensor $F_{\mathrm{in}} \in \mathbb{R}^{C\times H\times W}$, MKSNet parallelizes $S$ convolutional branches. For branch $i$, the convolutional parameters are:

  • Kernel size: $k_i = \min(5 + 2i,\, k_{\max})$
  • Dilation: $d_i = i + 1$
  • Padding: $p_i = \frac{(k_i - 1)\, d_i}{2}$, for $i = 1, \ldots, S$

Each branch computes:

$$B_i = \mathrm{BN}\!\left(F_{\mathrm{in}} * K_{k_i \times k_i,\, d_i;\, p_i}\right), \quad T_i = \sigma\!\left(W_i^{1\times 1} * B_i\right) \in \mathbb{R}^{C/S \times H \times W}$$

where $W_i^{1\times 1}$ is a channel-reducing 1Ɨ1 convolution and $\sigma$ is a pointwise nonlinearity.
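
A minimal PyTorch sketch of this branch construction follows, assuming $S = 4$ branches, $k_{\max} = 11$, and ReLU as the pointwise nonlinearity $\sigma$; these settings and the module name are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleBranches(nn.Module):
    """Parallel dilated-convolution branches (Section 2.1).

    Branch i (1-indexed) uses kernel k_i = min(5 + 2i, k_max), dilation
    d_i = i + 1, and "same" padding p_i = (k_i - 1) * d_i // 2, followed by
    BN and a 1x1 channel-reducing convolution to C/S channels.
    """
    def __init__(self, channels, num_branches=4, k_max=11):
        super().__init__()
        assert channels % num_branches == 0, "C must be divisible by S"
        self.branches = nn.ModuleList()
        for i in range(1, num_branches + 1):
            k = min(5 + 2 * i, k_max)
            d = i + 1
            p = (k - 1) * d // 2
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=p, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.Conv2d(channels, channels // num_branches, kernel_size=1),
                nn.ReLU(inplace=True),   # pointwise nonlinearity sigma (assumed ReLU)
            ))

    def forward(self, x):
        # Returns the list {T_i}, each of shape (B, C/S, H, W).
        return [branch(x) for branch in self.branches]

ts = MultiScaleBranches(channels=64, num_branches=4)(torch.randn(1, 64, 32, 32))
print([t.shape for t in ts])   # four tensors of shape (1, 16, 32, 32)
```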

2.2 Adaptive Spatial Gating and Fusion

Branch outputs $\{T_i\}_{i=1}^{S}$ are concatenated to form $T \in \mathbb{R}^{C\times H\times W}$. Two summary spatial maps are derived:

$$M_{\mathrm{avg}}(x,y) = \frac{1}{C}\sum_{c=1}^{C} T_{c,x,y}, \quad M_{\mathrm{max}}(x,y) = \max_{c=1,\ldots,C} T_{c,x,y}$$

$$M = \mathrm{Concat}(M_{\mathrm{avg}}, M_{\mathrm{max}}) \in \mathbb{R}^{2\times H\times W}$$

A small convolution $f^{2\to S}$ (e.g., 7Ɨ7 kernel) produces gating maps via sigmoid activation:

$$\alpha = \sigma\!\left(f^{2\to S}(M)\right) \in \mathbb{R}^{S\times H\times W}$$

Spatial branch-gated fusion yields:

$$P = \sum_{i=1}^{S} \alpha_i \odot T_i, \quad F_{\mathrm{out}} = F_{\mathrm{in}} \odot \left(W^{1\times 1} * P\right)$$

Element-wise weighted summation lets each spatial location favor the kernel scale that best preserves local detail or captures surrounding context. Independent sigmoidal gating is applied per branch ($\alpha_i(x,y) \in (0,1)$), with no sum-to-one constraint across branches.
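
The gating and fusion step can be sketched as follows, assuming branch outputs of $C/S$ channels each (as produced by the sketch above) and a 7Ɨ7 gating convolution for $f^{2\to S}$; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SpatialKernelGate(nn.Module):
    """Adaptive spatial gating and fusion (Section 2.2).

    Channel-wise average and max maps of the concatenated branch outputs are
    fed to a small conv f^{2->S}; a sigmoid gives one gate per branch and per
    pixel, and the gated branches are fused and modulated back onto F_in.
    """
    def __init__(self, channels, num_branches, gate_kernel=7):
        super().__init__()
        self.s = num_branches
        self.gate_conv = nn.Conv2d(2, num_branches, gate_kernel,
                                   padding=gate_kernel // 2)          # f^{2->S}
        self.fuse = nn.Conv2d(channels // num_branches, channels, 1)  # W^{1x1}

    def forward(self, f_in, branch_outputs):
        t = torch.cat(branch_outputs, dim=1)                 # (B, C, H, W)
        m_avg = t.mean(dim=1, keepdim=True)                  # M_avg
        m_max = t.max(dim=1, keepdim=True).values            # M_max
        alpha = torch.sigmoid(self.gate_conv(torch.cat([m_avg, m_max], dim=1)))
        # Per-pixel, per-branch sigmoid gates; no sum-to-one constraint.
        p = sum(alpha[:, i:i + 1] * branch_outputs[i] for i in range(self.s))
        return f_in * self.fuse(p)                           # F_out
```

Composing MultiScaleBranches with SpatialKernelGate in this way gives the core of a single MKS block under the stated assumptions.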

2.3 Regularization and Detail Preservation

  • All convolutional and gating weights use ā„“ā‚‚ regularization (AdamW weight decay).
  • Large kernels and dilations capture broad context, mitigating clutter-based false positives.
  • Gating weights upweight small kernels at fine-detailed locations, adaptively preserving spatial sharpness.

3. Dual Attention Integration

After each MKS block output, dual attention is applied: channel attention (CA) followed by spatial attention (SA).

3.1 Channel Attention (SE-Style)

Given $F \in \mathbb{R}^{B\times C\times H\times W}$:

$$A = \mathrm{AvgPool}(F) \in \mathbb{R}^{B\times C}, \quad M = \mathrm{MaxPool}(F) \in \mathbb{R}^{B\times C}$$

Two fully connected (FC) layers with reduction ratio $r$:

$$\tilde{A} = \delta(W_1 A), \quad \tilde{M} = \delta(W_1 M); \quad \tilde{O} = W_2\!\left(\frac{\tilde{A} + \tilde{M}}{2}\right)$$

$$M_c = \sigma(\tilde{O}), \quad F' = F \odot M_c$$

where $M_c$ is reshaped to $B\times C\times 1\times 1$ and broadcast over the spatial dimensions.
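
A minimal sketch of this channel-attention module, assuming $\delta$ is ReLU, a shared $W_1$ for both pooled vectors (as the equations suggest), and a reduction ratio $r = 16$; these defaults are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention (Section 3.1) with reduction ratio r."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // r)    # shared squeeze FC (W_1)
        self.w2 = nn.Linear(channels // r, channels)    # excitation FC (W_2)
        self.delta = nn.ReLU(inplace=True)              # delta, assumed ReLU

    def forward(self, f):
        b, c, _, _ = f.shape
        a = f.mean(dim=(2, 3))                          # AvgPool -> (B, C)
        m = f.amax(dim=(2, 3))                          # MaxPool -> (B, C)
        a_tilde = self.delta(self.w1(a))
        m_tilde = self.delta(self.w1(m))
        o_tilde = self.w2((a_tilde + m_tilde) / 2)      # average, then W_2
        m_c = torch.sigmoid(o_tilde).view(b, c, 1, 1)   # per-channel gate M_c
        return f * m_c                                  # F' = F ⊙ M_c
```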

3.2 Spatial Attention (CBAM-Style)

On the channel-refined features $F'$:

$$M_s = \sigma\!\left(f^{7\times 7}\!\left([\mathrm{AvgPool}(F'),\, \mathrm{MaxPool}(F')]\right)\right) \in \mathbb{R}^{1\times H\times W}$$

$$F'' = F' \odot M_s$$

Empirically, the CA→SA ordering was found optimal in MKSNet, producing better accuracy than the reverse sequence.
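
A matching sketch of the spatial-attention module and the CA→SA composition follows; the 7Ɨ7 kernel comes from the equation above, the wrapper name is illustrative, and the channel-attention module is assumed to be supplied (e.g., the ChannelAttention sketch in Section 3.1).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (Section 3.2) with a 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):
        avg_map = f.mean(dim=1, keepdim=True)           # channel-wise average
        max_map = f.amax(dim=1, keepdim=True)           # channel-wise max
        m_s = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return f * m_s                                   # F'' = F' ⊙ M_s

class DualAttention(nn.Module):
    """Channel attention followed by spatial attention (the CA -> SA order
    reported as optimal for MKSNet)."""
    def __init__(self, channel_attention, kernel_size=7):
        super().__init__()
        self.ca = channel_attention                      # e.g. ChannelAttention(channels)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, f):
        return self.sa(self.ca(f))
```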

4. Training Regimen and Implementation Protocol

MKSNet's efficacy was established on DOTA-v1.0 (15 categories, high-resolution) and HRSC2016 (maritime ship detection, ~3000 objects).

  • Framework: PyTorch, Oriented-RCNN backbone.
  • Optimizer: AdamW, learning rate $4 \times 10^{-4}$, betas = (0.9, 0.999), weight_decay = 0.05 (see the configuration sketch after this list).
  • Batch: 2 images/GPU Ɨ 3 GPUs = 6 images.
  • Schedule: 300 epochs per dataset, cosine decay learning rate, linear warmup (5 epochs).
  • Preprocessing: Random flips (horizontal/vertical), rotations of $\{0^{\circ}, 90^{\circ}, 180^{\circ}, 270^{\circ}\}$, color jitter.
  • Losses: Cross-entropy for classification; smooth L₁ for oriented box regression; no special small-object loss weighting.
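
The optimizer and schedule above can be reproduced roughly as follows; the warmup start factor and the per-epoch scheduler stepping granularity are assumptions not specified in the source.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer_and_schedule(model, epochs=300, warmup_epochs=5):
    """AdamW + 5-epoch linear warmup + cosine decay, per the reported settings."""
    optimizer = AdamW(model.parameters(), lr=4e-4,
                      betas=(0.9, 0.999), weight_decay=0.05)
    warmup = LinearLR(optimizer, start_factor=0.01,        # start factor assumed
                      total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    schedule = SequentialLR(optimizer, [warmup, cosine],
                            milestones=[warmup_epochs])
    return optimizer, schedule

# Usage: step the scheduler once per epoch, training on 6-image global batches
# (2 images/GPU across 3 GPUs).
```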

5. Empirical Evaluation and Ablation Analysis

Significant improvements were observed on standard small object detection benchmarks.

Dataset      Model          mAP (%)   Params (M)   FLOPs (G)
DOTA-v1.0    MKSNet         78.77     40.7         181
DOTA-v1.0    O-RCNN (R50)   76.12     —            —

Improvements for small object classes: small vehicle (SV, +4.8 pts), roundabout (RA, +2.9 pts), and basketball court (BC, +2.4 pts).

Dataset      Epochs   MKSNet (%)   O-RCNN (%)
HRSC2016     150      71.95        71.33
HRSC2016     300      84.31        83.89

Ablation Study on DOTA (mAP %):

Base    +SA     +CA     Full (SA+CA)
62.7    66.4    64.3    69.1

The inclusion of both spatial and channel attention (the full block) accounts for the largest, synergistic gain: +6.4 mAP points over the base, exceeding the sum of the individual +SA and +CA gains.

6. Strengths, Limitations, and Prospects

Strengths:

  • Multi-kernel selection enables local spatial adaptation, balancing context and fine detail.
  • Dual attention modules (CA+SA) suppress background clutter more effectively than either attention module alone (cf. the ablation in Section 5).
  • Relatively lightweight in parameter count and FLOPs versus other high-performing detectors.
  • Demonstrated robustness for densely packed, small targets.

Limitations:

  • Sigmoid gating does not enforce mutual exclusion across kernel branches ($\sum_i \alpha_i$ is unconstrained); introducing softmax gating may induce sparser, more interpretable branch selection (see the sketch after this list).
  • Large kernels and dilation still incur computational overhead; efficient kernel sparsification is an open avenue.
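
As a point of comparison for the gating limitation noted above, a softmax across the $S$ branch logits at each pixel would yield convex (sum-to-one) combination weights; the snippet below is a hypothetical alternative, not part of MKSNet.

```python
import torch

def softmax_gated_fusion(gate_logits, branch_outputs):
    """Alternative gating: softmax over the S branch logits at every pixel
    gives convex combination weights, unlike MKSNet's independent sigmoid
    gates. Illustrative only.

    gate_logits:    (B, S, H, W) raw outputs of f^{2->S} (no sigmoid applied)
    branch_outputs: list of S tensors, each (B, C/S, H, W)
    """
    alpha = torch.softmax(gate_logits, dim=1)             # sums to 1 over branches
    return sum(alpha[:, i:i + 1] * t for i, t in enumerate(branch_outputs))
```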

Future Directions:

  • Adopt learned softmax gating for convex combinations.
  • Embed transformer-style self-attention within each branch.
  • Explore dynamic inference-time scheduling of kernel sizes.

MKSNet integrates adaptive multi-scale kernel selection and dual attention in a modular backbone, establishing an effective paradigm for small object detection in complex remote-sensing imagery. The provided architectural description and equations suffice for implementation within contemporary frameworks, and empirical evidence attests to its performance advantages on challenging benchmarks (Zhang et al., 3 Dec 2025).
