MS-RPN: Multi-Scale Region Proposal Networks
- The paper introduces MS-RPN with multi-stage feature exploitation, enhancing proposal accuracy especially for small objects.
- MS-RPN uses independent proposal heads across different backbone levels to merge detailed spatial and semantic information.
- Performance evaluations show notable improvements in recall and average precision over single-scale RPNs on benchmarks like PASCAL VOC and COCO.
A Multi-Scale Region Proposal Network (MS-RPN) is a class of object proposal architectures designed to overcome the limitations of single-scale RPNs in object detection pipelines. By leveraging feature maps from multiple backbone depths, MS-RPNs improve recall, especially for small objects, and sharpen localization. MS-RPN variants have been widely adopted in detection architectures for tasks ranging from general object detection benchmarks (PASCAL VOC, COCO) to domain-specific applications such as hand detection. They are frequently contrasted with other hierarchical feature fusion methods and represent a distinct approach from cascaded or purely single-scale proposal generators (Lu et al., 2018; Kong et al., 2016).
1. Architectural Principle
The core innovation of MS-RPNs lies in their multi-stage exploitation of pyramidally arranged backbone feature maps. In canonical form, an MS-RPN attaches independent region proposal heads to feature maps arising at different semantic depths. For instance, the architecture in "Multi-scale prediction for robust hand detection and classification" places RPN heads on the C3, C4, and C5 stages of ResNet-101, with the à-trous (dilated convolution) trick used to preserve spatial resolution in the deepest stage. Each head applies a 3×3 convolutional block followed by sibling 1×1 convolutions for classification and regression at its feature level (Lu et al., 2018).
Distinct levels allow the proposal generator to cover a wider spectrum of object sizes: shallow layers provide fine spatial granularity for small-scale objects, while deeper ones supply robust semantic context essential for larger targets. Inference is unified by fusing the per-level outputs, often via upsampling and elementwise summation, yielding a single, high-resolution map from which proposals are ultimately decoded.
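The late-fusion step described above can be sketched in pure Python. This is a minimal illustration only: nearest-neighbor 2× upsampling on plain lists stands in for whatever upsampling an implementation actually uses, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D score map (list of lists)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

def fuse_levels(coarse, fine):
    """Upsample the coarser map and add it elementwise to the finer one,
    mirroring the upsample-then-sum fusion of per-level RPN outputs."""
    up = upsample2x(coarse)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, fine)]
```

In a real network the summed maps would be per-anchor objectness logits; fusing at the finest stride lets small objects be decoded at high spatial resolution.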
2. Mathematical Formulation
At each pyramid level $l$, with feature map $F_l$ at that level's spatial resolution, the MS-RPN applies the following operations:
- A 3×3 convolution to produce a 512-dimensional feature.
- Two parallel 1×1 convolutions:
- Classification (objectness): outputting $2A$ channels, where $A$ is the number of anchors per location.
- Regression: outputting $4A$ channels (4 offsets per anchor).
Anchors at each spatial location span multiple scales and aspect ratios (typically $A = 9$: 3 scales × 3 ratios). For each level, raw classification and regression maps are produced and subsequently upsampled and fused to the finest spatial resolution (stride = 4).
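A sketch of anchor generation at one spatial location, assuming the common convention that each scale gives the square root of the anchor area and ratio is height over width; the scale values shown are illustrative defaults, not the paper's:

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate A = len(scales) * len(ratios) anchors as (x1, y1, x2, y2)
    boxes centred at (cx, cy); area = scale**2, ratio = height / width."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)  # width shrinks as the ratio grows
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

With 3 scales and 3 ratios this yields the $A = 9$ anchors per location assumed in the channel counts above.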
Bounding box regression is parameterized, in the standard Faster R-CNN form, by
$t_x = (x - x_a)/w_a$, $t_y = (y - y_a)/h_a$, $t_w = \log(w/w_a)$, $t_h = \log(h/h_a)$,
where $(x_a, y_a, w_a, h_a)$ denotes the anchor box and $(x, y, w, h)$ the ground-truth box, each given by center coordinates, width, and height.
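The parameterization and its inverse (used at decode time) are a few lines of code; the helper names here are illustrative:

```python
import math

def encode_box(anchor, gt):
    """Compute regression targets (tx, ty, tw, th) for an anchor and a
    ground-truth box, both given as (cx, cy, w, h)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode_box(anchor, t):
    """Invert encode_box: recover the predicted box from anchor + offsets."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))
```

The log-space width/height offsets keep the targets scale-normalized, which is what lets one regression head serve anchors of very different sizes.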
The total loss sums classification and regression over all levels and all anchors at those levels:
$L = \frac{1}{N_{cls}} \sum_{l,i} L_{cls}(p_{l,i}, p^*_{l,i}) + \frac{\lambda}{N_{reg}} \sum_{l,i} p^*_{l,i}\, L_{reg}(t_{l,i}, t^*_{l,i})$,
where $L_{cls}$ is the binary cross-entropy, $L_{reg}$ is the smooth L1 loss, and $p^*_{l,i} \in \{0, 1\}$ is the ground-truth label of anchor $i$ at level $l$. Only positive anchors participate in the regression term (Lu et al., 2018).
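A minimal sketch of the regression term, assuming the common smooth L1 with transition point `beta = 1`; the gating by the positive label mirrors the $p^*_{l,i}$ factor in the loss above (function names hypothetical):

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) loss: quadratic for |x| < beta, linear
    beyond, with matching value and slope at the transition."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def rpn_reg_loss(targets, preds, labels):
    """Sum smooth L1 over the 4 box offsets of positive anchors only."""
    total = 0.0
    for t, p, pos in zip(targets, preds, labels):
        if pos:  # negatives and ignored anchors contribute nothing
            total += sum(smooth_l1(ti - pi) for ti, pi in zip(t, p))
    return total
```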
3. Anchor Labeling, Training, and Inference
Anchors are labeled positive when their IoU with a ground-truth box exceeds a high threshold (0.7 in the standard Faster R-CNN scheme), negative when it falls below a low threshold (0.3), with ambiguous cases in between ignored. Mini-batching samples up to 256 anchors per image (at most 128 positives). Data augmentation can include flips and color jitter. Optimization commonly uses SGD with momentum and learning rate scheduling as specified per implementation (Lu et al., 2018).
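The labeling rule reduces to an IoU computation plus two thresholds; a self-contained sketch with the standard Faster R-CNN values as defaults:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gts, hi=0.7, lo=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored) based on the
    anchor's best IoU against all ground-truth boxes."""
    best = max((iou(anchor, g) for g in gts), default=0.0)
    if best >= hi:
        return 1
    if best < lo:
        return 0
    return -1  # ambiguous: excluded from the loss
```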
At inference, all anchors from the fused proposal map are decoded, filtered by a minimum objectness threshold, and pruned by non-maximum suppression (NMS, commonly at an IoU threshold of 0.7). The top-scoring proposals (typically 300) are forwarded to the second-stage classifier (Lu et al., 2018).
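Greedy NMS with a top-N cap, as used in this decode step, fits in a few lines; the IoU helper is repeated so the snippet is self-contained, and the defaults are the conventional values, not paper-specific settings:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.7, top_n=300):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop all
    remaining boxes overlapping it above iou_thresh; keep at most top_n."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < top_n:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```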
4. Comparison to Other Multi-Scale and Hierarchical Approaches
MS-RPN is differentiated from approaches such as HyperNet and Feature Pyramid Networks (FPN) primarily by its independent per-level proposal heads and late fusion strategy. Whereas HyperNet fuses features at multiple depths into a single Hyper Feature before region proposal, MS-RPN preserves distinct proposal heads at each scale and combines the resulting outputs after proposal scoring and regression (Kong et al., 2016; Lu et al., 2018). In contrast, FPN employs a top-down pathway with lateral connections to merge multi-scale features before applying a single, shared RPN head.
Quantitatively, recall ablation studies on the VIVA hand dataset with 300 proposals show a significant incremental benefit of fusing deeper features:
- C5-only: recall ≈ 75%
- C4+C5: recall ≈ 83%
- C3+C4+C5 (full MS-RPN): recall ≈ 88%
This incremental gain confirms the utility of multi-level predictions, especially for small objects that are poorly resolved at deeper, coarser levels (Lu et al., 2018).
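The recall metric used in these ablations is simply the fraction of ground-truth boxes covered by at least one proposal at a given IoU; a sketch (IoU helper repeated for self-containment, threshold 0.5 assumed as the conventional matching criterion):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def proposal_recall(gts, proposals, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    if not gts:
        return 0.0
    hit = sum(1 for g in gts if any(iou(p, g) >= iou_thresh for p in proposals))
    return hit / len(gts)
```

Evaluating this over a fixed proposal budget (e.g. the top 300) gives the recall-at-N numbers reported above.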
5. Performance Metrics and Ablation Results
MS-RPN achieves consistent recall and average precision improvements compared to single-scale baselines:
- On hand detection benchmarks, MS-RPN boosted average recall by ≈4.2 points (from ≈83.4% to ≈87.6%) at 300 proposals (VIVA Challenge Level 2).
- On the ARJ dataset, MS-RPN combined with R-FCN reached AP ≈84.0%, AR ≈86.4% (Lu et al., 2018).
Ablation reveals that high-resolution shallow feature maps (C3) are required for spatial precision on small targets, while deep maps (C5) contribute discriminative robustness for large objects.
The proposal stage achieves ≈5 FPS inference on modern GPU hardware (Titan X), and the two-stage system (with downstream region-wise classification) sustains ≈3 FPS end-to-end (Lu et al., 2018).
6. Extensions: Scale-Invariant and Position-Sensitive Variants
Recent MS-RPN derivatives incorporate position-sensitive pooling (as in R-FCN [Dai et al.]), large kernel convolution (GCN), and more sophisticated anchor matching:
- In "Toward Scale-Invariance and Position-Sensitive Region Proposal Networks," decoded feature maps at several pyramid levels (D₂–D₆) are processed with shared or non-shared large-kernel (GCN) heads for improved context (Lu et al., 2018).
- Anchors are matched directly in size and aspect ratio to their receptive windows, and position-sensitive heads preserve translation invariance for objectness while retaining translation variance for regression.
- Empirical results: average recall at 1,000 proposals (VOC07 test) increases from 0.480 (RPN baseline) to 0.653 (GCN-NS + PS); relative improvements on COCO reach +44%. Average recall for small objects also increases substantially (Lu et al., 2018).
A plausible implication is that further exploitation of hierarchical features and global context will continue to improve the balance between objectness robustness, spatial localization, and computational efficiency.
7. Summary Table: Multi-Scale RPN Architectural Overview
| Component | MS-RPN (Lu et al., 2018) | HyperNet (Kong et al., 2016) | Scale-Invariant PSP RPN (Lu et al., 2018) |
|---|---|---|---|
| Feature Layers | C3, C4, C5 (ResNet) | Conv3, Conv4, Conv5 (VGG16) | Conv2–5 (ResNet), top-down + skip |
| Proposal Heads | Independent per level, fused late | Single on fused Hyper Feature | Shared/non-shared, position-sensitive heads |
| Fusion | Upsampling, addition to stride=4 map | Channel concat, 3×3 compress + LRN | FPN-like, lateral & top-down |
| Inference FPS | 5 FPS (proposal stage) | 5 FPS (fast variant) | ≈22 FPS on 640×640, ≈10 FPS on 768×768 |
| Recall / AR gains | +4.2 pts over single-scale | 95–97% recall (50–100 props, IoU=0.5) | +36% (VOC), +44% (COCO) over baseline |
These comparisons synthesize core factual distinctions and quantitative advantages conferred by MS-RPN, as evidenced across the cited literature (Kong et al., 2016; Lu et al., 2018).