
Position-Sensitive Score Maps in IS-FCN

Updated 14 January 2026
  • Position-sensitive score maps are a set of k² specialized score maps that partition object proposals into spatial subregions for precise instance segmentation.
  • They modify the traditional FCN architecture by replacing the final 1×1 convolution with one that generates k² output channels, enabling fast end-to-end training and inference.
  • This approach delivers competitive segmentation performance with reduced computational overhead and memory usage compared to conventional per-proposal methods.

Position-sensitive score maps are a mechanism introduced within the Instance-Sensitive Fully Convolutional Networks (IS-FCN) architecture to facilitate instance-level segmentation using fully convolutional networks. Unlike conventional FCNs that produce a single per-pixel score map per class, position-sensitive score maps decompose the prediction process into a set of $k^2$ score maps, each responsible for modeling the likelihood that a pixel belongs to a specific relative spatial subregion (cell) within any candidate object bounding box. This design enables the efficient assembly of instance-level mask proposals directly from shared, low-dimensional output tensors, eliminating the need for high-dimensional per-proposal computation and supporting fast end-to-end training and inference (Dai et al., 2016).

1. Definition of Position-Sensitive Score Maps

Let $k$ denote the side of a uniform grid, and $M = k^2$ the total number of spatial cells used to tile the interior of any object bounding box. The IS-FCN outputs $M$ individual score maps $S_1, S_2, \dots, S_M$ (optionally including an $(M+1)$-th “background” channel). Each score map $S_p(x,y)$, for $p \in \{1,\dots,k^2\}$, encodes the likelihood that pixel $(x,y)$ falls into relative cell $p$ across all object instances in the image. The mapping from channel index to cell is

$$u = (p-1)\ \mathrm{mod}\ k,\quad v = \left\lfloor\frac{p-1}{k}\right\rfloor$$

so that score map $S_p$ is responsible for relative grid location $(u,v)$. Figure 1 in (Dai et al., 2016) illustrates with $k=3$ how each of the $9$ score maps “lights up” one spatial sub-cell of each instance.
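
The channel-to-cell correspondence above can be sketched as a pair of small helper functions (the function names are illustrative, not from the paper):

```python
def cell_of_channel(p, k):
    """Map 1-based channel index p to its relative grid cell (u, v),
    with u = (p-1) mod k (column) and v = floor((p-1) / k) (row)."""
    u = (p - 1) % k
    v = (p - 1) // k
    return u, v

def channel_of_cell(u, v, k):
    """Inverse mapping: cell (u, v) back to 1-based channel p = v*k + u + 1."""
    return v * k + u + 1
```

For $k=3$, channel $1$ maps to the top-left cell $(0,0)$ and channel $9$ to the bottom-right cell $(2,2)$.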

2. Network Architecture and Output Head Modification

In classical FCNs for semantic segmentation, the final layer typically employs a $1\times1$ convolution to output $C$ channels (number of classes). For position-sensitive score maps, this $1\times1$ convolution is replaced with one generating $M = k^2$ output channels (or $M+1$ with background). If the backbone produces features $F \in \mathbb{R}^{H\times W\times D}$, the head is

$$\text{conv }1\times1:\ D \rightarrow M, \qquad S(F;\Theta) \in \mathbb{R}^{H\times W\times M},$$

where $S_p(x,y) = f_p(I;\Theta)$ represents the response for grid cell $p$. This enables a single FCN forward pass to produce all position-sensitive maps for the entire image.
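
Because a $1\times1$ convolution mixes channels independently at each spatial location, the head reduces to a per-pixel matrix product. A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def position_sensitive_head(F, Theta, b=None):
    """Apply a 1x1 convolution to backbone features F of shape (H, W, D),
    producing M = k^2 position-sensitive score maps of shape (H, W, M).

    Theta has shape (D, M); the 1x1 conv is a single matrix product
    applied at every pixel."""
    H, W, D = F.shape
    S = F.reshape(-1, D) @ Theta        # (H*W, M): per-pixel linear map
    if b is not None:
        S = S + b                       # optional bias over the M channels
    return S.reshape(H, W, Theta.shape[1])

# Example: a k = 3 grid yields M = 9 score maps from D = 64 backbone channels.
```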

3. Pixel-wise Labeling and Mathematical Formulation

Given an image $I$ and ground-truth instance bounding boxes $B_i = [x_{1i}, y_{1i}, x_{2i}, y_{2i}]$, training proceeds by assigning each pixel $(x,y)$ within any $B_i$ a ground-truth label $p^* \in \{1,\dots,k^2\}$ corresponding to its spatial cell in the relative grid. The process is as follows:

  • Compute normalized offsets within the bounding box:

$$\alpha = \frac{x - x_{1i}}{x_{2i} - x_{1i}},\quad \beta = \frac{y - y_{1i}}{y_{2i} - y_{1i}}$$

  • Quantize these offsets to grid bins:

$$u^* = \min(\lfloor k\alpha \rfloor,\ k-1),\quad v^* = \min(\lfloor k\beta \rfloor,\ k-1)$$

  • The ground-truth index is $p^* = v^* \cdot k + u^* + 1$.

Pixels not contained in any instance are labeled as background (channel $0$ or $k^2+1$).
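
The labeling steps above can be sketched as a single helper (a hypothetical function following the formulas in this section; boxes are assumed half-open, and background pixels get label $0$):

```python
import math

def gt_cell_label(x, y, box, k):
    """Assign pixel (x, y) its 1-based position-sensitive label p* in
    {1, ..., k^2} for box = (x1, y1, x2, y2); pixels outside the box
    are assigned the background label 0."""
    x1, y1, x2, y2 = box
    if not (x1 <= x < x2 and y1 <= y < y2):
        return 0                                 # background pixel
    alpha = (x - x1) / (x2 - x1)                 # normalized horizontal offset
    beta = (y - y1) / (y2 - y1)                  # normalized vertical offset
    u = min(math.floor(k * alpha), k - 1)        # quantize to grid column
    v = min(math.floor(k * beta), k - 1)         # quantize to grid row
    return v * k + u + 1
```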

4. Assembly of Instance Masks from Score Maps

Inference involves constructing instance candidates from the $k^2$ position-sensitive maps. For each candidate bounding box $R = [x_1, y_1, x_2, y_2]$ (obtained via sliding windows or region proposals), the following procedure is followed:

  • For each $(u,v)$ in a $k\times k$ grid ($u, v = 0, \dots, k-1$):

    • Select map $p = u + v\cdot k + 1$.
    • Map grid location to full image:

    $$x_\mathrm{map} = x_1 + (u + 0.5)\cdot\frac{w}{k},\quad y_\mathrm{map} = y_1 + (v + 0.5)\cdot\frac{h}{k} \qquad (w = x_2 - x_1,\ h = y_2 - y_1)$$

    • Extract $s_{uv} = S_p(x_\mathrm{map}, y_\mathrm{map})$.

  • Assemble the $k^2$ values into the $k \times k$ mask $M_R(u,v) = s_{uv}$ (optionally upsample to $R$’s full resolution).
  • Threshold $M_R$ (e.g., at $0.5$ after a sigmoid) to obtain a binary mask.
  • Aggregate the mask scores to score the box:

$$\mathrm{score}(R) = \frac{1}{k^2} \sum_{u,v} M_R(u,v)$$

  • Apply class-agnostic non-maximum suppression (NMS) to select high-scoring masks.

This pipeline is shown in Figure 2 of (Dai et al., 2016).
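
A minimal sketch of this assembly procedure, assuming the score maps are stored as a NumPy array of shape (H, W, k²) and sampled at cell centers with simple truncation (the actual implementation uses interpolation; all names here are illustrative):

```python
import numpy as np

def assemble_mask(S, box, k, threshold=0.5):
    """Assemble a k x k instance mask for candidate box (x1, y1, x2, y2)
    by sampling each of the k^2 position-sensitive score maps S (H, W, k^2)
    at the center of its corresponding grid cell."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    mask = np.zeros((k, k))
    for v in range(k):
        for u in range(k):
            p = v * k + u + 1                    # 1-based channel index
            x_map = int(x1 + (u + 0.5) * w / k)  # cell center, truncated
            y_map = int(y1 + (v + 0.5) * h / k)
            mask[v, u] = S[y_map, x_map, p - 1]
    probs = 1.0 / (1.0 + np.exp(-mask))          # sigmoid activation
    score = probs.mean()                         # box score = mean mask value
    return probs > threshold, score
```

Here the box score is taken as the mean of the sigmoid-activated mask values, one plausible reading of the averaging formula above.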

5. Training Objective and Optimization

Training is performed end-to-end with a per-pixel softmax cross-entropy loss over $k^2+1$ channels (including background). The loss function is

$$L(\Theta) = -\sum_{(x,y)\in \Omega_\mathrm{pos}} \log P(x, y, p^*(x,y); \Theta) \;-\; \sum_{(x,y)\in \Omega_\mathrm{neg}} \log P(x, y, 0; \Theta) + \lambda\|\Theta\|^2,$$

where:

  • $P(x, y, p; \Theta) = \mathrm{softmax}_p(S_\cdot(x, y))$,
  • $\Omega_\mathrm{pos}$ are pixels within instances (each with $p^*(x,y) \in \{1,\dots,k^2\}$),
  • $\Omega_\mathrm{neg}$ are sampled background pixels (target $0$), typically balancing pos:neg as $1{:}3$,
  • $\lambda$ is the weight-decay regularization coefficient.

This loss encourages each position-sensitive map to specialize in recognizing relative locations of objects and discourages false activations outside object regions.
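
A simplified NumPy sketch of this objective, which sums the cross-entropy over all labeled pixels and omits the pos:neg sampling (channel $0$ is taken as background; function and variable names are ours):

```python
import numpy as np

def ps_softmax_loss(S, labels, weight_decay=0.0, Theta=None):
    """Per-pixel softmax cross-entropy over k^2 + 1 channels.

    S      : (H, W, k^2 + 1) raw scores, channel 0 = background.
    labels : (H, W) integer targets, 0 for background pixels and
             p* in {1, ..., k^2} for pixels inside an instance."""
    S = S - S.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=-1, keepdims=True))
    H, W, _ = S.shape
    # Negative log-likelihood of each pixel's target channel.
    nll = -log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    loss = nll.sum()
    if Theta is not None:
        loss += weight_decay * (Theta ** 2).sum()         # L2 regularization
    return loss
```

With uniform (all-zero) scores over $C$ channels, each pixel contributes $\log C$, which provides a quick sanity check.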

6. Relation to Prior Methods

The method is compared with R-FCN (Dai et al., 2016) and DeepMask (Pinheiro et al., 2015). R-FCN also uses $k^2$ position-sensitive maps, but only for classification and bounding-box regression; IS-FCN extends the idea to pixel-accurate segmentation masks. In contrast, DeepMask evaluates proposal-specific network branches, one per window location, incurring high computational and memory overhead. The position-sensitive mapping approach in IS-FCN requires only a single pass to compute the $k^2$ shared maps and assembles per-proposal masks via efficient cropping and interpolation, yielding an order-of-magnitude reduction in per-proposal computation and significant memory savings.

Empirical results show that this architecture permits end-to-end training for both localization and segmentation and provides competitive instance segmentation performance on PASCAL VOC and MS COCO benchmarks (see Table 1 in (Dai et al., 2016) for ablation by $k$ value).

7. Practical Considerations and Impact

The primary advantages of position-sensitive score maps are computational efficiency, low memory overhead, and the capacity for precise spatial localization within object proposals. By transforming the segmentation problem into the assembly of local spatial cues from compact, shared maps, IS-FCN and similar architectures enable scalable instance segmentation with relatively modest architectural modifications. A plausible implication is that this strategy facilitates the extension of FCN-based semantic segmentation architectures to instance-aware tasks with minimal increase in inference cost, and provides a general-purpose template for designing parsimonious prediction heads for spatially-structured outputs (Dai et al., 2016).

References

  • Dai, J., He, K., Li, Y., Ren, S., Sun, J. (2016). Instance-Sensitive Fully Convolutional Networks. European Conference on Computer Vision (ECCV).
