
Position-Sensitive Score Maps in IS-FCN

Updated 14 January 2026
  • Position-sensitive score maps are a set of k² specialized score maps that partition object proposals into spatial subregions for precise instance segmentation.
  • They modify the traditional FCN architecture by replacing the final 1×1 convolution with one that generates k² output channels, enabling fast end-to-end training and inference.
  • This approach delivers competitive segmentation performance with reduced computational overhead and memory usage compared to conventional per-proposal methods.

Position-sensitive score maps are a mechanism introduced within the Instance-Sensitive Fully Convolutional Networks (IS-FCN) architecture to facilitate instance-level segmentation using fully convolutional networks. Unlike conventional FCNs that produce a single per-pixel score map per class, position-sensitive score maps decompose the prediction process into a set of $k^2$ score maps, each responsible for modeling the likelihood that a pixel belongs to a specific relative spatial subregion (cell) within any candidate object bounding box. This design enables the efficient assembly of instance-level mask proposals directly from shared, low-dimensional output tensors, eliminating the need for high-dimensional per-proposal computation and supporting fast end-to-end training and inference (Dai et al., 2016).

1. Definition of Position-Sensitive Score Maps

Let $k$ denote the side of a uniform grid, and $M = k^2$ the total number of spatial cells used to tile the interior of any object bounding box. The IS-FCN outputs $M$ individual score maps $S_1, S_2, \dots, S_M$ (optionally including an $(M+1)$-th “background” channel). Each score map $S_p(x,y)$, for $p \in \{1,\dots,k^2\}$, encodes the likelihood that pixel $(x,y)$ falls into relative cell $p$ across all object instances in the image. The mapping from channel index to cell is

$$u = (p-1)\ \mathrm{mod}\ k,\quad v = \left\lfloor\frac{p-1}{k}\right\rfloor$$

so that score map $S_p$ is responsible for relative grid location $(u,v)$. Figure 1 in (Dai et al., 2016) illustrates with $k=3$ how each of the $9$ score maps “lights up” one spatial sub-cell of each instance.
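
The channel-to-cell correspondence above can be sketched as a pair of small helper functions (the function names are illustrative, not from the paper):

```python
def cell_of_channel(p, k):
    """Map 1-based channel index p to its relative grid cell (u, v),
    with u = (p-1) mod k (column) and v = floor((p-1) / k) (row)."""
    u = (p - 1) % k
    v = (p - 1) // k
    return u, v

def channel_of_cell(u, v, k):
    """Inverse mapping: cell (u, v) back to 1-based channel p = v*k + u + 1."""
    return v * k + u + 1
```

For $k=3$, channel $1$ maps to the top-left cell $(0,0)$ and channel $9$ to the bottom-right cell $(2,2)$.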

2. Network Architecture and Output Head Modification

In classical FCNs for semantic segmentation, the final layer typically employs a $1\times1$ convolution to output $C$ channels (number of classes). For position-sensitive score maps, this $1\times1$ convolution is replaced with one generating $M = k^2$ output channels (or $M+1$ with background). If the backbone produces features $F \in \mathbb{R}^{H\times W\times D}$, the head is

$$\text{conv }1\times1:\ D \rightarrow M, \qquad S(F;\Theta) \in \mathbb{R}^{H\times W\times M},$$

where $S_p(x,y) = f_p(I;\Theta)$ represents the response for grid cell $p$. This enables a single FCN forward pass to produce all position-sensitive maps for the entire image.
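
Because a $1\times1$ convolution mixes channels independently at each spatial location, the head reduces to a per-pixel matrix product. A minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def position_sensitive_head(F, Theta, b=None):
    """Apply a 1x1 convolution to backbone features F of shape (H, W, D),
    producing M = k^2 position-sensitive score maps of shape (H, W, M).

    Theta has shape (D, M); the 1x1 conv is a single matrix product
    applied at every pixel."""
    H, W, D = F.shape
    S = F.reshape(-1, D) @ Theta        # (H*W, M): per-pixel linear map
    if b is not None:
        S = S + b                       # optional bias over the M channels
    return S.reshape(H, W, Theta.shape[1])

# Example: a k = 3 grid yields M = 9 score maps from D = 64 backbone channels.
```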

3. Pixel-wise Labeling and Mathematical Formulation

Given an image $I$ and ground-truth instance bounding boxes $B_i = [x_{1i}, y_{1i}, x_{2i}, y_{2i}]$, training proceeds by assigning each pixel $(x,y)$ within any $B_i$ a ground-truth label $p^* \in \{1,\dots,k^2\}$ corresponding to its spatial cell in the relative grid. The process is as follows:

  • Compute normalized offsets within the bounding box:

$$\alpha = \frac{x - x_{1i}}{x_{2i} - x_{1i}},\quad \beta = \frac{y - y_{1i}}{y_{2i} - y_{1i}}$$

  • Quantize these offsets to grid bins:

$$u^* = \min(\lfloor k\alpha \rfloor,\ k-1),\quad v^* = \min(\lfloor k\beta \rfloor,\ k-1)$$

  • The ground-truth index is $p^* = v^* \cdot k + u^* + 1$.

Pixels not contained in any instance are labeled as background (channel $0$ or $k^2+1$).
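
The labeling steps above can be sketched as a single helper (a hypothetical function following the formulas in this section; boxes are assumed half-open, and background pixels get label $0$):

```python
import math

def gt_cell_label(x, y, box, k):
    """Assign pixel (x, y) its 1-based position-sensitive label p* in
    {1, ..., k^2} for box = (x1, y1, x2, y2); pixels outside the box
    are assigned the background label 0."""
    x1, y1, x2, y2 = box
    if not (x1 <= x < x2 and y1 <= y < y2):
        return 0                                 # background pixel
    alpha = (x - x1) / (x2 - x1)                 # normalized horizontal offset
    beta = (y - y1) / (y2 - y1)                  # normalized vertical offset
    u = min(math.floor(k * alpha), k - 1)        # quantize to grid column
    v = min(math.floor(k * beta), k - 1)         # quantize to grid row
    return v * k + u + 1
```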

4. Assembly of Instance Masks from Score Maps

Inference involves constructing instance candidates from the $k^2$ position-sensitive maps. For each candidate bounding box $R = [x_1, y_1, x_2, y_2]$ (obtained via sliding windows or region proposals), the following procedure is followed:

  • For each $(u,v)$ in a $k\times k$ grid ($u, v = 0, \dots, k-1$):

    • Select map $p = u + v\cdot k + 1$.
    • Map grid location to full image:

    $$x_\mathrm{map} = x_1 + (u + 0.5)\cdot\frac{w}{k},\quad y_\mathrm{map} = y_1 + (v + 0.5)\cdot\frac{h}{k} \qquad (w = x_2 - x_1,\ h = y_2 - y_1)$$

    • Extract $s_{uv} = S_p(x_\mathrm{map}, y_\mathrm{map})$.

  • Assemble the $k^2$ values into the $k \times k$ mask $M_R(u,v) = s_{uv}$ (optionally upsample to $R$’s full resolution).
  • Threshold $M_R$ (e.g., at $0.5$ after a sigmoid) to obtain a binary mask.
  • Aggregate the mask scores to score the box:

$$\mathrm{score}(R) = \frac{1}{k^2} \sum_{u,v} M_R(u,v)$$

  • Apply class-agnostic non-maximum suppression (NMS) to select high-scoring masks.

This pipeline is shown in Figure 2 of (Dai et al., 2016).
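
A minimal sketch of this assembly procedure, assuming the score maps are stored as a NumPy array of shape (H, W, k²) and sampled at cell centers with simple truncation (the actual implementation uses interpolation; all names here are illustrative):

```python
import numpy as np

def assemble_mask(S, box, k, threshold=0.5):
    """Assemble a k x k instance mask for candidate box (x1, y1, x2, y2)
    by sampling each of the k^2 position-sensitive score maps S (H, W, k^2)
    at the center of its corresponding grid cell."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    mask = np.zeros((k, k))
    for v in range(k):
        for u in range(k):
            p = v * k + u + 1                    # 1-based channel index
            x_map = int(x1 + (u + 0.5) * w / k)  # cell center, truncated
            y_map = int(y1 + (v + 0.5) * h / k)
            mask[v, u] = S[y_map, x_map, p - 1]
    probs = 1.0 / (1.0 + np.exp(-mask))          # sigmoid activation
    score = probs.mean()                         # box score = mean mask value
    return probs > threshold, score
```

Here the box score is taken as the mean of the sigmoid-activated mask values, one plausible reading of the averaging formula above.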

5. Training Objective and Optimization

Training is performed end-to-end with a per-pixel softmax cross-entropy loss over $k^2+1$ channels (including background). The loss function is

$$L(\Theta) = -\sum_{(x,y)\in \Omega_\mathrm{pos}} \log P(x, y, p^*(x,y); \Theta) \;-\; \sum_{(x,y)\in \Omega_\mathrm{neg}} \log P(x, y, 0; \Theta) + \lambda\|\Theta\|^2,$$

where:

  • $P(x, y, p; \Theta) = \mathrm{softmax}_p(S_\cdot(x, y))$,
  • $\Omega_\mathrm{pos}$ are pixels within instances (each with $p^*(x,y) \in \{1,\dots,k^2\}$),
  • $\Omega_\mathrm{neg}$ are sampled background pixels (target $0$), typically balancing pos:neg as $1{:}3$,
  • $\lambda$ is the weight-decay regularization coefficient.

This loss encourages each position-sensitive map to specialize in recognizing relative locations of objects and discourages false activations outside object regions.
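
A simplified NumPy sketch of this objective, which sums the cross-entropy over all labeled pixels and omits the pos:neg sampling (channel $0$ is taken as background; function and variable names are ours):

```python
import numpy as np

def ps_softmax_loss(S, labels, weight_decay=0.0, Theta=None):
    """Per-pixel softmax cross-entropy over k^2 + 1 channels.

    S      : (H, W, k^2 + 1) raw scores, channel 0 = background.
    labels : (H, W) integer targets, 0 for background pixels and
             p* in {1, ..., k^2} for pixels inside an instance."""
    S = S - S.max(axis=-1, keepdims=True)                 # numerical stability
    log_probs = S - np.log(np.exp(S).sum(axis=-1, keepdims=True))
    H, W, _ = S.shape
    # Negative log-likelihood of each pixel's target channel.
    nll = -log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    loss = nll.sum()
    if Theta is not None:
        loss += weight_decay * (Theta ** 2).sum()         # L2 regularization
    return loss
```

With uniform (all-zero) scores over $C$ channels, each pixel contributes $\log C$, which provides a quick sanity check.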

6. Relation to Prior Methods

The method is compared with R-FCN (Dai et al., 2016) and DeepMask (Pinheiro et al., 2015). R-FCN also uses $k^2$ position-sensitive maps, but only for classification and bounding-box regression; IS-FCN extends the idea to pixel-accurate segmentation masks. In contrast, DeepMask evaluates proposal-specific network branches, one per window location, incurring high computational and memory overhead. The position-sensitive mapping approach in IS-FCN requires only a single pass to compute the $k^2$ shared maps and assembles per-proposal masks via efficient cropping and interpolation, yielding an order-of-magnitude reduction in per-proposal computation and significant memory savings.

Empirical results show that this architecture permits end-to-end training for both localization and segmentation and provides competitive instance segmentation performance on PASCAL VOC and MS COCO benchmarks (see Table 1 in (Dai et al., 2016) for ablation by $k$ value).

7. Practical Considerations and Impact

The primary advantages of position-sensitive score maps are computational efficiency, low memory overhead, and the capacity for precise spatial localization within object proposals. By transforming the segmentation problem into the assembly of local spatial cues from compact, shared maps, IS-FCN and similar architectures enable scalable instance segmentation with relatively modest architectural modifications. A plausible implication is that this strategy facilitates the extension of FCN-based semantic segmentation architectures to instance-aware tasks with minimal increase in inference cost, and provides a general-purpose template for designing parsimonious prediction heads for spatially-structured outputs (Dai et al., 2016).

References

  • Dai, J., He, K., Li, Y., Ren, S., Sun, J. (2016). Instance-Sensitive Fully Convolutional Networks. European Conference on Computer Vision (ECCV).
