Mask-SSIM Score Analysis
- Mask-SSIM Score is a metric that extends SSIM by incorporating spatially adaptive masking to better capture perceptual differences in segmentation outputs.
- It employs both hand-crafted and learned masking strategies to align error weighting with human visual sensitivity, improving evaluation accuracy.
- Applications span image restoration and clinical segmentation tasks, where precise boundary delineation and perceptual fidelity are critical.
Mask-SSIM Score quantifies the structural similarity between two images or image-like data—typically a predicted and a reference segmentation mask—by incorporating spatially adaptive masking into the Structural Similarity Index Measure (SSIM). Its primary aim is to either (1) provide a more perceptually calibrated assessment of image differences by explicitly modeling human visual masking effects, or (2) evaluate the agreement between binary or soft segmentation outputs and ground truth, with or without explicit weighting. The term thus covers a family of methodologies, ranging from standard SSIM applied to masks to learned, per-pixel modulated SSIM (“Masked SSIM”) that better aligns automated metric output with human opinion scores.
1. Theoretical Foundation of SSIM and Mask-SSIM
The Structural Similarity Index Measure (SSIM) compares local patterns of pixel intensities, normalized for luminance and contrast, across two images. Given single-channel inputs $x$ and $y$, SSIM is evaluated over local windows of fixed size (e.g., $11 \times 11$ with Gaussian weighting), producing a local score at position $p$:

$$\mathrm{SSIM}(x, y; p) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

with local means $\mu_x, \mu_y$, variances $\sigma_x^2, \sigma_y^2$, and covariance $\sigma_{xy}$ defined within the window centered at $p$, and stability constants $C_1, C_2$. The image-level SSIM is the unweighted mean

$$\mathrm{SSIM}(x, y) = \frac{1}{|P|} \sum_{p \in P} \mathrm{SSIM}(x, y; p),$$

where $P$ denotes all pixel centers for which a full window is defined.
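A minimal NumPy/SciPy sketch of these formulas follows. It approximates the Gaussian-windowed local statistics with `gaussian_filter`, so border handling differs slightly from reference implementations such as scikit-image; parameter names mirror the constants above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_map(x, y, sigma=1.5, K1=0.01, K2=0.03, L=1.0):
    """Per-pixel SSIM map for single-channel float images with dynamic range L."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = gaussian_filter(x, sigma), gaussian_filter(y, sigma)
    # Local (co)variances via E[ab] - E[a]E[b], with Gaussian-weighted expectations.
    var_x = gaussian_filter(x * x, sigma) - mu_x**2
    var_y = gaussian_filter(y * y, sigma) - mu_y**2
    cov_xy = gaussian_filter(x * y, sigma) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2)
    return num / den

def ssim_score(x, y, **kw):
    """Image-level SSIM: unweighted mean of the local map."""
    return float(ssim_map(x, y, **kw).mean())
```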
The Mask-SSIM concept generalizes this by incorporating a per-pixel mask $m(p) \in [0, 1]$, applied as a spatial modulator of the inputs, $\tilde{x} = m \odot x$ and $\tilde{y} = m \odot y$, yielding

$$\text{Mask-SSIM}(x, y) = \mathrm{SSIM}(m \odot x,\; m \odot y).$$
An alternative but mathematically similar formulation interprets $m$ as directly weighting the SSIM map:

$$\text{Mask-SSIM}(x, y) = \frac{\sum_{p \in P} m(p)\, \mathrm{SSIM}(x, y; p)}{\sum_{p \in P} m(p)}.$$

This approach underlies both hand-crafted and learned weighting strategies (Çoğalan et al., 2023).
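Both formulations can be sketched with scikit-image, whose `structural_similarity` returns the per-pixel SSIM map when called with `full=True`. The mask `m` here is assumed to be any array in $[0, 1]$ with the image's shape.

```python
from skimage.metrics import structural_similarity

def weighted_mask_ssim(x, y, m):
    """Weight the per-pixel SSIM map by the mask m (second formulation)."""
    _, s_map = structural_similarity(
        x, y, data_range=1.0, gaussian_weights=True, full=True
    )
    return float((m * s_map).sum() / m.sum())

def modulated_mask_ssim(x, y, m):
    """Modulate the inputs by m, then take standard SSIM (first formulation)."""
    return structural_similarity(m * x, m * y, data_range=1.0, gaussian_weights=True)
```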
2. Computing Mask-SSIM for Segmentation Tasks
In segmentation evaluation contexts, both the predicted mask $\hat{M}$ and the ground-truth mask $M$ are converted to single-channel binarized images (values in $\{0, 255\}$), then linearly mapped to floats in $[0, 1]$. No further smoothing or enhancement is applied beyond the Gaussian SSIM window. For each test image in the evaluation set, SSIM is computed as specified above; the mean over all test images constitutes the reported SSIM (or Mask-SSIM) score. No region-based or instance-based weighting is applied; all pixels across all test images are treated equally (Peter et al., 8 Aug 2025).
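A sketch of this protocol using scikit-image follows. The binarization threshold of 128 for uint8 inputs is an assumption, and the SSIM parameters mirror the values listed in Section 4.

```python
import numpy as np
from skimage.metrics import structural_similarity

def binarize(mask, thresh=128):
    """Binarize a uint8 mask, mapping it linearly to binary floats in [0, 1]."""
    return (np.asarray(mask) >= thresh).astype(np.float64)

def dataset_mask_ssim(pred_masks, gt_masks):
    """Mean SSIM over all test images; every pixel and image weighted equally."""
    scores = [
        structural_similarity(
            binarize(p), binarize(g),
            win_size=11, gaussian_weights=True, sigma=1.5,
            K1=0.01, K2=0.03, data_range=1.0,  # parameters as listed in Section 4
        )
        for p, g in zip(pred_masks, gt_masks)
    ]
    return float(np.mean(scores))
```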
3. Integration of Self-Supervised Visual Masking
A distinct but closely related methodology, proposed by Çoğalan et al., automatically learns a spatial visual mask via self-supervised training. The mask generator (a small CNN with $3 \times 3$ convolutions and a final Sigmoid) accepts both the reference and the distorted image, outputting a per-pixel mask $m(p) \in [0, 1]$, which then modulates the input images prior to SSIM computation:

$$\text{Mask-SSIM}(x, y) = \mathrm{SSIM}(m \odot x,\; m \odot y).$$

The network is optimized with the mean squared error between a scaled Mask-SSIM score and ground-truth mean opinion scores (MOS) over a diverse reference-distortion dataset. This approach endows the mask with properties such as contrast-dependent masking, improved error localization, and alignment with human visual sensitivity. The learned Mask-SSIM improves correlation with MOS across multiple datasets (CSIQ, TID2013, PIPAL), outperforming both standard SSIM and naive saliency-based weighting (Çoğalan et al., 2023).
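A minimal PyTorch sketch of this setup is given below, assuming the third-party `pytorch-msssim` package for a differentiable SSIM; the network depth, channel widths, and the MOS scaling factor `alpha` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
from pytorch_msssim import ssim  # differentiable SSIM (pip install pytorch-msssim)

class MaskGenerator(nn.Module):
    """Small CNN with 3x3 convolutions and a Sigmoid head, emitting a
    per-pixel mask m(p) in [0, 1]. Depth and widths are assumptions."""
    def __init__(self, in_ch=6, width=32):  # reference + distorted RGB, concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, ref, dist):
        return self.net(torch.cat([ref, dist], dim=1))

def mask_ssim_loss(gen, ref, dist, mos, alpha=100.0):
    """MSE between a scaled Mask-SSIM score and ground-truth MOS;
    alpha is a hypothetical scale matching the MOS range."""
    m = gen(ref, dist)                                  # (N, 1, H, W) mask
    scores = ssim(m * ref, m * dist, data_range=1.0,
                  size_average=False)                   # per-image Mask-SSIM, (N,)
    return nn.functional.mse_loss(alpha * scores, mos)
```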
4. Application to Polyp Segmentation: Protocol and Results
In automated polyp segmentation, Mask-SSIM (as standard SSIM on masks) is used to assess the spatial and structural agreement between predicted and annotated regions. The evaluation protocol in (Peter et al., 8 Aug 2025) comprises:
- Pre-processing: Binarization of both $\hat{M}$ and $M$, with scaling to $[0, 1]$.
- SSIM parameters: Gaussian window of size $11 \times 11$, $\sigma = 1.5$, $K_1 = 0.01$, $K_2 = 0.03$, dynamic range $L = 1$ (the standard SSIM defaults).
- Aggregation: Mean SSIM over all test set images.
Performance of five architectures (U-Net, PSPNet, FPN, LinkNet, MANet—all with ResNet34 backbone) is summarized below:
| Model | PSNR (dB) | SSIM |
|---|---|---|
| U-Net | 7.050282 | 0.491042 |
| PSPNet | 6.637351 | 0.445212 |
| FPN | 7.205893 | 0.492381 |
| LinkNet | 7.014127 | 0.475003 |
| MANet | 7.073846 | 0.476115 |
FPN achieves the highest SSIM (0.492381), attributed to its multi-scale fusion (top-down/lateral connections), synergy with synthetic data augmentation, and the use of a hybrid loss (BCE + Dice + Focal) that promotes precise boundary delineation. The slightly superior PSNR for FPN further suggests reduced noise and higher fidelity in mask generation (Peter et al., 8 Aug 2025).
5. Interpretation and Limitations
SSIM, when applied to segmentation masks, focuses on spatial correlation, boundary continuity, and contrast. It is more sensitive to structure (e.g., boundary alignment, local object shape) than pixelwise region-overlap metrics such as IoU or Dice. However, several limitations are recognized:
- SSIM was developed for continuous-tone images, so its use on binary masks can under-penalize minor boundary misalignments when large homogeneous regions agree.
- Unlike IoU or Dice, SSIM is attuned more to local structure and contrast than to global region overlap.
- Without explicit region-of-interest weighting, SSIM gives equal importance to errors in any part of the image, including background, which may not be optimal for tasks where boundary fidelity is critical.
- High SSIM may coexist with moderate discrepancies in region overlap, underscoring the need to employ SSIM in conjunction with IoU, Dice, or other complementary metrics (Peter et al., 8 Aug 2025); the toy example below illustrates this divergence.
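To make the overlap-versus-structure distinction concrete, the following sketch (with hypothetical masks, using scikit-image) contrasts SSIM and Dice on a filled square shifted by a few pixels; the two metrics respond differently because the large agreeing background contributes to SSIM while Dice tracks only the foreground overlap.

```python
import numpy as np
from skimage.metrics import structural_similarity

def dice(a, b):
    """Dice coefficient for binary float masks."""
    a, b = a > 0.5, b > 0.5
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

# Hypothetical toy case: a filled square vs. the same square shifted by 4 px.
gt = np.zeros((64, 64)); gt[20:44, 20:44] = 1.0
pred = np.zeros((64, 64)); pred[24:48, 24:48] = 1.0

s = structural_similarity(gt, pred, data_range=1.0, gaussian_weights=True)
print(f"SSIM = {s:.3f}, Dice = {dice(gt, pred):.3f}")
```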
The learned Mask-SSIM paradigm further addresses some of these issues by weighting error contributions in a human-derived manner, lowering sensitivity in highly textured or low-salience regions and increasing it in smooth or perceptually important areas (Çoğalan et al., 2023).
6. Extensions and Contextual Impact
The Mask-SSIM framework is not limited to mask comparison; it can be generalized to any image similarity assessment where perceptual relevance varies spatially. By learning the mask $m$ in a self-supervised or end-to-end fashion, Mask-SSIM can be adapted to specific distortion types, applications, or human performance proxies (e.g., MOS). Empirical results establish that this approach yields consistently higher correlation with human perception than standard SSIM or saliency-weighted variants across multiple large-scale datasets (KADID-10k, CSIQ, TID2013, PIPAL).
In clinical image segmentation and restoration, where both per-pixel accuracy and perceived boundary fidelity affect downstream decisions, Mask-SSIM (and its learned variants) offers a technically rigorous, perceptually informed assessment metric for both model evaluation and optimization (Peter et al., 8 Aug 2025, Çoğalan et al., 2023).