Percentage of Measured Phenotype (PMP)

Updated 29 May 2026

PMP is an evaluation metric that normalizes keypoint localization error by the associated phenotype length, ensuring biologically interpretable measurements.
It overcomes the limitations of global normalization by accurately penalizing errors in small, critical features such as eye diameter.
Empirical evidence shows that PMP improves model selection and reduces measurement error in automated phenotypic analysis in aquatic species.

The Percentage of Measured Phenotype (PMP) is an evaluation metric developed for quantifying the accuracy of keypoint detection in phenotypic analysis, particularly in fisheries and aquaculture contexts where precise morphometric characterization is critical. PMP addresses shortcomings of traditional pose-estimation metrics by assessing localization performance relative to the biological phenotype each keypoint helps define, thus providing biologically interpretable, phenotype-specific sensitivity to prediction errors (Liu et al., 2024).

1. Formal Definition

PMP is computed on a per-keypoint basis over a dataset of annotated specimens. Let $N$ denote the total number of test images, $j \in \{1, \ldots, 22\}$ index anatomical keypoints, $p_{j,n} \in \mathbb{R}^2$ be the model-predicted coordinates for keypoint $j$ in image $n$, and $g_{j,n} \in \mathbb{R}^2$ the corresponding ground truth. The relevant phenotype length for keypoint $j$ in image $n$ is denoted $\operatorname{pheno}_{j,n} > 0$, which is the pixel distance between the two anatomical landmarks that define the phenotype associated with $j$. A user-chosen threshold $\tau > 0$ determines permissible error; the canonical choice is $\tau = 0.1$.

The normalized squared error for each keypoint prediction is:

$d_{j,n} = \frac{\|p_{j,n} - g_{j,n}\|_2^2}{\operatorname{pheno}_{j,n}^2}$

A keypoint prediction is counted as correct if $d_{j,n} \le \tau$. The PMP score for keypoint $j$ is then:

$\mathrm{PMP}_j = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[d_{j,n} \le \tau\right]$

where $\mathbb{1}[\cdot]$ is the indicator function.

2. Conceptual Motivation

Conventional metrics for keypoint localization, such as Percentage of Correct Keypoints (PCK) and Object Keypoint Similarity (OKS), normalize prediction error by a global body measure (e.g., head length or bounding-box size) applied uniformly across all keypoints. In phenotypic analysis, however, phenotype-defining distances (e.g., total length, eye diameter) can vary by orders of magnitude within an organism, so a fixed normalization biases the evaluation toward large-scale structures and fails to penalize errors in small, biologically important features. PMP mitigates this limitation by using the actual length of the phenotype each keypoint defines as the normalization scale, thus directly assessing the effect of keypoint errors on phenotype measurements.

3. Computational Procedure

PMP calculation follows a standardized workflow:

For each test image $n$ and each keypoint $j$, retrieve ground truth $g_{j,n}$ and the landmark pair $(a, b)$ that defines $\operatorname{pheno}_{j,n}$.
Compute the squared Euclidean error $e_{j,n} = \|p_{j,n} - g_{j,n}\|_2^2$.
Normalize by phenotype length: $d_{j,n} = e_{j,n} / \operatorname{pheno}_{j,n}^2$.
Given threshold $\tau$ (default 0.1), count as correct if $d_{j,n} \le \tau$.
Aggregate over the dataset: $\mathrm{PMP}_j = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[d_{j,n} \le \tau\right]$.
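The workflow above can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation from Liu et al. (2024): the function name `pmp_scores`, the array layout, and the `pheno_pairs` argument are assumptions, and the sketch assumes the squared error is divided by the squared phenotype length measured on the ground-truth landmarks.

```python
import numpy as np


def pmp_scores(pred, gt, pheno_pairs, tau=0.1):
    """Per-keypoint PMP under the assumptions stated above.

    pred, gt    : arrays of shape (N, K, 2) with pixel coordinates.
    pheno_pairs : int array of shape (K, 2); row j holds the 0-based indices
                  (a, b) of the landmarks defining keypoint j's phenotype.
    tau         : threshold on the normalized squared error.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pairs = np.asarray(pheno_pairs, dtype=int)

    # Squared Euclidean error per image and keypoint, shape (N, K).
    sq_err = np.sum((pred - gt) ** 2, axis=-1)

    # Phenotype length from the ground-truth landmark pair, shape (N, K).
    # Assumed to be strictly positive, as in the formal definition.
    pheno = np.linalg.norm(gt[:, pairs[:, 0]] - gt[:, pairs[:, 1]], axis=-1)

    # Normalize, threshold, and average over the N images.
    normalized = sq_err / pheno ** 2
    return (normalized <= tau).mean(axis=0)
```

Calling `pmp_scores(pred, gt, pairs)` returns one score in [0, 1] per keypoint. Averaging those scores into a single number is a separate design choice that this sketch leaves to the caller.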

Illustrative Example:

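For eye diameter (ED) between keypoints 11 and 12, suppose the ground-truth landmarks are $g_{11,n}$ and $g_{12,n}$, and the predicted location of keypoint 11 is $p_{11,n}$. Here, $\operatorname{pheno}_{11,n} = \|g_{11,n} - g_{12,n}\|_2$ pixels and $e_{11,n} = \|p_{11,n} - g_{11,n}\|_2^2$, so $d_{11,n} = e_{11,n} / \operatorname{pheno}_{11,n}^2$; the prediction is counted as correct when $d_{11,n} \le \tau$.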
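A numeric instance of this check is sketched below. The coordinates are made up purely for illustration and are not taken from Liu et al. (2024); the sketch also assumes that the squared error is normalized by the squared phenotype length.

```python
import math

# Hypothetical pixel coordinates, chosen only for illustration.
g11 = (100.0, 50.0)   # ground truth, keypoint 11
g12 = (120.0, 50.0)   # ground truth, keypoint 12
p11 = (103.0, 54.0)   # prediction for keypoint 11
tau = 0.1             # default threshold

pheno = math.dist(g11, g12)                                # eye diameter: 20.0 px
sq_err = (p11[0] - g11[0]) ** 2 + (p11[1] - g11[1]) ** 2   # 3^2 + 4^2 = 25.0
d = sq_err / pheno ** 2                                    # 25 / 400 = 0.0625
is_correct = d <= tau                                      # True, since 0.0625 <= 0.1
```

With these values, a 5-pixel miss on a 20-pixel eye diameter still passes at the default threshold.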

4. Keypoint–Phenotype Mappings

Each keypoint is mapped to a specific phenotype, defined by a pair of anatomical landmarks. Table 1 summarizes several such mappings:

Phenotype	Keypoint Pair (a, b)
Total length (TL)	(1, 9)
Head length (HL)	(1, 2)
Eye diameter (ED)	(11, 12)
Body depth (BD)	(5, 6)
Dorsal fin height (DFH)	(20, 22)

This mapping ensures that each keypoint’s error is evaluated with respect to the phenotype it actually influences, rather than an arbitrary scale.
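One way to encode the pairs from Table 1 and derive phenotype lengths from a set of keypoints is sketched below. The names `PHENOTYPE_PAIRS` and `phenotype_lengths` are hypothetical, and the sketch assumes the 22 keypoints are stored in order, so that row `i - 1` holds keypoint `i`. The text does not say which phenotype applies when a keypoint appears in more than one pair (keypoint 1 belongs to both TL and HL), so that choice is left open here.

```python
import numpy as np

# Phenotype -> (a, b) keypoint pair, using the 1-based ids from Table 1.
PHENOTYPE_PAIRS = {
    "TL": (1, 9),
    "HL": (1, 2),
    "ED": (11, 12),
    "BD": (5, 6),
    "DFH": (20, 22),
}


def phenotype_lengths(keypoints):
    """Return the pixel length of each phenotype for one specimen.

    keypoints: array of shape (22, 2); row i - 1 holds keypoint i (assumed layout).
    """
    kp = np.asarray(keypoints, dtype=float)
    return {
        name: float(np.linalg.norm(kp[a - 1] - kp[b - 1]))
        for name, (a, b) in PHENOTYPE_PAIRS.items()
    }
```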

5. Interpretation and Threshold Selection

PMP values near 1.0 (e.g., $\mathrm{PMP}_j \ge 0.9$) indicate that on at least 90% of samples, keypoint localization is precise enough for downstream phenotype measurement, with the critical distance determined by the square root of $\tau$. Values near 0.5 or lower indicate unreliability for quantitatively rigorous applications. The selection of threshold $\tau$ is domain-specific; smaller $\tau$ enforces stricter localization, while larger $\tau$ allows more deviation. A typical starting value is $\tau = 0.1$; because the threshold applies to the squared normalized error, this admits a Euclidean error of up to $\sqrt{0.1} \approx 0.32$ times the phenotype length.
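To see what a given threshold means in pixels, the criterion can be inverted. The helper below is a hypothetical sketch; it assumes the normalized squared error is the squared Euclidean error divided by the squared phenotype length, so the largest accepted error is the square root of the threshold times the phenotype length.

```python
import math


def max_pixel_error(pheno_length, tau=0.1):
    """Largest Euclidean error (in pixels) still counted as correct.

    Assumes the criterion ||p - g||^2 / pheno^2 <= tau.
    """
    if pheno_length <= 0 or tau < 0:
        raise ValueError("pheno_length must be positive and tau non-negative")
    return math.sqrt(tau) * pheno_length
```

For instance, `max_pixel_error(20.0, tau=0.25)` gives 10.0 pixels, and halving the phenotype length halves the tolerance.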

6. Comparison with Established Metrics

PMP exhibits fundamental differences in both normalization and interpretability compared to PCK, OKS, and mean Average Precision (mAP):

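PCK: $\|p_{j,n} - g_{j,n}\|_2 \le \alpha \cdot s_n$, global body scale $s_n$; not phenotype-sensitive.
OKS: $\exp\left(-\|p_{j,n} - g_{j,n}\|_2^2 / (2 s_n^2 k_j^2)\right)$ with object scale $s_n$ and per-keypoint constant $k_j$, global object-level similarity.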
mAP: Ranks predictions by heatmap confidence, less direct for localization error.

By normalizing with respect to phenotype length, PMP is more sensitive to errors in small-scale features. For example, it penalizes equally sized pixel errors on eye diameter more stringently than on total length, reflecting their relative impact on phenotypic measurement. Experiments indicate that models selected by PMP exhibit lower absolute pixel deviations and that the metric is better aligned with the end goal of phenotypic quantification (Liu et al., 2024).
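The contrast can be made concrete with a small sketch. All lengths below are hypothetical, the PCK factor of 0.1 and the use of head length as its global scale are assumptions, and the PMP check assumes the squared error is normalized by the squared phenotype length.

```python
def pmp_correct(pixel_error, pheno_length, tau=0.1):
    """PMP-style check: squared error normalized by squared phenotype length."""
    return (pixel_error / pheno_length) ** 2 <= tau


def pck_correct(pixel_error, body_scale, alpha=0.1):
    """PCK-style check: error compared against a fraction of one global scale."""
    return pixel_error <= alpha * body_scale


# Hypothetical specimen measurements in pixels.
err = 7.0
total_length, eye_diameter, head_length = 300.0, 20.0, 80.0

pck_result = pck_correct(err, head_length)   # 7 <= 8: same verdict for every keypoint
pmp_tl = pmp_correct(err, total_length)      # (7/300)^2 is far below 0.1: accepted
pmp_ed = pmp_correct(err, eye_diameter)      # (7/20)^2 = 0.1225 > 0.1: rejected
```

The same 7-pixel miss is accepted by the global criterion everywhere, while the phenotype-normalized check rejects it on the eye diameter.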

7. Performance Evidence and Practical Impact

When used for model selection, PMP is reported to yield lower mean absolute percentage error (MAPE) in phenotype measurement than OKS and PCK. On common carp, models selected by PMP attained an average mMAPE of 8.0% versus 12.1% for OKS/PCK. PMP-guided selection produced quantitatively closer keypoints even on challenging anatomical landmarks (e.g., eye posterior/anterior, caudal fin base). Integrating PMP into the selection loop, regardless of model backbone (Hourglass or HRNet), consistently improved phenotype measurement across species. Use of the HRNet+ACR architecture selected by PMP reduced MAPE for common carp phenotype measurement from 6.6% (baseline) to 4.7%, with associated improvements in Pearson $r$ and $R^2$ (Liu et al., 2024). This suggests PMP is directly beneficial for high-precision biological measurement targets in automated image analysis pipelines.
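For reference, MAPE over a set of phenotype measurements can be computed as sketched below. This uses the standard definition of mean absolute percentage error; how the cited study aggregates it into mMAPE is not specified here, and the function name `mape` is hypothetical.

```python
import numpy as np


def mape(measured, reference):
    """Mean absolute percentage error, in percent (standard definition)."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if np.any(reference == 0):
        raise ValueError("reference measurements must be non-zero")
    return float(np.mean(np.abs(measured - reference) / np.abs(reference)) * 100.0)
```

Here `measured` would hold phenotype lengths derived from predicted keypoints and `reference` the lengths derived from manual annotation.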

References

Liu et al. (2024). Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding.
