Percentage of Measured Phenotype (PMP)
- PMP is an evaluation metric that normalizes keypoint localization error by the associated phenotype length, ensuring biologically interpretable measurements.
- It overcomes the limitations of global normalization by accurately penalizing errors in small, critical features such as eye diameter.
- Empirical evidence shows that PMP improves model selection and reduces measurement error in automated phenotypic analysis in aquatic species.
The Percentage of Measured Phenotype (PMP) is an evaluation metric developed for quantifying the accuracy of keypoint detection in phenotypic analysis, particularly in fisheries and aquaculture contexts where precise morphometric characterization is critical. PMP addresses shortcomings of traditional pose-estimation metrics by assessing localization performance relative to the biological phenotype each keypoint helps define, thus providing biologically interpretable, phenotype-specific sensitivity to prediction errors (Liu et al., 2024).
1. Formal Definition
PMP is computed on a per-keypoint basis over a dataset of annotated specimens. Let $N$ denote the total number of test images, $k$ index anatomical keypoints, $\hat{p}_{i,k}$ be the model-predicted coordinates for keypoint $k$ in image $i$, and $p_{i,k}$ the corresponding ground truth. The relevant phenotype length for keypoint $k$, image $i$ is denoted $L_{i,k}$, which is the pixel distance between the two anatomical landmarks that define the phenotype associated with $k$. A user-chosen threshold $\tau$ determines permissible error; the canonical choice is $\tau = 0.1$.
The normalized squared error for each keypoint prediction is:
$$d_{i,k} = \frac{\lVert \hat{p}_{i,k} - p_{i,k} \rVert_2^2}{L_{i,k}^2}$$
A keypoint prediction is counted as correct if $d_{i,k} \le \tau$. The PMP score for keypoint $k$ is then:
$$\mathrm{PMP}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(d_{i,k} \le \tau\right)$$
where $\mathbb{1}(\cdot)$ is the indicator function.
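The definition maps onto a few lines of code. The following is a minimal Python sketch, assuming the error is normalized by the squared phenotype length; the function name `pmp_for_keypoint` and its argument layout are illustrative choices, not part of the original work.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pmp_for_keypoint(
    preds: Sequence[Point],
    gts: Sequence[Point],
    lengths: Sequence[float],
    tau: float = 0.1,
) -> float:
    """Fraction of images whose normalized squared error is at most tau.

    preds, gts: predicted and ground-truth (x, y) for ONE keypoint, one entry per image.
    lengths:    phenotype length in pixels for that keypoint in each image.
    Assumption: the squared error is divided by the squared phenotype length.
    """
    if not (len(preds) == len(gts) == len(lengths)) or not preds:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = 0
    for (px, py), (gx, gy), length in zip(preds, gts, lengths):
        sq_err = (px - gx) ** 2 + (py - gy) ** 2   # squared Euclidean error
        d = sq_err / (length ** 2)                 # normalized squared error
        if d <= tau:                               # indicator function
            correct += 1
    return correct / len(preds)


# Hypothetical data: two images, phenotype length 10 px in both.
score = pmp_for_keypoint(
    preds=[(10.0, 10.0), (20.0, 20.0)],
    gts=[(10.0, 12.0), (30.0, 20.0)],
    lengths=[10.0, 10.0],
)
print(score)
```

In this made-up case the first prediction has a normalized error of 0.04 and the second of 1.0, so only one of the two images counts as correct.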
2. Conceptual Motivation
Conventional metrics for keypoint localization, such as Percentage of Correct Keypoints (PCK) and Object Keypoint Similarity (OKS), normalize prediction error by a global body measure (e.g., head length or bounding-box size) applied uniformly across all keypoints. In phenotypic analysis, however, phenotype-defining distances (e.g., total length, eye diameter) can vary by orders of magnitude within an organism, so a fixed normalization biases the evaluation toward large-scale structures and fails to penalize errors in small, biologically important features. PMP mitigates this limitation by using the actual length of the phenotype each keypoint defines as the normalization scale, thus directly assessing the effect of keypoint errors on phenotype measurements.
3. Computational Procedure
PMP calculation follows a standardized workflow:
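- For each test image $i$ and each keypoint $k$, retrieve ground truth $p_{i,k}$ and the keypoints $(a, b)$ that define $L_{i,k}$.
- Compute the squared Euclidean error between prediction and ground truth, $\lVert \hat{p}_{i,k} - p_{i,k} \rVert_2^2$.
- Normalize by phenotype length: $d_{i,k} = \lVert \hat{p}_{i,k} - p_{i,k} \rVert_2^2 / L_{i,k}^2$.
- Given threshold $\tau$ (default 0.1), count as correct if $d_{i,k} \le \tau$.
- Aggregate over the $N$ test images: $\mathrm{PMP}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(d_{i,k} \le \tau\right)$.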
Illustrative Example:
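For eye diameter (ED) between keypoints 11 and 12, the phenotype length is the pixel distance between the two eye landmarks, $L = \lVert p_{11} - p_{12} \rVert_2$. A predicted $\hat{p}_{11}$ is scored by $d = \lVert \hat{p}_{11} - p_{11} \rVert_2^2 / L^2$, and $d \le \tau$ yields a correct prediction.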
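To make the steps concrete, the sketch below runs them on a single eye-diameter keypoint. All coordinates are hypothetical values chosen for illustration (they are not taken from the source), and normalization by the squared phenotype length is assumed.

```python
import math

# Hypothetical ground-truth landmarks for eye diameter (keypoints 11 and 12), in pixels.
gt_11 = (100.0, 50.0)
gt_12 = (120.0, 50.0)
# Hypothetical model prediction for keypoint 11.
pred_11 = (103.0, 54.0)

tau = 0.1  # default threshold given in the procedure

# Phenotype length: distance between the two landmarks that define ED.
L = math.dist(gt_11, gt_12)  # 20.0 pixels

# Squared Euclidean localization error of the prediction: 3^2 + 4^2 = 25.
sq_err = (pred_11[0] - gt_11[0]) ** 2 + (pred_11[1] - gt_11[1]) ** 2

# Normalized squared error (assumed form): 25 / 400 = 0.0625.
d = sq_err / L ** 2

# 0.0625 <= 0.1, so this prediction counts as correct.
is_correct = d <= tau

print(L, sq_err, d, is_correct)
```

A 5-pixel miss on a 20-pixel eye passes at this threshold; under the same assumed normalization, any miss of more than about 6.3 pixels would fail.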
4. Keypoint–Phenotype Mappings
Each keypoint is mapped to a specific phenotype, defined by a pair of anatomical landmarks. Table 1 summarizes several such mappings:
| Phenotype | Keypoint Pair (a, b) |
|---|---|
| Total length (TL) | (1, 9) |
| Head length (HL) | (1, 2) |
| Eye diameter (ED) | (11, 12) |
| Body depth (BD) | (5, 6) |
| Dorsal fin height (DFH) | (20, 22) |
This mapping ensures that each keypoint’s error is evaluated with respect to the phenotype it actually influences, rather than an arbitrary scale.
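One way to encode Table 1 is a dictionary from phenotype to keypoint pair, from which per-image phenotype lengths and per-keypoint PMP can be derived. The sketch below is illustrative: the data layout (a dict from keypoint index to pixel coordinates for each image), the helper names, and the choice to measure phenotype length on the ground-truth landmarks are assumptions rather than details confirmed by the source. Because keypoint 1 appears in both TL and HL in Table 1, the sketch scores keypoints per phenotype instead of assuming a unique keypoint-to-phenotype assignment.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Keypoints = Dict[int, Point]  # keypoint index -> (x, y) in pixels

# Phenotype -> landmark pair (a, b), as listed in Table 1.
PHENOTYPE_PAIRS: Dict[str, Tuple[int, int]] = {
    "TL": (1, 9),
    "HL": (1, 2),
    "ED": (11, 12),
    "BD": (5, 6),
    "DFH": (20, 22),
}


def phenotype_length(gt: Keypoints, pair: Tuple[int, int]) -> float:
    """Pixel distance between the two landmarks (assumed: ground-truth landmarks)."""
    a, b = pair
    return math.dist(gt[a], gt[b])


def pmp_per_phenotype(
    preds: List[Keypoints], gts: List[Keypoints], phenotype: str, tau: float = 0.1
) -> Dict[int, float]:
    """PMP of each of the two keypoints that define `phenotype`."""
    pair = PHENOTYPE_PAIRS[phenotype]
    hits = {k: 0 for k in pair}
    for pred, gt in zip(preds, gts):
        length = phenotype_length(gt, pair)
        for k in pair:
            sq_err = (pred[k][0] - gt[k][0]) ** 2 + (pred[k][1] - gt[k][1]) ** 2
            hits[k] += int(sq_err / length ** 2 <= tau)
    return {k: hits[k] / len(gts) for k in pair}


# Hypothetical two-image dataset containing only the eye landmarks.
gts = [{11: (100.0, 50.0), 12: (120.0, 50.0)}, {11: (200.0, 80.0), 12: (230.0, 80.0)}]
preds = [{11: (103.0, 54.0), 12: (120.0, 51.0)}, {11: (200.0, 95.0), 12: (231.0, 80.0)}]
scores = pmp_per_phenotype(preds, gts, "ED")
print(scores)
```

With these made-up coordinates, keypoint 11 is missed in the second image (a 15-pixel error on a 30-pixel eye), while keypoint 12 is within tolerance in both.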
5. Interpretation and Threshold Selection
PMP values near 1.0 (e.g., $\mathrm{PMP}_k \ge 0.9$) indicate that on at least 90% of samples, keypoint localization is precise enough for downstream phenotype measurement, with the critical distance determined by the square root of $\tau$. Values near 0.5 or lower indicate unreliability for quantitatively rigorous applications. The selection of threshold $\tau$ is domain-specific; smaller $\tau$ enforces stricter localization, while larger $\tau$ allows more deviation. A typical starting value corresponds to approximately 5% RMS localization error relative to the phenotype length.
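If the error is normalized by the squared phenotype length (an assumption of this sketch), the correctness criterion is equivalent to a pixel tolerance of $\sqrt{\tau}\,L$ around the ground-truth keypoint. The hypothetical helpers below make that relation explicit so a candidate threshold can be inspected in pixel terms; the function names and example lengths are made up for illustration.

```python
import math


def pixel_tolerance(phenotype_length_px: float, tau: float) -> float:
    """Largest Euclidean error (pixels) still counted as correct.

    Follows from ||err||^2 / L^2 <= tau  <=>  ||err|| <= sqrt(tau) * L,
    assuming normalization by the squared phenotype length.
    """
    if phenotype_length_px <= 0 or tau < 0:
        raise ValueError("length must be positive and tau non-negative")
    return math.sqrt(tau) * phenotype_length_px


def tau_for_relative_error(relative_error: float) -> float:
    """Threshold that admits errors up to relative_error * L (e.g. 0.05 for 5%)."""
    if relative_error < 0:
        raise ValueError("relative_error must be non-negative")
    return relative_error ** 2


# Hypothetical phenotype lengths: a 20 px eye and a 400 px total length.
tau_5pct = tau_for_relative_error(0.05)
for name, length in [("eye diameter", 20.0), ("total length", 400.0)]:
    print(name, round(pixel_tolerance(length, tau_5pct), 3))
```

The same threshold gives a tolerance of about one pixel on the small phenotype and about twenty pixels on the large one, which is the phenotype-specific behaviour the metric is designed to have.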
6. Comparison with Established Metrics
PMP exhibits fundamental differences in both normalization and interpretability compared to PCK, OKS, and mean Average Precision (mAP):
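- PCK: a keypoint is correct if $\lVert \hat{p}_{i,k} - p_{i,k} \rVert_2 \le \alpha\, s_i$, with global body scale $s_i$; not phenotype-sensitive.
- OKS: $\mathrm{OKS} = \frac{\sum_k \exp\left(-\lVert \hat{p}_k - p_k \rVert_2^2 / (2 s^2 \kappa_k^2)\right) \delta(v_k > 0)}{\sum_k \delta(v_k > 0)}$, with object scale $s$, per-keypoint constant $\kappa_k$, and visibility flag $v_k$; global object-level similarity.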
- mAP: Ranks predictions by heatmap confidence, less direct for localization error.
By normalizing with respect to phenotype length, PMP is more sensitive to errors in small-scale features. For example, it penalizes a pixel error of a given size more stringently on eye diameter than on total length, reflecting the relative impact on phenotypic measurement. Experiments indicate that models selected with PMP yield lower absolute pixel deviations and that the metric is better aligned with the end goal of phenotypic quantification (Liu et al., 2024).
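The difference in sensitivity can be illustrated by applying one pixel error to two phenotypes of very different size. In the sketch below, the lengths, the 6-pixel error, the PCK factor `alpha`, and the use of total length as the PCK reference scale are all hypothetical choices; the PCK rule is the common "distance within alpha times a global scale" form, and the PMP rule assumes normalization by the squared phenotype length.

```python
def pck_correct(err_px: float, global_scale_px: float, alpha: float) -> bool:
    """PCK-style rule: error within alpha times one global body scale."""
    return err_px <= alpha * global_scale_px


def pmp_correct(err_px: float, phenotype_length_px: float, tau: float) -> bool:
    """PMP-style rule: squared error over squared phenotype length within tau."""
    return (err_px ** 2) / (phenotype_length_px ** 2) <= tau


# Hypothetical specimen: total length 400 px, eye diameter 15 px.
total_length_px = 400.0
eye_diameter_px = 15.0
err_px = 6.0   # same pixel error on a TL keypoint and on an ED keypoint
alpha = 0.05   # hypothetical PCK factor; reference scale = total length
tau = 0.1      # PMP threshold

# PCK uses one global scale, so both keypoints get the same verdict (6 px vs ~20 px).
pck_tl = pck_correct(err_px, total_length_px, alpha)
pck_ed = pck_correct(err_px, total_length_px, alpha)

# PMP uses each keypoint's own phenotype length.
pmp_tl = pmp_correct(err_px, total_length_px, tau)  # 36 / 160000, far below tau
pmp_ed = pmp_correct(err_px, eye_diameter_px, tau)  # 36 / 225 = 0.16, above tau

print(pck_tl, pck_ed, pmp_tl, pmp_ed)
```

Under these assumptions PCK accepts both predictions, while PMP accepts the total-length keypoint and rejects the eye keypoint.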
7. Performance Evidence and Practical Impact
When used for model selection, PMP has been shown to lead to lower mean absolute percentage error (MAPE) in phenotype measurement than OKS and PCK. On common carp, models selected by PMP attained an average mMAPE of 8.0% versus 12.1% for OKS/PCK. PMP-guided selection produced quantitatively closer keypoints even on challenging anatomical landmarks (e.g., eye posterior/anterior, caudal fin base). Integrating PMP into the selection loop, regardless of model backbone (Hourglass or HRNet), consistently improved phenotype measurement across species. Use of the HRNet+ACR architecture selected by PMP reduced MAPE for common carp phenotype measurement from 6.6% (baseline) to 4.7%, with associated improvements in Pearson correlation statistics (Liu et al., 2024). This suggests PMP is directly beneficial for high-precision biological measurement targets in automated image analysis pipelines.
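The selection loop described here can be outlined as: score each candidate checkpoint with a validation metric, keep the best one, then check the phenotype measurement error of the selected model. The following is only a sketch under assumptions. The checkpoint names, validation scores, and measurements are invented placeholders, the helper names are hypothetical, and MAPE is computed with its usual definition (mean absolute relative error, in percent).

```python
from typing import Dict, Sequence


def mape(measured: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute percentage error between measured and reference phenotype values."""
    if len(measured) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(m - r) / abs(r) for m, r in zip(measured, reference))
    return 100.0 * total / len(reference)


def select_best(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest validation score."""
    return max(scores, key=scores.get)


# Placeholder validation PMP per checkpoint (higher is better).
val_pmp = {"ckpt_a": 0.81, "ckpt_b": 0.90, "ckpt_c": 0.87}
chosen = select_best(val_pmp)

# Placeholder phenotype measurements from the chosen model vs. manual reference.
measured = [19.0, 42.0]
reference = [20.0, 40.0]
error_pct = mape(measured, reference)

print(chosen, round(error_pct, 2))
```

Swapping `val_pmp` for a dictionary of OKS or PCK scores would reproduce the kind of comparison reported above, where the resulting MAPE values are compared across selection metrics.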