Keypoint-Based Evaluation Metrics
- Keypoint-based evaluation metrics are quantitative criteria that measure the accuracy, consistency, and practical utility of detection algorithms across diverse domains.
- They integrate classical geometric measures like AP/mAP with unified scores such as LRP/oLRP and human-centric metrics like semantic accuracy and richness.
- These metrics provide actionable insights into performance, guiding improvements for applications ranging from sign language processing to medical imaging.
Keypoint-based evaluation metrics are quantitative criteria designed to assess the quality of keypoint detection algorithms across computer vision, medical imaging, 3D shape analysis, and sign language processing tasks. These metrics evaluate diverse aspects, including geometric accuracy, repeatability under nuisance transformations, semantic interpretability, downstream utility (e.g., morphology or geometry estimation), and practical deployment properties such as threshold interpretability and calibration. State-of-the-art literature features classical geometric metrics (repeatability, AP, mAP), cross-task measures (e.g., LRP/oLRP), semantic user-aligned ratings, and task-specific metrics such as phenotype sensitivity or two-view geometry suitability.
1. Classical and Geometric Keypoint Metrics
Traditional metrics for evaluating keypoint detectors include repeatability rates, Average Precision (AP), mean Average Precision (mAP), and localization-error–based quantities. These scores form the basis of most benchmarks for both hand-crafted (SIFT, Harris, ORB, etc.) and deep-learned (SuperPoint, DISK, LF-Net) keypoint pipelines (Bartol et al., 2020).
- Repeatability measures the fraction of keypoints consistently detected across different views or transforms, typically at a fixed tolerance in pixels.
- Average Precision (AP) quantifies, for a ranked list of keypoint matches (e.g., ranked by descriptor similarity), the ability to rank true correspondences ahead of false ones. For a list ranked by distance with labels $y_k \in \{0,1\}$ (1 marking a true correspondence), $\mathrm{AP} = \tfrac{1}{\sum_k y_k}\sum_k \mathrm{Precision}@k \cdot y_k$. AP and its mean over multiple queries (mAP) are widely used in keypoint verification, image matching, and retrieval tasks (a minimal computation sketch appears below).
- Keypoint-matching mAP is central for keypoint-based image-matching benchmarks, measuring how well correct matches dominate the candidate ranking (Bartol et al., 2020).
These geometric metrics prioritize correspondence or localization quality but do not account for the semantic meaning, distinctive utility, or downstream task relevance of the selected points.
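As a concrete illustration of the AP definition above, the following is a minimal sketch (illustrative variable names, not any benchmark's reference implementation) that ranks candidate matches by descriptor distance and averages precision at each true correspondence:

```python
# Minimal AP sketch: rank candidate keypoint matches by descriptor distance and
# average precision@k over the ranks of the true correspondences.
import numpy as np

def average_precision(distances: np.ndarray, labels: np.ndarray) -> float:
    """distances: descriptor distances per candidate match; labels: 1 = true match."""
    order = np.argsort(distances)               # best (smallest distance) first
    y = labels[order].astype(float)
    if y.sum() == 0:
        return 0.0
    cum_tp = np.cumsum(y)                        # true positives among the top-k
    precision_at_k = cum_tp / (np.arange(len(y)) + 1)
    return float((precision_at_k * y).sum() / y.sum())

# Toy example with three true matches among five candidates; mAP is the mean
# of such per-query (or per-image-pair) AP values.
d = np.array([0.1, 0.4, 0.2, 0.8, 0.3])
y = np.array([1, 0, 1, 0, 1])
print(average_precision(d, y))
```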
2. Unified and Interpretable Composite Metrics
Recent work emphasizes the need for metrics that integrate localization error with detection (classification) errors, and provide actionable breakdowns.
- Localisation Recall Precision (LRP) Error and Optimal LRP (oLRP) unify mis-localization, false positives, and false negatives into a single normalized measure applicable to keypoint detection (Oksuz et al., 2020). For keypoints matched at distance threshold $\tau$ and confidence threshold $s$,
$$\mathrm{LRP}(\tau, s) = \frac{1}{N_{TP} + N_{FP} + N_{FN}} \left( \sum_{i=1}^{N_{TP}} \frac{d_i}{\tau} + N_{FP} + N_{FN} \right),$$
where $N_{TP}$, $N_{FP}$, $N_{FN}$ are the true-positive, false-positive, and false-negative counts after matching, and $d_i$ is the Euclidean distance between the $i$-th true positive and its matched ground-truth keypoint. oLRP is the minimum LRP over confidence thresholds, yielding a robust operating point and an immediate interpretation of error sources.
LRP/oLRP separates the localization, FP, and FN contributions, enabling an actionable error breakdown, and is less sensitive to the confidence distribution than AP. It is recommended for complementary reporting alongside AP.
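A minimal sketch of the keypoint LRP/oLRP computation above, assuming detections have already been matched to ground truth at a fixed pixel radius `tau`; the variable names and the confidence sweep are illustrative, not the reference implementation:

```python
# Sketch of keypoint LRP and oLRP under the formula above (matching assumed done).
import numpy as np

def lrp(tp_dists, n_fp, n_fn, tau):
    """LRP error: normalized TP localization error plus FP and FN counts."""
    tp_dists = np.asarray(tp_dists, dtype=float)
    denom = len(tp_dists) + n_fp + n_fn
    if denom == 0:
        return 0.0
    loc = float((tp_dists / tau).sum())          # each TP contributes d_i / tau in [0, 1]
    return (loc + n_fp + n_fn) / denom

def olrp(scores, dists, is_tp, n_gt, tau):
    """oLRP: minimum LRP over confidence thresholds (one per detection score)."""
    scores, dists = np.asarray(scores, dtype=float), np.asarray(dists, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    best = 1.0
    for s in np.unique(scores):
        keep = scores >= s                        # detections surviving this threshold
        tp_d = dists[keep & is_tp]                # localization errors of kept TPs
        n_fp = int(np.sum(keep & ~is_tp))
        n_fn = n_gt - len(tp_d)                   # unmatched ground-truth keypoints
        best = min(best, lrp(tp_d, n_fp, n_fn, tau))
    return best
```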
3. Semantically-Aligned and Human-Centric Metrics
Semantic interpretability is not captured by purely geometric criteria. The Keypoint Autoencoder framework introduces two subjective yet rigorously operationalized metrics:
- Semantic Accuracy: Measures, via human ratings, how many detected keypoints correspond to semantically meaningful object parts (e.g., leg end, wing tip) (Shi et al., 2020).
- Semantic Richness: Measures the coverage of distinct semantic parts by the detected keypoints. High richness implies distributed selection across the object’s functional subcomponents, rather than clustering on a single part.
Mathematically, both are computed as Mean Opinion Scores:
$$\mathrm{MOS}_d = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} s^{(d)}_{i,j},$$
where $s^{(d)}_{i,j}$ is the user score for test example $i$ and rater $j$ under detector $d$, with $N$ test clouds and $M$ raters.
These metrics, while human-driven and domain-specific, directly evaluate the high-level utility or interpretability of keypoints for downstream semantic applications. In comparative evaluations, semantically-aware methods (e.g., KAE, AC-KAE) exhibit substantially higher semantic accuracy and richness than repeatability-optimized baselines (Shi et al., 2020).
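For clarity, the MOS computation reduces to averaging a ratings matrix per detector; a small sketch with assumed array shapes:

```python
# MOS sketch: average user scores over test examples (rows) and raters (columns).
import numpy as np

def mean_opinion_score(ratings: np.ndarray) -> float:
    """ratings: shape (num_test_clouds, num_raters) for one detector."""
    return float(np.mean(ratings))

ratings_detector = np.array([[4, 5, 4],   # rows: test point clouds
                             [3, 4, 4]])  # cols: human raters
print(mean_opinion_score(ratings_detector))
```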
4. Task-Specific, Phenotype-Aware, and Anatomy-Guided Metrics
General-purpose keypoint criteria can be inadequate when downstream measurement sensitivity is highly local or phenotype-dependent, as in fish or anatomical morphometry.
- Percentage of Measured Phenotype (PMP): Defined for each keypoint $k$ as the fraction of instances whose localization error, normalized by the length of the phenotype it measures, falls under a task-meaningful threshold (e.g., 10% of eye diameter) (Liu et al., 21 May 2024):
$$\mathrm{PMP}_k = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\frac{\lVert \hat{p}_{i,k} - p_{i,k}\rVert}{\ell_i} < \tau\right],$$
where $\hat{p}_{i,k}$ and $p_{i,k}$ are the predicted and ground-truth locations for instance $i$, $\ell_i$ is the associated phenotype length, and $\tau$ is the task threshold.
Unlike PCK or OKS, PMP normalizes the error of each landmark by the actual phenotypic measurement and is applied per keypoint, making it sensitive to small, critical structures. Empirically, selecting models by highest PMP halves the error on tiny landmarks compared to PCK/OKS-based selection (a per-keypoint computation is sketched below).
- Anatomically-Calibrated Regularization (ACR) is a complementary loss for training, enforcing that predictions land within anatomically plausible regions, further improving PMP (Liu et al., 21 May 2024).
Significance: These approaches exemplify the need for task-specific evaluation, where the metric’s normalization and thresholding reflect the measurement purpose, not just generic object scale.
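A minimal per-keypoint PMP sketch consistent with the definition above; array shapes, names, and the 10% threshold are assumptions for illustration:

```python
# PMP sketch: fraction of instances whose phenotype-normalized error is below a threshold.
import numpy as np

def pmp_per_keypoint(pred, gt, phenotype_len, threshold=0.10):
    """pred, gt: (N, K, 2) keypoint arrays; phenotype_len: (N, K) lengths per landmark."""
    err = np.linalg.norm(pred - gt, axis=-1)            # (N, K) pixel errors
    rel_err = err / phenotype_len                        # normalize by phenotype length
    return (rel_err < threshold).mean(axis=0) * 100.0    # PMP per keypoint, in percent
```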
5. Robust, Interpretable Keypoint Quality Scoring
Direct interpretability and comparability of individual keypoint qualities are promoted by robust statistical frameworks such as GMM-IKRS (Santellani et al., 30 Aug 2024):
- Robustness Score: Empirical detection probability, i.e., the fraction of warped images in which the keypoint is re-detected within a tight radius.
- Deviation Score: Measures the localization spread of the re-detections, expressed as the diameter of the 3σ confidence disk.
Given keypoints projected from warped images and fitted by a robust GMM, sorting by descending robustness and ascending deviation provides a 2D interpretable quality ordering that is agnostic to detector design. The approach has demonstrated boosts in repeatability, matching scores, and downstream geometric recovery (Santellani et al., 30 Aug 2024).
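A simplified nearest-detection approximation in the spirit of GMM-IKRS (not the authors' robust GMM fit; the fixed radius, single-cluster assumption, and names are illustrative only):

```python
# Sketch of robustness / deviation scoring: count re-detections of a reference
# keypoint across reprojected warped views and measure their spatial spread.
import numpy as np

def robustness_and_deviation(ref_kp, warped_detections, radius=3.0):
    """ref_kp: (2,) reference location; warped_detections: list of (M_i, 2) arrays,
    one per warped view, already reprojected into the reference frame."""
    hits = []
    for dets in warped_detections:
        if len(dets) == 0:
            continue
        d = np.linalg.norm(dets - ref_kp, axis=1)
        if d.min() <= radius:                            # re-detected in this warp
            hits.append(dets[d.argmin()])
    robustness = len(hits) / len(warped_detections)      # empirical detection probability
    if len(hits) < 2:
        return robustness, float("inf")
    spread = (np.asarray(hits) - ref_kp).std(axis=0)     # per-axis std of re-detections
    deviation = 6.0 * float(spread.max())                # ~ diameter of the 3-sigma disk
    return robustness, deviation
```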
6. Geometry-Task-Driven and Theoretically Motivated Metrics
For downstream tasks such as two-view geometry estimation, keypoint utility depends both on repeatability and on expected measurement error under transformations:
- Expected Measurement Error (EME): The expected deviation between a keypoint and its detected locations under random homographies of a given difficulty (Pakulev et al., 24 Mar 2025).
- Bounded Neural Stability Score (BoNeSS-ST): A keypoint saliency score theoretically grounded in the EME.
This score quantifies both spatial uncertainty and repeatability. BoNeSS-ST is used for selecting or learning keypoints with optimal expected downstream performance. Empirically, such metrics guide detectors to outperform classical approaches, even when traditional repeatability or inlier count metrics fail to discriminate (Pakulev et al., 24 Mar 2025).
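A hedged Monte-Carlo sketch of estimating the EME: sample homographies of a chosen difficulty, warp the image, re-detect, and average the distance from the warped reference point to the nearest detection. `detect_fn` and `sample_homography` are hypothetical callables, not the paper's implementation:

```python
# Monte-Carlo EME sketch (illustrative): expected distance between a keypoint's
# warped location and the nearest detection under random homographies.
import numpy as np
import cv2

def expected_measurement_error(image, ref_kp, detect_fn, sample_homography,
                               difficulty=0.3, n_samples=100):
    h, w = image.shape[:2]
    dists = []
    for _ in range(n_samples):
        H = sample_homography(difficulty)            # assumed 3x3 homography sampler
        warped = cv2.warpPerspective(image, H, (w, h))
        pt = cv2.perspectiveTransform(
            np.asarray(ref_kp, dtype=np.float32).reshape(1, 1, 2), H).reshape(2)
        dets = detect_fn(warped)                      # (M, 2) detected keypoints
        if len(dets):
            dists.append(np.linalg.norm(dets - pt, axis=1).min())
    return float(np.mean(dists)) if dists else float("inf")
```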
7. Sequence- and Temporal-Distance-Based Metrics for Posed Data
In video and articulated pose tasks (e.g., sign language evaluation), framewise and sequence-aligned distance-based metrics are standard (Jiang et al., 8 Oct 2025):
- Average Position Error (APE/MPJPE): Mean error per joint per frame.
- Normalized APE (nAPE): Scale-invariant variant dividing by a reference skeleton measurement.
- DTW-Aligned Mean Joint Error (DTW-MJE): Uses dynamic time warping for optimal alignment between predicted and reference pose sequences, robust to timing shifts.
Empirical studies recommend tuning keypoint selection, normalization, and fill strategies for maximal human-judgement correlation. DTW-MJE is robust for open-ended signing; segment-padded or hands-only variants (as in the pose-evaluation toolkit) further improve evaluation reliability (Jiang et al., 8 Oct 2025).
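A minimal DTW-MJE sketch: the frame cost is the mean per-joint Euclidean error (the APE/MPJPE of a frame pair), and dynamic time warping finds the alignment minimizing the total cost. This is a simplified stand-in under assumed array shapes, not the pose-evaluation toolkit's implementation:

```python
# DTW-aligned mean joint error between a predicted and a reference pose sequence.
import numpy as np

def frame_cost(a, b):
    """Mean per-joint Euclidean error between two single-frame poses of shape (J, D)."""
    return float(np.linalg.norm(a - b, axis=-1).mean())

def dtw_mje(pred, ref):
    """pred, ref: pose sequences of shape (T1, J, D) and (T2, J, D)."""
    T1, T2 = len(pred), len(ref)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            c = frame_cost(pred[i - 1], ref[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Normalize by the longer sequence so the score is comparable across lengths.
    return D[T1, T2] / max(T1, T2)
```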
| Metric Class | Normalization/Scale | Knowledge Source |
|---|---|---|
| Standard AP/mAP | None / per-image | Descriptor/label |
| LRP/oLRP | Geometric/user-tuned | Detector output |
| Semantic Accuracy | Human opinion | User raters |
| PMP | Phenotype length | Annotated anatomy |
| Robustness/Deviation | Warps, pixel spread | GMM fit |
| EME/BoNeSS-ST | Homography noise model | Theory + samples |
| DTW-MJE/nAPE | Skeleton/sequence scale | Pose sequence |
These keypoint-based evaluation metrics address a broad range of criteria, from the geometric to the semantic, from global ranking to local error, and from human-aligned interpretability to task-specific utility. Selecting an appropriate metric requires alignment with the ultimate deployment context, sensitivity to both false positive/negative rates and localization precision, and, in emerging domains, human-driven or anatomy-motivated supervision.