Heatmap-Based Loss (HTC-loss) in Pose Estimation

Updated 17 November 2025
  • Heatmap-Based Loss (HTC-loss) is a method that weights per-pixel errors based on ground-truth heatmap values to focus training on keypoint regions.
  • It leverages convex, monotonically increasing functions—such as linear, power, or exponential—to parameterize spatial weighting and enhance localization.
  • Empirical evaluations on benchmarks like COCO show modest AP improvements with minimal computational overhead during training.

The Heatmap-Based Loss, also referred to as the Heatmap-Weighting Loss or HTC-loss, is an approach to the supervised training of heatmap-based keypoint detection networks that focuses gradient energy around keypoints by weighting per-pixel errors according to the information content of the ground-truth heatmap. Introduced by Li and Xiang in "Lightweight Human Pose Estimation Using Heatmap-Weighting Loss" (Li et al., 2022), HTC-loss generalizes the ubiquitous mean-squared error (MSE) through convex, monotonically increasing functions of the ground-truth heatmap that parameterize the spatial weighting, yielding modest but measurable improvements in detection accuracy for human pose estimation while incurring negligible computational overhead.

1. Mathematical Formulation

Let $J$ denote the number of joint types in the dataset (e.g., $J=17$ for COCO), and let $P_j \in \mathbb{R}^{H\times W}$ be the predicted heatmap for joint $j$, with the corresponding ground-truth heatmap $G_j(u,v)$ constructed as a 2D Gaussian centered at the annotated location $(u_j^*, v_j^*)$:

$$G_j(u,v) = \exp\left(-\frac{(u-u_j^*)^2 + (v-v_j^*)^2}{2\sigma^2}\right)$$

where $\sigma$ controls the spread.
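
For concreteness, the construction can be sketched as follows; this is a minimal illustration (the function name, grid convention, and parameter values are not from the paper):

import numpy as np

def gaussian_heatmap(H, W, u_star, v_star, sigma=2.0):
    """Build a ground-truth heatmap G of shape [H, W] as a 2D Gaussian
    peaked at the annotated keypoint location (u_star, v_star)."""
    u = np.arange(H).reshape(H, 1)      # row coordinates
    v = np.arange(W).reshape(1, W)      # column coordinates
    d2 = (u - u_star) ** 2 + (v - v_star) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example: a 64x48 heatmap for a keypoint annotated at (20, 30)
G = gaussian_heatmap(64, 48, 20, 30, sigma=2.0)
assert abs(G[20, 30] - 1.0) < 1e-9      # peak value is 1 exactly at the keypoint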

To focus loss gradients near significant pixels, HTC-loss uses a convex, monotonic weight-generation function $F: [0,1] \to \mathbb{R}_{\geq 0}$ applied to the ground-truth heatmap value at each pixel:

  • Linear: $F(x) = kx$ ($k>0$)
  • Power: $F(x) = x^\alpha$ ($\alpha > 1$)
  • Exponential: $F(x) = \exp(\beta x)$ ($\beta>0$)

Each pixel's weight is defined as $w_j(u,v) = F(G_j(u,v)) + 1$, so background pixels ($G_j(u,v)\approx 0$) retain weight $1$.
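
As a minimal sketch, the three weighting families and the resulting weight map can be written as follows (parameter values are illustrative, and G is a ground-truth heatmap such as the one generated above):

import numpy as np

# Candidate weight-generation functions F: [0, 1] -> R>=0
F_linear = lambda x, k=1.0: k * x                # linear, k > 0
F_power  = lambda x, alpha=2.0: x ** alpha       # power, alpha > 1
F_exp    = lambda x, beta=1.0: np.exp(beta * x)  # exponential, beta > 0

def weight_map(G, F):
    """Per-pixel weights w(u, v) = F(G(u, v)) + 1."""
    return F(G) + 1.0

W = weight_map(G, F_linear)   # with the linear choice, weights range from ~1 (background) to 2 (keypoint peak)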

HTC-loss is then the weighted sum of squared errors for each joint, averaged across joints:

$$L_{HTC} = \frac{1}{J} \sum_{j=1}^{J} \sum_{u=1}^{H} \sum_{v=1}^{W} w_j(u,v)\,\bigl(P_j(u,v) - G_j(u,v)\bigr)^2$$

Equivalently, in matrix form:

$$L_{HTC} = \frac{1}{J} \sum_{j=1}^J \bigl\langle W_j,\ (P_j-G_j)\circ(P_j-G_j)\bigr\rangle_F$$

where $\circ$ denotes the element-wise (Hadamard) product and $\langle\cdot,\cdot\rangle_F$ is the Frobenius inner product.

2. Algorithm and Implementation Details

The computation of HTC-loss admits both loop-based and vectorized implementations. A loop-based reference implementation (in Python) is:

def htc_loss(P, G, F):
    """Loop-based reference implementation of HTC-loss.

    P, G : arrays of shape [J, H, W] (predicted and ground-truth heatmaps).
    F    : weight-generation function applied to ground-truth values.
    """
    J, H, W = P.shape
    total_loss = 0.0
    for j in range(J):
        loss_j = 0.0
        for u in range(H):
            for v in range(W):
                g = G[j, u, v]
                w = F(g) + 1.0              # per-pixel weight
                diff = P[j, u, v] - g
                loss_j += w * diff * diff
        total_loss += loss_j
    return total_loss / J                   # average over joints
The same computation vectorizes efficiently in frameworks such as PyTorch or TensorFlow; in PyTorch:
W = F(G) + 1   # [J,H,W]
diff2 = (P - G) ** 2
loss = torch.mean(torch.sum(W * diff2, dim=(1,2)))  # Averaged over joints
Back-propagation through HTC-loss requires only the gradient with respect to $P_j(u,v)$:

$$\frac{\partial L}{\partial P_j(u,v)} = \frac{2}{J}\, w_j(u,v)\,\bigl(P_j(u,v) - G_j(u,v)\bigr)$$

Because $w_j(u,v)$ is determined solely by $G$ (the fixed target), standard autodiff subroutines (e.g., autograd) treat HTC-loss as a simple weighted MSE.
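
This can be verified numerically with a quick autograd check; the sketch below uses PyTorch, random stand-in heatmaps, and the linear weighting $F(x)=x$ (all illustrative choices):

import torch

J, H, W = 17, 64, 48
G = torch.rand(J, H, W)                       # stand-in ground-truth heatmaps
P = torch.rand(J, H, W, requires_grad=True)   # predicted heatmaps
weights = G + 1.0                             # w = F(G) + 1 with F(x) = x; depends only on G

loss = torch.mean(torch.sum(weights * (P - G) ** 2, dim=(1, 2)))
loss.backward()

expected = (2.0 / J) * weights * (P.detach() - G)   # analytic gradient (2/J) * w * (P - G)
assert torch.allclose(P.grad, expected, atol=1e-6)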

3. Role in Training Regimen and Regularization

HTC-loss is a drop-in replacement for the standard MSE (unweighted pixel-wise squared error); no additional bespoke regularizers are applied to the keypoint head. Standard weight decay, data augmentation, and optimization strategies (Adam with linear warm-up and stepwise decay) are retained. No auxiliary terms (such as shape or edge constraints) are incorporated into the HTC-loss for human pose estimation.
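
A minimal sketch of how HTC-loss drops into an otherwise standard training step follows; the toy model, hyperparameter values, and tensor shapes are assumptions for illustration, not the paper's exact configuration (warm-up and stepwise decay are omitted):

import torch
import torch.nn as nn

def htc_loss(pred, target):
    """Heatmap-weighting loss with the linear choice F(x) = x, for batched [B, J, H, W] heatmaps."""
    weights = target + 1.0                                        # w = F(G) + 1
    per_joint = torch.sum(weights * (pred - target) ** 2, dim=(-2, -1))
    return per_joint.mean()                                       # average over batch and joints

model = nn.Conv2d(3, 17, kernel_size=3, padding=1)                # toy stand-in for a pose network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

images = torch.rand(4, 3, 64, 48)                                 # dummy augmented batch
gt_heatmaps = torch.rand(4, 17, 64, 48)                           # dummy Gaussian targets

loss = htc_loss(model(images), gt_heatmaps)
optimizer.zero_grad()
loss.backward()
optimizer.step()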

4. Hyperparameter Strategies and Design Choices

The authors of (Li et al., 2022) report empirical gains using the linear mapping $F(x) = x$, which shifts the per-pixel weight from $1$ (background) to $2$ (center of the keypoint). Steeper or more concentrated choices, such as $F(x) = 2x$, $F(x) = x^2$, or $F(x) = \exp(x)$, provide negligible or slightly reduced gains. An empirical rule is to select $F$ such that $\max w = F(1) + 1$ remains in $[1,2]$, to avoid over-concentrating the penalty on the peak pixel, which can impair robust learning of spatial context. This suggests tuning $F$ to the heatmap characteristics of the target domain.
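
The heuristic can be checked directly by evaluating $\max w = F(1) + 1$ for each candidate; a short illustrative sketch:

import math

candidates = {
    "F(x) = x":      lambda x: x,
    "F(x) = 2x":     lambda x: 2 * x,
    "F(x) = x^2":    lambda x: x ** 2,
    "F(x) = exp(x)": lambda x: math.exp(x),
}

for name, F in candidates.items():
    max_w = F(1.0) + 1.0
    print(f"{name:15} max weight = {max_w:.2f}, within [1, 2]: {1.0 <= max_w <= 2.0}")

# Only F(x) = x and F(x) = x^2 keep the peak weight at 2; the steeper choices exceed the suggested range.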

5. Empirical Evaluation and Ablation Analysis

Ablation results on COCO val2017 (input size $256 \times 192$) show the following performance across different choices of $F$:

Weight Function        AP      AP50    AP75    AR
None (vanilla MSE)     65.56   87.36   73.97   71.65
F(x) = x               65.83   87.70   74.06   72.06
F(x) = 2x              65.59   87.37   74.01   71.90
F(x) = x^2             65.65   87.70   73.96   71.81
F(x) = exp(x)          65.70   87.66   73.74   71.79

The HTC-loss model trained with $F(x)=x$ achieves $65.3$ AP on COCO test-dev with a $256 \times 192$ input, compared to $64.1$ AP for a comparable SimpleBaseline + MobileNetV2 model trained with vanilla MSE.

6. Computational and Practical Impact

HTC-loss introduces negligible computational overhead during training, as the per-pixel weighting requires only one evaluation of $F$ and one addition per pixel. There is no reported destabilization of optimization or impact on overall epoch duration; final training time is unchanged, with potential for slightly faster convergence near keypoints in early epochs.

Inference speed and resource usage are unaffected, since HTC-loss operates only during training. The pose estimation network (MobileNetV3 backbone, depthwise deconvolution head, attention, HTC-loss training) attains $55$ FPS on a mobile-class GPU (GTX 1650 Ti) and $18$ FPS on CPU, results commensurate with the original MSE-trained models.

7. Visualizations and Qualitative Outcomes

Figures in (Li et al., 2022) illustrate the shape of $F(x)$, with $F(x)=x$ producing a weight ramp from $1$ to $2$ as $G$ transitions from $0$ to $1$. Weight maps $W(u,v)=1+G(u,v)$ appear as "domes" centered on each keypoint, spatially targeting loss gradients to the regions of highest annotation certainty. Qualitative comparison (Figure 1) shows that HTC-loss yields sharper, more localized keypoint peaks, most notably for small instances and highly articulated persons, whereas the vanilla MSE-trained baseline tends to produce blurrier peaks that lack the precise spatial focus of HTC-loss-optimized networks.
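
A small matplotlib sketch can reproduce the "dome"-shaped weight map for the linear weighting (synthetic Gaussian target; sizes and styling are illustrative):

import numpy as np
import matplotlib.pyplot as plt

H, W = 64, 48
u = np.arange(H).reshape(H, 1)
v = np.arange(W).reshape(1, W)
G = np.exp(-((u - 32) ** 2 + (v - 24) ** 2) / (2 * 2.0 ** 2))   # Gaussian target, sigma = 2
Wmap = 1.0 + G                                                  # linear weighting: W(u, v) = 1 + G(u, v)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(G, cmap="viridis")
axes[0].set_title("Ground-truth heatmap G")
axes[1].imshow(Wmap, cmap="viridis")
axes[1].set_title("Weight map W = 1 + G")
plt.tight_layout()
plt.show()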


In summary, Heatmap-Weighting Loss (HTC-loss) is a straightforward method for enhancing supervision in heatmap-based keypoint detection models. By upweighting errors near ground-truth keypoints via simple convex functions of the heatmap, HTC-loss improves localization accuracy by $+0.2$ to $+0.3$ AP on challenging benchmarks, imposes essentially zero additional resource or computational cost, and is immediately compatible with existing training and optimization pipelines (Li et al., 2022).

References

  • Li and Xiang (2022). "Lightweight Human Pose Estimation Using Heatmap-Weighting Loss."
