- The paper introduces Fitness Done Right, a real-time intelligent system for exercise correction using computer vision for keypoint detection, pose recognition, and error feedback.
- The system utilizes a two-branch CNN to detect 17 keypoints and compares poses to a standard database using weighted Euclidean and angle distances.
- It detects pose errors based on angle thresholds and refined 3D projections, providing real-time corrective advice with reported motion recognition error rates as low as 1.2%.
This paper introduces Fitness Done Right (FDR), a real-time system designed to help individuals perform exercises correctly without a personal trainer. FDR covers human body part detection, exercise pose recognition, error detection, and corrective advice, addressing the risk that exercising without professional guidance leads to unsatisfactory results or injuries.
The FDR system operates through the following stages:
- Keypoint Detection: Employs a two-branch multi-stage Convolutional Neural Network (CNN) to detect 17 keypoints on the human body, such as eyes, nose, shoulders, elbows, wrists, hips, knees, and ankles, as defined in the Microsoft COCO dataset.
- Pose Estimation and Motion Recognition: Generates a representation vector for each image based on the detected keypoints. It then uses weighted Euclidean distance and weighted angle distance to compare a test image with images in a standard database to identify the pose.
- Pose Error Detection and Correction: Detects errors in poses, such as incorrect back alignment during a plank or improper knee bending during a squat, and provides real-time corrective advice displayed on the screen.
Here's a more detailed breakdown:
Keypoint Detection
The paper utilizes a two-branch CNN model, inspired by previous work, to detect human body parts. The model takes a $w\times h$ image $I$ as input and predicts confidence maps of human body parts, denoted as the set $S=\{S_1,\dots,S_J\}$, where $S_j\in \mathbb{R}^{w\times h}$ for each joint $j$ in the set of human body joints $J$. It also predicts human joint associations, denoted as the set $L=\{L_1,\dots,L_C\}$, where $L_c \in \mathbb{R}^{w\times h\times 2}$ for each association $c$ in the set $C$ of human joint associations (e.g., limbs). The computation of $S_j$ is given by:
$S_j(x,y) = \exp\left(-\frac{\|(x,y) - (x_j^*,\, y_j^*)\|^2}{\sigma}\right)$
where:
- $(x, y)$ is the pixel coordinate.
- $(x_j^*, y_j^*)$ is the ground truth position of human body part $j$.
- $\sigma$ controls the spreading rate.
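As a quick illustration, here is a minimal NumPy sketch of this confidence map for a single body part; the function name and the NumPy implementation are illustrative choices, not the paper's code:

```python
import numpy as np

def confidence_map(width, height, gt_x, gt_y, sigma=1.0):
    """Gaussian confidence map S_j peaked at the ground-truth part
    position (gt_x, gt_y); sigma controls the spreading rate."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    sq_dist = (xs - gt_x) ** 2 + (ys - gt_y) ** 2
    return np.exp(-sq_dist / sigma)  # shape (height, width), values in (0, 1]
```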
The values of $S$ and $L$ at iteration $t$ are updated as:
$S^t = \rho^t(I, S^{t-1}, L^{t-1})$
$L^t = \phi^t(I, S^{t-1}, L^{t-1})$
where $\rho^t$ and $\phi^t$ represent the outputs of the two-branch CNN at iteration $t$. The loss functions for training are defined as:
$l_S^t = \sum_{j}\sum_{p=(x,y)} W(p)\cdot \|S_j^t(p) - S_j^*(p)\|^2$
$l_L^t = \sum_{c}\sum_{p=(x,y)} W(p)\cdot \|L_c^t(p) - L_c^*(p)\|^2$
where:
- $S_j^*(p)$ and $L_c^*(p)$ are the ground truth values of $S_j$ and $L_c$ at position $p$.
- $W(p)$ is a binary mask to avoid penalizing true positive predictions.
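The following sketch shows how such a masked loss could be computed for the confidence-map branch, assuming the maps are stacked into arrays; names and shapes are assumptions, and a deep learning framework would normally compute this inside the training loop:

```python
import numpy as np

def masked_l2_loss(pred_maps, gt_maps, mask):
    """Masked L2 loss: sum over parts j and pixels p of
    W(p) * ||S_j^t(p) - S_j^*(p)||^2 (l_L^t is computed analogously).
    pred_maps, gt_maps: (J, H, W); mask: (H, W) binary W(p), zero where
    annotations are missing so unlabeled regions are not penalized."""
    return float(np.sum(mask[None, :, :] * (pred_maps - gt_maps) ** 2))
```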
The Adam (Adaptive Moment Estimation) optimizer is employed to dynamically adjust the learning rate.
To improve the accuracy of keypoint detection, especially for non-upright poses, the paper introduces multi-directional recognition, where the input image is rotated by 90 degrees and re-fed into the inference network. The approach mitigates issues related to training dataset bias towards upright poses. Additionally, a body part filling module is implemented to address challenges caused by overlapping or obscured body parts. This module uses heuristic anthropometry to estimate missing keypoints based on the positions of other detected keypoints.
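A minimal sketch of the multi-directional idea: re-run the detector on 90-degree rotations and keep the most confident orientation. Here `detect_fn` is a hypothetical stand-in for the CNN inference, and sweeping all four orientations is an assumption beyond the single rotation described above:

```python
import numpy as np

def multi_directional_detect(image, detect_fn):
    """Re-run keypoint detection on 90-degree rotations of the image and
    keep the orientation whose detections have the highest mean confidence.
    detect_fn(img) is assumed to return (keypoints, confidences)."""
    best = None
    for k in range(4):  # 0, 90, 180, 270 degrees counterclockwise
        keypoints, conf = detect_fn(np.rot90(image, k))
        score = float(np.mean(conf))
        if best is None or score > best[0]:
            best = (score, k, keypoints)
    return best  # (score, quarter-turns applied, keypoints in rotated frame)
```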
Motion Recognition
The motion recognition stage involves constructing a 52-dimensional representation vector and a 12-dimensional angle feature vector for each image, which are then used to compute weighted Euclidean and angle distances.
The weighted Euclidean distance is calculated as:
$d_E = \frac{\sum_{k=1}^{17} \beta_A^k \left(|x_A^k - x_B^k| + |y_A^k - y_B^k|\right)}{\sum_{k=1}^{17} \beta_A^k}$
where:
- $x_A^k, x_B^k, y_A^k, y_B^k$ represent the coordinates of the $k$-th keypoint in images $A$ and $B$.
- $\beta_A^k$ represents the confidence score of the $k$-th keypoint in image $A$.
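A direct translation of this formula into NumPy might look as follows (a sketch; array shapes and names are assumptions):

```python
import numpy as np

def weighted_euclidean(xy_a, xy_b, beta_a):
    """Weighted Euclidean distance d_E between two poses.
    xy_a, xy_b: (17, 2) keypoint coordinates of images A and B;
    beta_a: (17,) keypoint confidence scores of image A."""
    per_point = np.abs(xy_a - xy_b).sum(axis=1)  # |x_A - x_B| + |y_A - y_B|
    return float((beta_a * per_point).sum() / beta_a.sum())
```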
The weighted angle distance is calculated as:
$d_A = \frac{\sum_{k=1}^{12} \gamma_A^k \left|\angle_A^k - \angle_B^k\right|}{\sum_{k=1}^{12} \gamma_A^k}$
where:
- $\angle_A^k$ and $\angle_B^k$ represent the $k$-th element of the angle feature vectors of images $A$ and $B$, respectively.
- $\gamma_A^k$ is derived from the average confidence scores of the three keypoints forming the $k$-th body joint in image $A$.
The final distance is a weighted combination of the Euclidean and angle distances, with the E-A ratio (Euclidean-to-Angle ratio) determining the relative importance of each.
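A sketch of the angle distance and the final combination; the exact blending formula is not spelled out in the text, so the linear mix below, controlled by an `ea_ratio` parameter, is an assumption:

```python
import numpy as np

def weighted_angle(ang_a, ang_b, gamma_a):
    """Weighted angle distance d_A over the 12-dimensional angle
    feature vectors; gamma_a: (12,) per-joint weights of image A."""
    return float((gamma_a * np.abs(ang_a - ang_b)).sum() / gamma_a.sum())

def combined_distance(d_e, d_a, ea_ratio):
    """Blend the two distances; a higher ea_ratio gives the Euclidean
    term more weight (linear blending is an assumption)."""
    return ea_ratio * d_e + (1.0 - ea_ratio) * d_a
```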
Motion Error Detection
The motion error detection module identifies deviations from standard exercise form. For the plank pose, it checks back alignment using the angle formed at the hips between the hip-shoulder and hip-foot vectors. A plank is considered correct if:
$\angle\left((x_h - x_s,\ y_h - y_s),\ (x_h - x_f,\ y_h - y_f)\right) > T$
where:
- $(x_h, y_h)$, $(x_s, y_s)$, $(x_f, y_f)$ are the coordinates of the hips, shoulders, and feet, respectively.
- $T$ is a threshold angle.
If the angle is below the threshold T, the system determines whether the hips are too high or too low based on the vertical positions of the hips, shoulders, and feet.
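Putting the plank rule together, a hedged sketch might look like this; the threshold value and the image-coordinate convention (y grows downward) are assumptions:

```python
import numpy as np

def plank_feedback(hips, shoulders, feet, threshold_deg=160.0):
    """Check back alignment: the angle at the hips between the
    hip-shoulder and hip-foot vectors must exceed the threshold T.
    threshold_deg is an illustrative value, not the paper's."""
    v1 = np.asarray(shoulders, float) - np.asarray(hips, float)
    v2 = np.asarray(feet, float) - np.asarray(hips, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    if angle > threshold_deg:
        return "correct"
    # In image coordinates, a smaller y means a higher point on screen.
    mid_y = (shoulders[1] + feet[1]) / 2.0
    return "hips too high" if hips[1] < mid_y else "hips too low"
```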
For the squat pose, the system verifies that the knees are bent at approximately π/2 radians and that the body weight is shifted towards the heels. The knee bend is considered correct if:
$\angle\left((x_k - x_h,\ y_k - y_h),\ (x_k - x_f,\ y_k - y_f)\right) \in \frac{\pi}{2} \pm \sigma$
where:
- $(x_k, y_k)$, $(x_h, y_h)$, $(x_f, y_f)$ are the coordinates of the knees, hips, and feet, respectively.
- $\sigma$ is the allowed error range.
The system also assesses body weight distribution by calculating the ratio of the horizontal distance between the hips and heels to the length of the thigh, ensuring it falls within a defined range.
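Both squat checks could be sketched as follows; the error range and ratio bounds are illustrative placeholders, and the heels are approximated by the foot keypoints:

```python
import numpy as np

def squat_feedback(knees, hips, feet, sigma_rad=0.2, ratio_range=(0.4, 0.9)):
    """Squat checks: (1) knee angle within pi/2 +/- sigma;
    (2) horizontal hip-heel distance over thigh length within a range.
    sigma_rad and ratio_range are assumed values for illustration."""
    knees, hips, feet = (np.asarray(p, float) for p in (knees, hips, feet))
    v1, v2 = hips - knees, feet - knees
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    knee_ok = abs(np.arccos(np.clip(cos, -1.0, 1.0)) - np.pi / 2) <= sigma_rad

    thigh_len = np.linalg.norm(hips - knees)
    ratio = abs(hips[0] - feet[0]) / thigh_len
    weight_ok = ratio_range[0] <= ratio <= ratio_range[1]
    return knee_ok, weight_ok
```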
To account for variations in perspective due to 2D projections of 3D poses, the paper employs a least-squares approximation to project 2D coordinates back into 3D space, refine the pose, and then project back to 2D for more accurate angle calculations. The projection matrix $T$ is calculated using:
$T = \left(\hat{P}_{2D}^{\mathsf{T}} \hat{P}_{2D}\right)^{-1} \hat{P}_{2D}^{\mathsf{T}} P_{3D}$
where:
- $P_{3D}$ represents the 3D coordinates of the human back.
- $\hat{P}_{2D}$ represents the augmented 2D coordinates of the human back.
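This is the standard normal-equation solution of a linear least-squares fit. A sketch, assuming "augmented" means appending a homogeneous ones column to the 2D coordinates:

```python
import numpy as np

def fit_projection(p2d, p3d):
    """Least-squares fit of T = (P2D_hat^T P2D_hat)^{-1} P2D_hat^T P3D.
    p2d: (n, 2) detected 2D back keypoints; p3d: (n, 3) reference 3D
    back coordinates. The ones-column augmentation is an assumption."""
    p2d_hat = np.hstack([p2d, np.ones((p2d.shape[0], 1))])  # (n, 3)
    # lstsq solves the same normal equations with better numerical stability.
    T, *_ = np.linalg.lstsq(p2d_hat, p3d, rcond=None)
    return T  # (3, 3): maps [x, y, 1] -> [X, Y, Z]
```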
Experimental Results
The system's performance was evaluated through experiments using a standard database of 200 images, including 90 plank poses and 110 squat poses. The choice of the E-A ratio was found to affect motion detection accuracy: higher E-A ratios (i.e., more weight on the Euclidean distance) led to lower error rates in plank detection. The motion recognition error rate was reported to be as low as 1.2%.