- The paper introduces Fitness Done Right, a real-time intelligent system for exercise correction using computer vision for keypoint detection, pose recognition, and error feedback.
- The system utilizes a two-branch CNN to detect 17 keypoints and compares poses to a standard database using weighted Euclidean and angle distances.
- It detects pose errors based on angle thresholds and refined 3D projections, providing real-time corrective advice with reported motion recognition error rates as low as 1.2%.
This paper introduces Fitness Done Right (FDR), a real-time system designed to help individuals perform exercises correctly without a personal trainer. FDR covers human body part detection, exercise pose recognition, error detection, and corrective advice, addressing the risk that exercising without professional guidance leads to unsatisfactory results or injuries.
The FDR system operates through the following stages:
- Keypoint Detection: Employs a two-branch multi-stage Convolutional Neural Network (CNN) to detect 17 keypoints on the human body, such as eyes, nose, shoulders, elbows, wrists, hips, knees, and ankles, as defined in the Microsoft COCO dataset.
- Pose Estimation and Motion Recognition: Generates a representation vector for each image based on the detected keypoints. It then uses weighted Euclidean distance and weighted angle distance to compare a test image with images in a standard database to identify the pose.
- Pose Error Detection and Correction: Detects errors in poses, such as incorrect back alignment during a plank or improper knee bending during a squat, and provides real-time corrective advice displayed on the screen.
Here's a more detailed breakdown:
Keypoint Detection
The paper utilizes a two-branch CNN model, inspired by previous work, to detect human body parts. The model takes a $w\times h$ image $I$ as input and predicts confidence maps of human body parts, denoted as the set $S=\{S_1,\dots,S_J\}$, where $S_j\in \mathbb{R}^{w\times h}$ for each joint $j$ in the set of human body joints $J$. It also predicts human joint associations, denoted as the set $L=\{L_1,\dots,L_C\}$, where $L_c \in \mathbb{R}^{w\times h\times 2}$ for each association $c$ in the set $C$ of human joint associations (e.g., limbs). The computation of $S_j$ is given by:
$S_j(x,y) = \exp\left(-\frac{\|(x,y) - (x_j^*,\, y_j^*)\|^2}{\sigma}\right)$
where:
- $(x, y)$ is the pixel coordinate.
- $(x_j^*, y_j^*)$ is the ground truth position of human body part $j$.
- $\sigma$ controls the spreading rate.
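As a quick illustration, here is a minimal NumPy sketch of this confidence map for a single body part; the function name and the NumPy implementation are illustrative choices, not the paper's code:

```python
import numpy as np

def confidence_map(width, height, gt_x, gt_y, sigma=1.0):
    """Gaussian confidence map S_j peaked at the ground-truth part
    position (gt_x, gt_y); sigma controls the spreading rate."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    sq_dist = (xs - gt_x) ** 2 + (ys - gt_y) ** 2
    return np.exp(-sq_dist / sigma)  # shape (height, width), values in (0, 1]
```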
The values of $S$ and $L$ at iteration $t$ are updated as:
$S^t = \rho^t(I, S^{t-1}, L^{t-1})$
$L^t = \phi^t(I, S^{t-1}, L^{t-1})$
where $\rho^t$ and $\phi^t$ represent the outputs of the two-branch CNN at iteration $t$. The loss functions for training are defined as:
$l_S^t = \sum_{j}\sum_{p=(x,y)} W(p)\cdot \|S_j^t(p) - S_j^*(p)\|^2$
$l_L^t = \sum_{c}\sum_{p=(x,y)} W(p)\cdot \|L_c^t(p) - L_c^*(p)\|^2$
where:
- $S_j^*(p)$ and $L_c^*(p)$ are the ground truth values of $S_j$ and $L_c$ at position $p$.
- $W(p)$ is a binary mask to avoid penalizing true positive predictions.
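The following sketch shows how such a masked loss could be computed for the confidence-map branch, assuming the maps are stacked into arrays; names and shapes are assumptions, and a deep learning framework would normally compute this inside the training loop:

```python
import numpy as np

def masked_l2_loss(pred_maps, gt_maps, mask):
    """Masked L2 loss: sum over parts j and pixels p of
    W(p) * ||S_j^t(p) - S_j^*(p)||^2 (l_L^t is computed analogously).
    pred_maps, gt_maps: (J, H, W); mask: (H, W) binary W(p), zero where
    annotations are missing so unlabeled regions are not penalized."""
    return float(np.sum(mask[None, :, :] * (pred_maps - gt_maps) ** 2))
```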
The Adam (Adaptive Moment Estimation) optimizer is employed to dynamically adjust the learning rate.
To improve the accuracy of keypoint detection, especially for non-upright poses, the paper introduces multi-directional recognition, where the input image is rotated by 90 degrees and re-fed into the inference network. The approach mitigates issues related to training dataset bias towards upright poses. Additionally, a body part filling module is implemented to address challenges caused by overlapping or obscured body parts. This module uses heuristic anthropometry to estimate missing keypoints based on the positions of other detected keypoints.
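A minimal sketch of the multi-directional idea: re-run the detector on 90-degree rotations and keep the most confident orientation. Here `detect_fn` is a hypothetical stand-in for the CNN inference, and sweeping all four orientations is an assumption beyond the single rotation described above:

```python
import numpy as np

def multi_directional_detect(image, detect_fn):
    """Re-run keypoint detection on 90-degree rotations of the image and
    keep the orientation whose detections have the highest mean confidence.
    detect_fn(img) is assumed to return (keypoints, confidences)."""
    best = None
    for k in range(4):  # 0, 90, 180, 270 degrees counterclockwise
        keypoints, conf = detect_fn(np.rot90(image, k))
        score = float(np.mean(conf))
        if best is None or score > best[0]:
            best = (score, k, keypoints)
    return best  # (score, quarter-turns applied, keypoints in rotated frame)
```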
Motion Recognition
The motion recognition stage involves constructing a 52-dimensional representation vector and a 12-dimensional angle feature vector for each image, which are then used to compute weighted Euclidean and angle distances.
The weighted Euclidean distance is calculated as:
$d_E = \frac{\sum_{k=1}^{17} \beta_A^k \left(|x_A^k - x_B^k| + |y_A^k - y_B^k|\right)}{\sum_{k=1}^{17} \beta_A^k}$
where:
- $x_A^k, x_B^k, y_A^k, y_B^k$ represent the coordinates of the $k$-th keypoint in images $A$ and $B$.
- $\beta_A^k$ represents the confidence score of the $k$-th keypoint in image $A$.
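A direct translation of this formula into NumPy might look as follows (a sketch; array shapes and names are assumptions):

```python
import numpy as np

def weighted_euclidean(xy_a, xy_b, beta_a):
    """Weighted Euclidean distance d_E between two poses.
    xy_a, xy_b: (17, 2) keypoint coordinates of images A and B;
    beta_a: (17,) keypoint confidence scores of image A."""
    per_point = np.abs(xy_a - xy_b).sum(axis=1)  # |x_A - x_B| + |y_A - y_B|
    return float((beta_a * per_point).sum() / beta_a.sum())
```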
The weighted angle distance is calculated as:
$d_A = \frac{\sum_{k=1}^{12} \gamma_A^k \left|\angle_A^k - \angle_B^k\right|}{\sum_{k=1}^{12} \gamma_A^k}$
where:
- $\angle_A^k$ and $\angle_B^k$ represent the $k$-th element of the angle feature vectors of images $A$ and $B$, respectively.
- $\gamma_A^k$ is derived from the average confidence scores of the three keypoints forming the $k$-th body joint in image $A$.
The final distance is a weighted combination of the Euclidean and angle distances, with the E-A ratio (Euclidean-to-Angle ratio) determining the relative importance of each.
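A sketch of the angle distance and the final combination; the exact blending formula is not spelled out in the text, so the linear mix below, controlled by an `ea_ratio` parameter, is an assumption:

```python
import numpy as np

def weighted_angle(ang_a, ang_b, gamma_a):
    """Weighted angle distance d_A over the 12-dimensional angle
    feature vectors; gamma_a: (12,) per-joint weights of image A."""
    return float((gamma_a * np.abs(ang_a - ang_b)).sum() / gamma_a.sum())

def combined_distance(d_e, d_a, ea_ratio):
    """Blend the two distances; a higher ea_ratio gives the Euclidean
    term more weight (linear blending is an assumption)."""
    return ea_ratio * d_e + (1.0 - ea_ratio) * d_a
```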
Motion Error Detection
The motion error detection module identifies deviations from standard exercise form. For the plank pose, it checks back alignment using the angle formed at the hips between the hip-shoulder and hip-foot vectors. A plank is considered correct if:
$\angle\left((x_h - x_s,\ y_h - y_s),\ (x_h - x_f,\ y_h - y_f)\right) > T$
where:
- $(x_h, y_h)$, $(x_s, y_s)$, $(x_f, y_f)$ are the coordinates of the hips, shoulders, and feet, respectively.
- $T$ is a threshold angle.
If the angle is below the threshold T, the system determines whether the hips are too high or too low based on the vertical positions of the hips, shoulders, and feet.
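Putting the plank rule together, a hedged sketch might look like this; the threshold value and the image-coordinate convention (y grows downward) are assumptions:

```python
import numpy as np

def plank_feedback(hips, shoulders, feet, threshold_deg=160.0):
    """Check back alignment: the angle at the hips between the
    hip-shoulder and hip-foot vectors must exceed the threshold T.
    threshold_deg is an illustrative value, not the paper's."""
    v1 = np.asarray(shoulders, float) - np.asarray(hips, float)
    v2 = np.asarray(feet, float) - np.asarray(hips, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    if angle > threshold_deg:
        return "correct"
    # In image coordinates, a smaller y means a higher point on screen.
    mid_y = (shoulders[1] + feet[1]) / 2.0
    return "hips too high" if hips[1] < mid_y else "hips too low"
```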
For the squat pose, the system verifies that the knees are bent at approximately π/2 radians and that the body weight is shifted towards the heels. The knee bend is considered correct if:
$\angle\left((x_k - x_h,\ y_k - y_h),\ (x_k - x_f,\ y_k - y_f)\right) \in \frac{\pi}{2} \pm \sigma$
where:
- $(x_k, y_k)$, $(x_h, y_h)$, $(x_f, y_f)$ are the coordinates of the knees, hips, and feet, respectively.
- $\sigma$ is the allowed error range.
The system also assesses body weight distribution by calculating the ratio of the horizontal distance between the hips and heels to the length of the thigh, ensuring it falls within a defined range.
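Both squat checks could be sketched as follows; the error range and ratio bounds are illustrative placeholders, and the heels are approximated by the foot keypoints:

```python
import numpy as np

def squat_feedback(knees, hips, feet, sigma_rad=0.2, ratio_range=(0.4, 0.9)):
    """Squat checks: (1) knee angle within pi/2 +/- sigma;
    (2) horizontal hip-heel distance over thigh length within a range.
    sigma_rad and ratio_range are assumed values for illustration."""
    knees, hips, feet = (np.asarray(p, float) for p in (knees, hips, feet))
    v1, v2 = hips - knees, feet - knees
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    knee_ok = abs(np.arccos(np.clip(cos, -1.0, 1.0)) - np.pi / 2) <= sigma_rad

    thigh_len = np.linalg.norm(hips - knees)
    ratio = abs(hips[0] - feet[0]) / thigh_len
    weight_ok = ratio_range[0] <= ratio <= ratio_range[1]
    return knee_ok, weight_ok
```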
To account for variations in perspective due to 2D projections of 3D poses, the paper employs a least-squares approximation to project 2D coordinates back into 3D space, refine the pose, and then project back to 2D for more accurate angle calculations. The projection matrix $T$ is calculated using:
$T = \left(\hat{P}_{2D}^{\mathsf{T}} \hat{P}_{2D}\right)^{-1} \hat{P}_{2D}^{\mathsf{T}} P_{3D}$
where:
- $P_{3D}$ represents the 3D coordinates of the human back.
- $\hat{P}_{2D}$ represents the augmented 2D coordinates of the human back.
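This is the standard normal-equation solution of a linear least-squares fit. A sketch, assuming "augmented" means appending a homogeneous ones column to the 2D coordinates:

```python
import numpy as np

def fit_projection(p2d, p3d):
    """Least-squares fit of T = (P2D_hat^T P2D_hat)^{-1} P2D_hat^T P3D.
    p2d: (n, 2) detected 2D back keypoints; p3d: (n, 3) reference 3D
    back coordinates. The ones-column augmentation is an assumption."""
    p2d_hat = np.hstack([p2d, np.ones((p2d.shape[0], 1))])  # (n, 3)
    # lstsq solves the same normal equations with better numerical stability.
    T, *_ = np.linalg.lstsq(p2d_hat, p3d, rcond=None)
    return T  # (3, 3): maps [x, y, 1] -> [X, Y, Z]
```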
Experimental Results
The system's performance was evaluated through experiments using a standard database of 200 images, including 90 plank poses and 110 squat poses. The choice of the E-A ratio was found to affect motion detection accuracy: higher E-A ratios (i.e., more weight on the Euclidean distance) led to lower error rates in plank detection. The motion recognition error rate was reported to be as low as 1.2%.