Estimating Blink Probability for Highlight Detection in Figure Skating Videos (2007.01089v1)

Published 2 Jul 2020 in cs.CV and cs.MM

Abstract: Highlight detection in sports videos has a broad viewership and huge commercial potential. It is thus imperative to detect highlight scenes more suitably for human interest with high temporal accuracy. Since people instinctively suppress blinks during attention-grabbing events and synchronously generate blinks at attention break points in videos, the instantaneous blink rate can be utilized as a highly accurate temporal indicator of human interest. Therefore, in this study, we propose a novel, automatic highlight detection method based on the blink rate. The method trains a one-dimensional convolution network (1D-CNN) to assess blink rates at each video frame from the spatio-temporal pose features of figure skating videos. Experiments show that the method successfully estimates the blink rate in 94% of the video clips and predicts the temporal change in the blink rate around a jump event with high accuracy. Moreover, the method detects not only the representative athletic action, but also the distinctive artistic expression of figure skating performance as key frames. This suggests that the blink-rate-based supervised learning approach enables high-accuracy highlight detection that more closely matches human sensibility.

Citations (6)

Summary

  • The paper introduces a novel method that leverages viewer blink probability, derived from pose data, to identify highlights in figure skating videos.
  • It employs a 1D-CNN architecture trained on sliding windows of OpenPose-generated joint positions and synchronized blink data for high temporal accuracy.
  • The approach enables automated sports highlight generation and objective performance analysis, contingent on accurate pose estimation and viewer data.

This paper (2007.01089) proposes a novel method for automatically detecting highlights in figure skating videos based on estimating the probability of viewers blinking at each moment. The core idea is that viewers tend to suppress blinks during highly engaging scenes and blink synchronously at moments of decreased attention, making blink rate a potential indicator of human interest with high temporal accuracy.

The method involves the following practical steps:

  1. Data Acquisition and Preparation:
    • Video Data: Collect figure skating performance videos (e.g., competition footage).
    • Blink Data (Ground Truth): This step is crucial for creating training data. Measure the actual blink rate of a group of viewers while they watch the videos, typically with eye-tracking technology (such as the near-infrared eye trackers mentioned in the paper) that monitors pupil diameter to detect blinks for each viewer. The blink rate for a given video frame is the percentage of viewers who blinked at that frame. This process is time-consuming and requires specialized equipment and participants.
    • Pose Estimation: For each video frame, estimate the 2D joint positions of the skater. The paper uses OpenPose (1812.08008) to detect 18 body joints. Since videos may contain multiple people (skaters, judges, audience), implement logic to identify the primary subject (the skater). The paper uses the heuristic of selecting the person with the largest body area in the frame. Implement filtering for low-confidence joint detections (e.g., confidence score < 0.7) and interpolate missing or filtered data points to maintain continuous pose sequences.
    • Input Feature Creation: Construct input samples for the model by creating sliding windows of pose data over time. The paper uses a window size of 3 seconds (90 frames at 30 FPS). For each window, the (X, Y) coordinates of the 18 joints across 90 frames are flattened into a 90 × 36 matrix. Each matrix serves as one input sample, and the target blink rate is the measured blink rate at the last frame of this 90-frame window. Slide this window typically with a step of 1 second (30 frames) to generate training samples covering the entire video.
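The windowing step above can be sketched as follows. This is a minimal NumPy sketch, assuming `poses` (a frames × 36 array of already-filtered and interpolated joint coordinates) and `blink_rates` (the measured per-frame blink rates) have been precomputed; the function name is illustrative, not from the paper:

```python
import numpy as np

def make_windows(poses, blink_rates, fps=30, win_sec=3, step_sec=1):
    """Slice per-frame pose features into sliding windows.

    poses:       (n_frames, 36) array of (X, Y) coordinates for 18 joints
    blink_rates: (n_frames,) measured blink rate per frame (ground truth)
    Returns X of shape (n_windows, 90, 36) and y of shape (n_windows,),
    where each target is the blink rate at the window's last frame.
    """
    win, step = fps * win_sec, fps * step_sec
    X, y = [], []
    for start in range(0, len(poses) - win + 1, step):
        X.append(poses[start:start + win])      # one 90 x 36 input sample
        y.append(blink_rates[start + win - 1])  # blink rate at last frame
    return np.stack(X), np.array(y)
```

With a 1-second step this yields one training sample per second of video; for dense inference the step can be reduced to a single frame.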

  2. Model Architecture and Training:
    • Model: A simple one-dimensional convolutional neural network (1D-CNN) takes the 90 × 36 pose matrix as input. It consists of three 1D convolutional layers with kernel size 8 and filters (64, 128, 64), followed by Batch Normalization, an Average Pooling layer, a Flatten layer, and a fully connected layer that outputs a single value representing the estimated blink probability (between 0 and 1).
    • Implementation: Implement the CNN in a deep learning framework such as TensorFlow or PyTorch.
    • Training Process: Train the 1D-CNN using the collected pose feature windows as input and the measured blink rate at each window's last frame as the target output.
    • Loss Function: Mean squared error (MSE) or root mean squared error (RMSE) between the estimated and actual blink rate.
    • Optimizer: Adam (1412.6980) with a learning rate of 0.001.
    • Training Strategy: The paper used leave-one-out cross-validation on its dataset of 48 video clips, training on 47 and testing on 1, rotating through all clips. This suits small datasets but is computationally intensive; for larger datasets, a standard train/validation/test split would be more practical.
    • Hyperparameters: Batch size of 4096, maximum of 100 epochs.
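The architecture can be sketched in PyTorch as below. The kernel size (8), filter counts (64, 128, 64), and overall layer sequence follow the paper's description; the ReLU activations, pooling size (2), and sigmoid output used to bound the estimate to [0, 1] are assumptions made to produce a runnable sketch:

```python
import torch
import torch.nn as nn

class BlinkRateCNN(nn.Module):
    """1D-CNN mapping a 90 x 36 pose window to a blink probability."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(36, 64, kernel_size=8), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8), nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=8), nn.ReLU(),
            nn.BatchNorm1d(64),
            nn.AvgPool1d(kernel_size=2),
        )
        # sequence length: 90 -> 83 -> 76 -> 69 -> 34 after pooling
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 34, 1),
            nn.Sigmoid(),  # assumption: bounds the output to [0, 1]
        )

    def forward(self, x):       # x: (batch, 90, 36)
        x = x.permute(0, 2, 1)  # Conv1d expects (batch, channels, length)
        return self.head(self.features(x)).squeeze(-1)
```

Training then reduces to a standard regression loop with `nn.MSELoss()` and `torch.optim.Adam(model.parameters(), lr=0.001)`.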

  3. Highlight Detection (Inference and Thresholding):
    • Estimation: Once the model is trained, process a new video frame by frame using the same sliding window approach (e.g., with a step of 1 frame for dense estimation). For each window, feed the pose features to the trained 1D-CNN to get the estimated blink probability rate for the last frame of the window. This generates a time series of estimated blink rates for the entire video.
    • Thresholding: Identify highlights based on low estimated blink probability. The paper defines a highlight scene as a period where the estimated blink probability is at least two standard deviations below the mean estimated blink rate calculated over a short window (e.g., 5 frames).
      • Calculate the mean and standard deviation of the estimated blink rate over the entire video or a segment.
      • Set a threshold, e.g., Threshold = Mean - 2 * StdDev.
      • Mark frames or segments where the estimated blink rate falls below this threshold as potential highlights.
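The thresholding rule above can be sketched in a few lines of NumPy, assuming `blink_est` is the per-frame series of estimated blink rates produced by the trained model (the function name is illustrative):

```python
import numpy as np

def detect_highlights(blink_est, k=2.0):
    """Flag frames whose estimated blink rate falls at least k standard
    deviations below the mean estimated rate (k=2 follows the paper)."""
    mean, std = blink_est.mean(), blink_est.std()
    threshold = mean - k * std
    return np.flatnonzero(blink_est < threshold)  # indices of highlight frames
```

Consecutive flagged frames can then be merged into highlight segments; lowering `k` yields more, shorter highlights, which is where the empirical tuning discussed below comes in.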

Implementation Considerations:

  • Computational Resources: Pose estimation (OpenPose) is computationally intensive, requiring a GPU. The 1D-CNN inference is relatively lightweight once trained but processing long videos frame-by-frame can still take time.
  • Data Requirements: Creating the training dataset requires significant effort to collect simultaneous video and viewer blink data. This is the main bottleneck for applying this method to new domains.
  • Pose Estimation Accuracy: The performance of the highlight detection is dependent on the accuracy of the initial pose estimation. Challenges like occlusions, complex movements, or non-standard camera angles can affect pose accuracy.
  • Domain Specificity: The model is trained on figure skating and correlates poses with blink rates specific to viewing figure skating. Applying it to other sports or content would likely require retraining the model on relevant data.
  • Threshold Tuning: The threshold for highlight detection (e.g., mean - 2*std dev) might need empirical tuning based on the desired sensitivity and density of highlights.

Real-World Applications:

  • Automated Sports Highlight Generation: Automatically create highlight reels for figure skating competitions for broadcasters, online platforms, or social media.
  • Content Analysis: Provide objective insights into which moments in a performance capture the most viewer attention, potentially useful for coaches, choreographers, or athletes.
  • Video Summarization: Create concise summaries of performances by selecting the detected highlight segments.
  • Potential for Other Content: If appropriate training data can be gathered, the concept could potentially be extended to detect engaging moments in other types of performance or presentation videos where visual attention is key.

The paper successfully demonstrates that a relatively simple 1D-CNN, trained on the relationship between skater pose dynamics and viewer blink rates, can effectively identify highlight moments in figure skating, including not only technical actions but also artistic expressions that engage viewers. This supervised learning approach, using blink rate as a proxy for attention, offers an alternative to methods based purely on video features or subjective annotations, with the potential for higher temporal accuracy reflecting viewer interest.