- The paper proposes TW-FINCH, an unsupervised method that detects semantically coherent action boundaries in untrimmed videos.
- It employs a 1-nearest neighbor graph with temporal weighting to hierarchically partition video frames while preserving sequential order.
- TW-FINCH outperforms prior unsupervised methods across five benchmarks, e.g., a 10.5% MoF gain on Breakfast and a 48.2% F1-score on Inria Instructional Videos, while requiring no training or annotation.
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation
The paper presents an unsupervised method for action segmentation in untrimmed videos, addressing the challenge of identifying semantically consistent action boundaries without requiring training or detailed annotations. The approach centers on a novel temporally-weighted hierarchical clustering algorithm that effectively groups video frames into coherent actions.
Methodological Insights
The central contribution is the proposed Temporally-Weighted First-NN Clustering Hierarchy (TW-FINCH), which represents the frames of a video as a 1-nearest-neighbor graph that factors in temporal progression. The graph captures both visual and temporal proximity by modulating pairwise frame-feature distances with the frames' temporal distance, yielding a sparse adjacency graph. Clusters are identified as the connected components of this graph, and applying the step recursively produces a hierarchical partitioning that reflects action boundaries.
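To make the graph construction concrete, below is a minimal sketch of one temporally-weighted first-neighbor pass, assuming frame features are rows of a NumPy array and temporal positions are normalized to [0, 1]. Function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def tw_first_nn_partition(X, t):
    """One temporally-weighted first-neighbor pass.

    X : (N, D) array of frame features (or cluster representatives).
    t : (N,) array of temporal positions, normalized to [0, 1].
    Returns integer cluster labels in 0..K-1 for the N rows.
    """
    N = len(X)
    # Pairwise feature distances, modulated by pairwise temporal distance.
    feat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    time = np.abs(t[:, None] - t[None, :])
    dist = feat * time
    np.fill_diagonal(dist, np.inf)   # a frame cannot be its own neighbor
    nn = dist.argmin(axis=1)         # first (nearest) neighbor of each frame
    # Symmetric 1-NN adjacency; its connected components are the clusters.
    rows = np.arange(N)
    A = csr_matrix((np.ones(N), (rows, nn)), shape=(N, N))
    _, labels = connected_components(A + A.T, directed=False)
    return labels
```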
A distinctive aspect is the integration of temporal coherence, which is crucial for preserving action order within untrimmed video sequences. Because the algorithm is applied recursively to cluster representatives (see the driver sketched below), it yields partitions at multiple granularities, letting the desired number of segments be chosen at application time without any training data.
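Continuing the sketch above, a hypothetical recursive driver repeats the first-neighbor pass on cluster representatives (feature means paired with mean temporal positions) until at most the requested number of segments remains; the paper's exact procedure for reaching a required cluster count may differ.

```python
def tw_finch(X, t, k):
    """Hypothetical recursive driver: repeat the first-neighbor pass on
    cluster representatives until at most k segments remain."""
    labels = np.arange(len(X))       # start with one cluster per frame
    while True:
        clusters = np.unique(labels)
        if len(clusters) <= k:
            return labels            # frame-level segmentation
        # Represent each cluster by its feature mean and mean temporal position.
        reps = np.stack([X[labels == c].mean(axis=0) for c in clusters])
        times = np.array([t[labels == c].mean() for c in clusters])
        labels = tw_first_nn_partition(reps, times)[labels]
```

Since every node in a symmetric 1-NN graph is merged with at least one other, the cluster count at least halves per pass, so only a few passes are needed even for long videos.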
Strong Numerical Results
TW-FINCH demonstrates significant performance improvements over existing unsupervised methods across five benchmark datasets: Breakfast, Inria Instructional Videos, 50Salads, MPII Cooking 2, and Hollywood Extended. For instance, on the Breakfast dataset, TW-FINCH achieves a 10.5% absolute MoF gain over VTE-UNET, the best previously reported unsupervised method.
Further, the method surpasses both baseline clustering approaches and weakly-supervised methods. It consistently yields better segmentation, predicting action lengths accurately and maintaining sequential coherence even on datasets dominated by background content, such as Inria Instructional Videos, where it achieves a 48.2% F1-score.
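Both MoF and F1 are computed only after matching predicted cluster IDs to ground-truth action labels, conventionally via one-to-one Hungarian matching. The helper below is a minimal, illustrative sketch of the MoF side of that protocol, not the authors' evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mof(pred, gt):
    """Mean-over-Frames: match predicted clusters one-to-one to ground-truth
    actions with the Hungarian algorithm, then score frame-wise accuracy."""
    preds, gts = np.unique(pred), np.unique(gt)
    # Overlap matrix: number of frames shared by each (cluster, action) pair.
    overlap = np.array([[np.sum((pred == p) & (gt == g)) for g in gts]
                        for p in preds])
    row, col = linear_sum_assignment(-overlap)  # maximize matched overlap
    return overlap[row, col].sum() / len(gt)
```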
Implications and Speculation
Practically, TW-FINCH eliminates the need for time-consuming annotation and model training, making it readily applicable to real-world video analysis, where data varies widely. The unsupervised approach provides a robust solution for temporal segmentation and suggests natural extensions to more complex scenarios involving multi-label or composite actions.
Theoretically, the work suggests that, given a discriminative visual representation, clustering-based methods can achieve compelling results with minimal supervision. This insight opens pathways for further exploration of unsupervised techniques in video understanding and segmentation tasks, potentially influencing future developments in AI where model adaptability and training overhead reduction are key objectives.
Future research could extend the method to multi-view video data or integrate additional temporal cues, such as audio-visual synchronization, to further refine action segmentation across diverse video contexts.