Generic Event Boundary Detection: A Benchmark for Event Segmentation
This paper introduces a new task, Generic Event Boundary Detection (GEBD), together with a new benchmark, Kinetics-GEBD, for identifying taxonomy-free event boundaries in video. Traditional video temporal segmentation is restricted to predefined action categories, which limits scalability across diverse video types. GEBD instead seeks to emulate the human cognitive ability to spontaneously segment video into meaningful temporal units, without reliance on a predefined taxonomy.
Key Contributions
- Novel Task Introduction: The paper defines GEBD, marking a significant shift from conventional video analysis built on predefined action categories. Motivated by cognitive research showing that humans spontaneously segment continuous activity into discrete events, GEBD aims to detect the naturally perceived boundaries that demarcate different segments within a video.
- Benchmark Development: The authors present Kinetics-GEBD, the largest dataset targeting temporal boundaries in videos spanning various domains. Kinetics-GEBD leverages the Kinetics dataset and addresses the challenges of annotation through innovative guidelines that maintain annotation consistency across different annotators, while capturing diverse human perceptions.
- Annotation Strategy: The paper describes an annotation approach inspired by cognitive science experiments. Rather than working from predefined action classes, annotators are instructed to focus on the dominant subject and mark changes in action, subject, object, or environment. This method yields rich and diverse temporal boundary annotations.
- Comparison to Existing Benchmarks: The paper contrasts Kinetics-GEBD with other datasets such as ActivityNet and EPIC-Kitchens, noting its advantages in terms of annotation breadth and diversity. The dataset supports various applications, including video editing and summarization, offering broader implications for the understanding and reasoning of video content.
- Experimental Validation: Comprehensive experiments evaluate both supervised and unsupervised GEBD methods on the Kinetics-GEBD and TAPOS datasets. Notably, an unsupervised baseline based on predictability assessment outperforms several established supervised approaches on TAPOS, underscoring the feasibility of unsupervised event boundary detection.
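The predictability-based unsupervised approach mentioned above can be illustrated with a minimal sketch: score each frame by how dissimilar its preceding feature context is from its following context, then mark local maxima of that score as candidate boundaries. This is an illustrative simplification under assumed design choices (cosine dissimilarity, fixed context windows, a simple peak picker), not the paper's exact Predictability Assessment implementation; the function names are hypothetical.

```python
import numpy as np

def boundary_scores(features, window=5):
    """Score each frame t by the cosine dissimilarity between the mean
    feature of the preceding window and the mean feature of the
    following window. A large dissimilarity suggests the content
    changes at t, i.e. a candidate event boundary.

    Illustrative sketch only; window size and the dissimilarity
    measure are assumptions, not the paper's exact baseline.
    """
    T = len(features)
    scores = np.zeros(T)
    for t in range(window, T - window):
        past = features[t - window:t].mean(axis=0)
        future = features[t:t + window].mean(axis=0)
        denom = np.linalg.norm(past) * np.linalg.norm(future) + 1e-8
        scores[t] = 1.0 - float(past @ future) / denom
    return scores

def detect_boundaries(scores, threshold=0.5):
    """Return frame indices that are local maxima of the boundary
    score and exceed the threshold."""
    peaks = []
    for t in range(1, len(scores) - 1):
        if (scores[t] >= threshold
                and scores[t] >= scores[t - 1]
                and scores[t] > scores[t + 1]):
            peaks.append(t)
    return peaks

# Usage on synthetic per-frame features: 50 frames of one "event"
# followed by 50 frames of another, so the only boundary is at t=50.
feats = np.concatenate([np.tile([1.0, 0.0], (50, 1)),
                        np.tile([0.0, 1.0], (50, 1))])
print(detect_boundaries(boundary_scores(feats)))
```

In practice the per-frame features would come from a pretrained visual encoder; the point of the sketch is only that boundary detection can be driven by a change signal, with no action labels involved.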
Implications and Future Directions
The introduction of GEBD, as demonstrated through Kinetics-GEBD, has substantial theoretical and practical ramifications. On a theoretical level, the task advances the understanding of video segmentation by aligning computational models more closely with human cognitive processes. Practically, it opens paths for improved automation in video editing and summarization, enhancing applications such as keyframe selection and semantic video partitioning.
For future development in AI, the work suggests several avenues. The success of unsupervised methods in event boundary detection implies potential for broader application across domains, possibly reducing the need for costly annotation in video datasets. Furthermore, extending the benchmark to longer videos or to other kinds of change points, such as scene transitions, could further refine temporal segmentation models and enable the automation of complex video understanding tasks.
In conclusion, this paper provides an essential stepping stone in rethinking video segmentation by introducing GEBD, a flexible and cognitively inspired task, backed by the robust and diverse Kinetics-GEBD benchmark. The impact of this work may lead to significant innovations in video analysis technologies, contributing to the advancement of machine understanding of complex video content.