- The paper presents a largely automated method for extracting and aligning DVS narrations from 92 DVDs to build M-VAD, which the authors describe as the largest DVS-derived video annotation dataset.
- It combines vocal isolation with LMS noise cancellation to segment the narrations, which are typically misaligned with the corresponding visual content by less than two seconds.
- M-VAD comprises over 84.6 hours of video across 48,986 clips averaging 6.2 seconds each, offering a high-quality resource for deep learning research.
An Analysis of Creating a Large Video Annotation Dataset Using Descriptive Video Services
The paper introduces a novel dataset derived from Descriptive Video Service (DVS) narrations found on DVDs, which aims to serve as a significant resource for video annotation research. The authors developed an automated process for segmenting and aligning DVS audio tracks with corresponding video content, creating what they believe to be the largest DVS-derived dataset available. This collection, named the Montreal Video Annotation Dataset (M-VAD), comprises over 84.6 hours of paired video and descriptive text from 92 DVDs.
Methodological Approach
The core contribution of this work is a scalable method for dataset construction using DVS audio tracks, which provide narrated descriptions of visual content for visually impaired viewers. DVS differs from traditional movie scripts in that it is tightly aligned with the visual scenes, describing actions, appearances, and other visual elements with a temporal misalignment that typically does not exceed two seconds.
The authors implemented a semi-automated system to isolate and segment these narrations from the mixed movie soundtracks. They combined vocal isolation techniques with Least Mean Square (LMS) noise cancellation to separate the DVS narration from the original soundtrack. The approach exploits the fact that DVS narrations are inserted during natural pauses in dialogue, which allows them to be extracted cleanly.
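To make the separation step concrete, the sketch below implements a textbook LMS adaptive noise canceller in Python/NumPy. It is a minimal illustration rather than the authors' implementation: the filter length, step size, the assumption that a narration-free reference soundtrack is available, and the `load_wav` helper are all illustrative choices.

```python
import numpy as np

def lms_cancel(mixed, reference, filter_len=64, mu=0.005):
    """LMS adaptive noise cancellation (illustrative sketch).

    mixed     : DVS audio track (narration mixed over the movie soundtrack)
    reference : original movie soundtrack without narration
    Returns the error signal, which approximates the isolated narration.
    """
    n = min(len(mixed), len(reference))
    w = np.zeros(filter_len)                    # adaptive filter weights
    narration = np.zeros(n)
    for i in range(filter_len, n):
        x = reference[i - filter_len:i][::-1]   # most recent reference samples
        y = np.dot(w, x)                        # estimate of soundtrack leakage
        e = mixed[i] - y                        # residual ≈ narration
        w += 2 * mu * e * x                     # LMS weight update
        narration[i] = e
    return narration

# Usage (hypothetical helper): narration = lms_cancel(load_wav("dvs.wav"), load_wav("movie.wav"))
```

The intuition is that the filter learns to predict the movie audio present in the DVS track from the reference signal, so whatever it cannot predict, chiefly the narration, remains in the residual.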
Dataset Characteristics and Comparison
The resulting M-VAD dataset contains 48,986 video clips with an average length of 6.2 seconds. It is distinctive in its reliance on professionally produced descriptions, as opposed to the crowd-sourced descriptions found in many other datasets.
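To illustrate how such clip-level statistics might be computed, the following sketch reads a hypothetical CSV metadata file (the column names and layout are assumptions, not M-VAD's actual distribution format) and reports clip count, total duration, and average clip length.

```python
import csv
from statistics import mean

def clip_stats(annotation_csv):
    """Summarize clips from a metadata file with columns: movie, start_sec, end_sec, description."""
    durations = []
    with open(annotation_csv, newline="") as f:
        for row in csv.DictReader(f):
            durations.append(float(row["end_sec"]) - float(row["start_sec"]))
    return {
        "num_clips": len(durations),
        "total_hours": sum(durations) / 3600.0,
        "avg_clip_sec": mean(durations),
    }

# For M-VAD one would expect roughly: 48,986 clips, ~84.6 hours, ~6.2 s per clip.
```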
A detailed comparison with existing datasets highlights M-VAD's comprehensive scope:
- M-VAD covers a wide variety of movies and genres, unlike cooking-centric datasets such as TACoS.
- It surpasses previous efforts in both size and the quality of its natural language descriptions, owing to the professionally produced DVS narrations.
The authors analyzed the dataset's corpus with part-of-speech (POS) tagging to characterize its linguistic features. The vocabulary contains a large number of nouns, verbs, and adjectives, reflecting the descriptive richness of DVS narrations.
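A minimal version of such an analysis could use NLTK's off-the-shelf perceptron tagger, as sketched below; whitespace tokenization and the coarse noun/verb/adjective grouping are simplifying assumptions, not the paper's exact procedure.

```python
import nltk
from collections import Counter

# The tagger model must be downloaded once; the resource name can differ across NLTK versions.
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_counts(descriptions):
    """Count nouns, verbs, and adjectives over a list of description sentences."""
    counts = Counter()
    for sentence in descriptions:
        tokens = sentence.split()               # simple whitespace tokenization
        for _, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN"):
                counts["nouns"] += 1
            elif tag.startswith("VB"):
                counts["verbs"] += 1
            elif tag.startswith("JJ"):
                counts["adjectives"] += 1
    return counts

print(pos_counts(["A man walks slowly across a dark , empty street ."]))
```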
Implications and Future Work
Practically, the M-VAD dataset holds significant promise for advancing video annotation models, particularly in deep learning contexts that require extensive paired data. The dataset's high-quality descriptions can facilitate more nuanced understanding and generation of natural language descriptions from visual data.
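As an example of how such paired data could be consumed for training, the sketch below wraps clip/description pairs in PyTorch's `Dataset` interface. The CSV layout and the injected `load_video` and `tokenize` callables are hypothetical, not part of the paper or an official M-VAD loader.

```python
import csv
from torch.utils.data import Dataset

class ClipDescriptionDataset(Dataset):
    """Pairs of (video clip, DVS description) for training a captioning model."""

    def __init__(self, annotation_csv, load_video, tokenize):
        # Hypothetical CSV with columns: clip_path, description
        with open(annotation_csv, newline="") as f:
            self.samples = [(r["clip_path"], r["description"]) for r in csv.DictReader(f)]
        self.load_video = load_video    # e.g. decodes frames into a tensor
        self.tokenize = tokenize        # e.g. maps text to token ids

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        clip_path, description = self.samples[idx]
        return self.load_video(clip_path), self.tokenize(description)
```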
Theoretically, this research also raises questions about the comparative efficacy of different sources of video descriptions. By contrasting DVS with film scripts, it lays the groundwork for further studies on optimal annotation sources for machine learning tasks.
This work could be extended in several directions, including enriching the dataset with additional DVDs as they become available or further automating the segmentation and transcription processes. Future research might also explore domain adaptation of models trained on this dataset to related video understanding tasks.
In sum, this paper contributes a valuable resource and methodological insights to the video annotation and AI research community, fostering improvements in machine understanding of videos through richer datasets.