The paper "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset" (Chen et al., 2023 ) introduces a new large-scale dataset, VAST-27M, and a foundation model, VAST, designed to improve video understanding by incorporating vision, audio, and subtitle modalities alongside text. Existing video-text models often focus solely on vision and text, neglecting other valuable information streams present in videos like audio (environmental sounds, speech) and subtitles (transcriptions, dialogue). VAST aims to bridge this gap by enabling a more comprehensive understanding of video content through large-scale pretraining on multi-modal data.
The core contribution is the VAST-27M dataset, which comprises 27 million video clips, each paired with vision, audio, subtitle, and a novel omni-modality caption. The dataset is generated automatically using a two-stage pipeline. First, separate vision and audio captioners are trained on existing single-modality datasets. These captioners are then used to generate descriptions for the visual and audio content of video clips sampled from the HD-VILA-100M dataset [xue2022advancing]. In the second stage, an off-the-shelf LLM, specifically Vicuna-13B [vicuna2023], acts as an omni-modality captioner: it takes the generated single-modality captions, raw subtitles, and instructional prompts as input and synthesizes them into a single, coherent omni-modality caption. This automated pipeline scales dataset creation far more efficiently than manual annotation, addressing the challenge of obtaining large-scale multi-modal video-text data. VAST-27M provides 11 captions per video clip (5 vision, 5 audio, 1 omni-modality), totaling 297 million captions, and is significantly larger and more diverse in modality coverage than prior corpora such as HowTo100M [miech2019howto100m] or WebVid [bain2021frozen]. The automatically generated vision and audio captions are shown to be of higher quality for training modality-specific tasks than raw subtitles or existing auto-generated datasets such as WavCaps [mei2023wavcaps] when evaluated on downstream tasks.
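The exact prompt wording and LLM interface are not reproduced here, but the second-stage synthesis can be pictured roughly as in the sketch below; `build_prompt`, `llm_generate`, and the prompt text are illustrative placeholders, not the paper's actual implementation.

```python
# Sketch of the second-stage omni-modality caption synthesis in VAST-27M.
# `llm_generate` and the prompt wording are illustrative placeholders.

def build_prompt(vision_captions, audio_captions, subtitle):
    """Assemble an instructional prompt for the LLM (e.g. Vicuna-13B)."""
    return (
        "Integrate the visual description, the audio description, and the "
        "subtitle into one natural caption describing the whole video.\n"
        f"Visual: {' '.join(vision_captions)}\n"
        f"Audio: {' '.join(audio_captions)}\n"
        f"Subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )

def synthesize_omni_caption(clip, llm_generate):
    """clip: dict holding the auto-generated captions and the raw subtitle."""
    prompt = build_prompt(clip["vision_captions"],  # from the vision captioner
                          clip["audio_captions"],   # from the audio captioner
                          clip["subtitle"])         # raw transcription
    return llm_generate(prompt)                     # one omni-modality caption
```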
Based on this dataset, the VAST foundation model is trained. VAST utilizes a Transformer-based architecture consisting of three encoders: a vision encoder (initialized from EVAClip-ViT-G [sun2023eva]), an audio encoder (initialized from BEATs [chen2022beats]), and a text encoder (initialized from BERT-B [devlin2018bert]). The text encoder also incorporates cross-attention mechanisms to fuse information from different modalities. The model processes raw video frames, audio spectrograms, and tokenized subtitles and captions. For pretraining, VAST is trained with three objectives:
- Omni-Modality Video-Caption Contrastive Loss (OM-VCC): This objective aligns the global representations of the omni-modal video (concatenation of subtitle, vision, and audio global features) and the omni-modality caption in a shared semantic space using a contrastive loss, similar to CLIP [radford2021learning].
$$\mathcal{L}_{\mathrm{OM\text{-}VCC}} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(v_i^{\top} c_i/\tau)}{\sum_{j=1}^{B}\exp(v_i^{\top} c_j/\tau)} + \log\frac{\exp(v_i^{\top} c_i/\tau)}{\sum_{j=1}^{B}\exp(v_j^{\top} c_i/\tau)}\right]$$
where $v_i$ is the $i$-th omni-modal video global representation, $c_i$ is the $i$-th omni-modality caption global representation, $B$ is the batch size, $v_i^{\top} c_j$ is the dot product, and $\tau$ is a learnable temperature.
- Omni-Modality Video-Caption Matching Loss (OM-VCM): This binary classification task predicts if a given video and caption pair is matched. It utilizes cross-attention where the text encoder attends to the concatenated unpooled features of all modalities, providing fine-grained interaction. Hard negative mining is used to create challenging negative pairs.
$$\mathcal{L}_{\mathrm{OM\text{-}VCM}} = -\frac{1}{B}\sum_{i=1}^{B}\left[y_i\log p_i + (1-y_i)\log(1-p_i)\right]$$
where $y_i = 1$ for a matched pair and $y_i = 0$ otherwise, and $p_i$ is the predicted matching probability.
- Omni-Modality Video Caption Generation Loss (OM-VCG): This is a conditional masked language modeling task. The model predicts masked tokens in the omni-modality caption conditioned on the video features (via cross-attention) and on the unmasked caption tokens (via causal self-attention).
$$\mathcal{L}_{\mathrm{OM\text{-}VCG}} = -\,\mathbb{E}\left[\log P\!\left(c_m \mid c_{<m}, V\right)\right]$$
where $c_m$ are the masked caption tokens, $c_{<m}$ are the tokens before the masked ones, and $V$ denotes the omni-modal video features. (A code sketch of all three objectives follows this list.)
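As a rough illustration of the three pretraining objectives, the PyTorch-style sketch below computes CLIP-style contrastive, matching, and masked-generation losses from already-pooled or already-decoded features. All function and argument names are assumptions, and the heads that produce `match_logits` and `token_logits` are abstracted away.

```python
import torch
import torch.nn.functional as F

def om_vcc_loss(video_emb, caption_emb, temperature):
    """OM-VCC: symmetric CLIP-style contrastive loss.
    video_emb:   (B, D) omni-modal video global features (projection of the
                 concatenated subtitle/vision/audio global features)
    caption_emb: (B, D) omni-modality caption global features"""
    video_emb = F.normalize(video_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = video_emb @ caption_emb.t() / temperature          # (B, B)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def om_vcm_loss(match_logits, labels):
    """OM-VCM: binary matching loss over (video, caption) pairs; hard
    negatives are typically mined from the contrastive similarity matrix
    before the pairs reach this head."""
    return F.binary_cross_entropy_with_logits(match_logits, labels.float())

def om_vcg_loss(token_logits, masked_token_ids):
    """OM-VCG: cross-entropy over masked caption positions, conditioned on
    video features via cross-attention inside the text encoder (not shown).
    Unmasked positions are marked with -100 and ignored."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           masked_token_ids.view(-1),
                           ignore_index=-100)
```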
To handle scenarios where some modalities might be absent in downstream tasks, VAST incorporates a "Modality Grouping" strategy during training. This involves training on relations between text and various combinations of video modalities, including V-T, A-T, VA-T, VS-T, and VAS-T, using corresponding vision, audio, and omni-modality captions. This ensures robustness and generalizability across different task setups. The overall loss is a weighted sum of losses for all considered modality groups and objectives.
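A minimal sketch of how the modality-grouping strategy could aggregate the weighted per-group losses is shown below, assuming helper functions `encode_video`, `encode_text`, and loss heads like those in the previous sketch; the group weights and the pairing of captions to groups are placeholders rather than the paper's exact recipe.

```python
# Sketch of the modality-grouping loss aggregation. `encode_video` is assumed
# to pool and concatenate only the listed modalities; the per-group weights
# and the caption pairing are placeholders, not the paper's exact recipe.

MODALITY_GROUPS = {
    "V-T":   ["vision"],
    "A-T":   ["audio"],
    "VA-T":  ["vision", "audio"],
    "VS-T":  ["vision", "subtitle"],
    "VAS-T": ["vision", "audio", "subtitle"],
}

def grouped_training_loss(batch, encode_video, encode_text, heads, weights=None):
    total = 0.0
    for group, modalities in MODALITY_GROUPS.items():
        w = 1.0 if weights is None else weights[group]
        video_emb = encode_video(batch, modalities)          # group-specific video features
        caption_emb = encode_text(batch["captions"][group])  # vision/audio/omni caption
        group_loss = (heads.vcc(video_emb, caption_emb)
                      + heads.vcm(batch, modalities)
                      + heads.vcg(batch, modalities))
        total = total + w * group_loss
    return total
```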
For practical implementation, VAST has 1.3B parameters. It was trained on 64 Tesla V100 GPUs for 205K steps on a combined corpus including VAST-27M, VALOR-1M [chen2023valor], WavCaps [mei2023wavcaps], and large-scale image-text datasets (CC14M, LAION-400M subsets). Image-text data captions were replaced with those generated by the vision captioner for consistency. Specific hyperparameters for downstream task fine-tuning are provided in the appendix, including learning rates, batch sizes, epochs, and the number of video frames and audio clips sampled.
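As a sketch of typical input preprocessing (uniform frame sampling and log-mel spectrogram extraction), with parameter values chosen as common defaults rather than the settings in the paper's appendix:

```python
import torch
import torchaudio

def sample_frames(video_frames, num_frames=8):
    """Uniformly sample `num_frames` frames from a decoded clip of shape
    (T, C, H, W); the frame count is a common default, not necessarily the
    value reported in the paper's appendix."""
    t = video_frames.size(0)
    idx = torch.linspace(0, t - 1, num_frames).long()
    return video_frames[idx]

def audio_to_logmel(waveform, sample_rate=16000, n_mels=128):
    """Convert a mono waveform (1, N) into a log-mel spectrogram, the usual
    input format for audio encoders such as BEATs; parameters are typical
    defaults rather than the paper's settings."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=n_mels)(waveform)
    return torch.log(mel + 1e-6)
```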
VAST demonstrates strong performance across a wide range of downstream benchmarks, achieving state-of-the-art results on 22 benchmarks covering vision-text, audio-text, and multi-modal video-text tasks like retrieval, captioning, and question answering. Notably, it shows significant improvements on audio-text benchmarks due to the scale and quality of audio captions in VAST-27M and excels on multi-modal benchmarks requiring reasoning over audio and/or subtitles, such as YouCook2 [zhou2018towards], VALOR-32K [chen2023valor], and TVC [lei2020tvr]. Ablation studies confirm the benefit of the omni-modality captions generated by the LLM and the effectiveness of multi-modal pretraining and the modality grouping strategy for improving performance and generalization.
The practical applications of VAST are broad, including enhanced video search and recommendation (e.g., finding videos based on sounds or dialogue), automatic generation of detailed video descriptions for accessibility or content summarization, and more robust video question answering systems that can leverage all available information streams. Developers can leverage the pre-trained VAST model as a foundation for building applications requiring complex video understanding. Fine-tuning on specific domain datasets can adapt VAST to tasks like educational video analysis, surveillance footage interpretation, or entertainment content processing. Computational requirements for training are substantial (large-scale distributed training), but inference and fine-tuning on smaller datasets would be more manageable, depending on the model size and the number of modalities used. Potential limitations include the dependency on the quality of auto-generated captions and subtitles, which might inherit biases from the source data and LLM, and the need for further integration with larger LLMs for more advanced generation capabilities. The release of the dataset and model aims to facilitate further research in omni-modal video understanding.
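For example, text-to-video retrieval with a released checkpoint might look like the sketch below; `model.encode_text` and `model.encode_video` are hypothetical wrapper names for whatever interface the released code exposes and are assumed to return one 1-D embedding vector per input.

```python
import torch
import torch.nn.functional as F

def rank_videos(model, query_text, video_clips):
    """Rank candidate clips for a text query by cosine similarity between the
    omni-modal video embeddings and the text embedding. `model.encode_text`
    and `model.encode_video` are hypothetical wrappers assumed to return one
    1-D embedding vector per input."""
    with torch.no_grad():
        text_emb = F.normalize(model.encode_text(query_text), dim=-1)            # (D,)
        video_embs = F.normalize(
            torch.stack([model.encode_video(c) for c in video_clips]), dim=-1)   # (N, D)
    scores = video_embs @ text_emb                                               # (N,)
    return scores.argsort(descending=True).tolist()                              # best match first
```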