
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

Published 1 May 2020 in cs.CV, cs.CL, and cs.LG | arXiv:2005.00200v2

Abstract: We present HERO, a novel framework for large-scale video+language omni-representation learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. In addition to standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions. Comprehensive experiments demonstrate that HERO achieves new state of the art on multiple benchmarks over Text-based Video/Video-moment Retrieval, Video Question Answering (QA), Video-and-language Inference and Video Captioning tasks across different domains. We also introduce two new challenging benchmarks How2QA and How2R for Video QA and Retrieval, collected from diverse video content over multimodalities.

Citations (467)

Summary

  • The paper presents a hierarchical encoder that integrates video and language data via cross-modal and temporal transformers to significantly improve alignment and contextual understanding.
  • The methodology employs four pre-training tasks (MLM, MFM, VSM, FOM) that enable detailed temporal sequencing and fine-grained correlation between video and subtitles.
  • Empirical results on diverse datasets reveal marked improvements in retrieval, question answering, inference, and captioning, establishing new state-of-the-art results on multiple benchmarks.

Hierarchical Encoder for Video-Language Pre-training: A Detailed Examination of HERO

The paper under review presents HERO, a framework for large-scale video+language omni-representation pre-training. HERO couples a hierarchical encoder with new pre-training objectives to strengthen multimodal representation learning from video and language, and its comprehensive evaluation demonstrates significant advances across several video-language tasks.

Model Architecture

HERO is structured around a two-level Transformer architecture: a Cross-modal Transformer first fuses each subtitle sentence with its corresponding video frames, and a Temporal Transformer then captures the global context of the entire video clip from the frame-level outputs. This hierarchical design supports a more nuanced treatment of temporal and contextual information than previous BERT-like models with flat architectures.
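To make the two-level design concrete, the following is a minimal PyTorch sketch of such a hierarchy, assuming a concatenation-based fusion of frame and subtitle features; the module names, layer counts, and dimensions are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a two-level (cross-modal + temporal) encoder in PyTorch.
# Hyperparameters, the concatenation-based fusion, and module names are
# illustrative assumptions, not HERO's exact configuration.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_cross_layers=6, n_temporal_layers=3):
        super().__init__()
        cross_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_modal = nn.TransformerEncoder(cross_layer, n_cross_layers)
        temporal_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(temporal_layer, n_temporal_layers)

    def forward(self, frame_feats, subtitle_embeds):
        # frame_feats:     (B, n_frames, d_model) visual features of one subtitle segment
        # subtitle_embeds: (B, n_tokens, d_model) token embeddings of the aligned subtitle
        fused = self.cross_modal(torch.cat([frame_feats, subtitle_embeds], dim=1))
        local_frames = fused[:, : frame_feats.size(1)]   # keep the frame positions only
        global_frames = self.temporal(local_frames)      # global context over the clip
        return local_frames, global_frames
```

In this sketch, only the locally fused frame representations are passed to the Temporal Transformer, so subtitle information enters the global video context indirectly through the cross-modal fusion.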

Pre-training Tasks

The paper proposes four pre-training tasks: Masked Language Modeling (MLM), Masked Frame Modeling (MFM), Video-Subtitle Matching (VSM), and Frame Order Modeling (FOM). Among these, VSM and FOM are pivotal as they enhance the model's ability to align temporal information between video frames and subtitle text. VSM focuses on both global and local alignment of video and subtitle pairs, while FOM leverages the sequential nature of videos to predict the original order of shuffled frames.
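As a concrete illustration of how an FOM-style objective can be set up, the sketch below shuffles a subset of frame embeddings and trains a classification head to recover each shuffled frame's original position; the shuffle ratio, the linear order head, and all names are expository assumptions rather than the paper's exact recipe.

```python
# Illustrative Frame Order Modeling (FOM)-style objective: shuffle a subset of
# frames and classify each shuffled slot's original index. The shuffle ratio,
# linear order head, and names are assumptions, not HERO's exact implementation.
import torch
import torch.nn.functional as F

def fom_loss(temporal_encoder, order_head, frame_embeds, shuffle_ratio=0.15):
    # frame_embeds: (B, n_frames, d); order_head: e.g. nn.Linear(d, max_positions)
    b, n, d = frame_embeds.shape
    device = frame_embeds.device
    perm = torch.arange(n, device=device).repeat(b, 1)            # identity order per sample
    targets = torch.full((b, n), -100, dtype=torch.long, device=device)
    n_shuf = max(1, int(n * shuffle_ratio))
    for i in range(b):
        idx = torch.randperm(n, device=device)[:n_shuf]           # slots chosen for shuffling
        perm[i, idx] = idx[torch.randperm(n_shuf, device=device)] # redistribute those frames
        targets[i, idx] = perm[i, idx]   # label = original index of the frame now in each slot
    shuffled = torch.gather(frame_embeds, 1, perm.unsqueeze(-1).expand(-1, -1, d))
    logits = order_head(temporal_encoder(shuffled))               # (B, n_frames, max_positions)
    # Cross-entropy only over the shuffled slots (targets of -100 are ignored by default).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

Usage would look like `fom_loss(temporal_encoder, nn.Linear(768, 100), frame_embeds)` for frame embeddings of shape `(B, n_frames, 768)`, where 100 is an assumed maximum number of frame positions.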

Datasets and Evaluation

HERO is trained on extensive datasets, including HowTo100M and a large-scale TV dataset, which together cover diverse video content across instructional and entertainment domains. These datasets allow HERO to capture a broad range of visual and textual information, which is beneficial for complex tasks requiring comprehensive social and contextual understanding.

The evaluation demonstrates HERO's superior performance across several benchmarks: Text-based Video/Video-moment Retrieval, Video Question Answering, Video-and-language Inference, and Video Captioning. This is evident in HERO's state-of-the-art results on the TVR, TVQA, How2R, and How2QA tasks.

Key Numerical Results

The empirical results validate HERO's design choices. In particular, incorporating VSM and FOM significantly improves retrieval metrics such as R@1 and R@10 on various datasets, indicating strong cross-modal alignment and temporal understanding. The paper reports that pre-training with these tasks markedly boosts performance on downstream video-language tasks.
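For reference, R@K in retrieval simply measures the fraction of queries whose ground-truth item is ranked within the top K by similarity; a generic implementation of the metric (not code from the paper) is shown below.

```python
# Generic Recall@K for retrieval-style evaluation (the standard definition,
# not code taken from the HERO paper or its released implementation).
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    # sim: (n_queries, n_items) similarity scores; ground truth for query i is item i.
    ranks = sim.argsort(dim=1, descending=True)               # items sorted by score per query
    gt = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    positions = (ranks == gt).nonzero()[:, 1]                 # 0-based rank of the true item
    return {f"R@{k}": (positions < k).float().mean().item() for k in ks}

# Example: recall_at_k(text_embeds @ video_embeds.T) -> {'R@1': ..., 'R@5': ..., 'R@10': ...}
```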

Practical and Theoretical Implications

Practically, HERO's architecture can be extended to many video-language applications that require robust handling of multimodal data. Theoretically, it opens avenues for further exploration of hierarchical models for multimodal representation, illustrating that temporal and spatial context alignment can enhance learning in multimodal systems.

Future Directions

The paper suggests exploring region-level video representations and extending the model's design to tasks such as region-focused video reasoning. Evaluating HERO in broader contexts beyond instructional and TV datasets could further validate its effectiveness and adaptability.

In summary, HERO represents a significant advance in video-language pre-training through its hierarchical architecture and new pre-training tasks. Its ability to effectively integrate and contextualize video and language sets a new benchmark in the field, paving the way for future developments in multimodal AI research.
