Grafting Pre-trained Models for Multimodal Headline Generation (2211.07210v1)
Abstract: Multimodal headline generation utilizes both video frames and transcripts to generate a natural-language title for a video. Because annotating grounded headlines for videos is labor-intensive and impractical, large-scale, manually annotated data are scarce. Previous research on pre-trained LLMs and video-LLMs has achieved significant progress in related downstream tasks. However, none of them can be directly applied to the multimodal headline architecture, which requires both a multimodal encoder and a sentence decoder. A major challenge in simply gluing an LLM and a video-LLM together is modality balance, i.e., combining the complementary abilities of the visual and language modalities. In this paper, we propose a novel approach that grafts the video encoder from a pre-trained video-LLM onto a generative pre-trained LLM. We also present a consensus fusion mechanism that integrates the different components via inter- and intra-modality relations. Experiments show that the grafted model achieves strong results on a brand-new dataset collected from real-world applications.
- Lingfeng Qiao (8 papers)
- Chen Wu (169 papers)
- Ye Liu (153 papers)
- Haoyuan Peng (3 papers)
- Di Yin (26 papers)
- Bo Ren (60 papers)
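
To make the grafting idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch: a (stand-in) video encoder's frame features are projected into the hidden space of a generative text model and combined with transcript features by a simple gated, modality-balanced fusion before decoding the headline. All module names, dimensions, and the gating scheme are illustrative assumptions, not the paper's actual consensus fusion mechanism or pre-trained components.

```python
# Hypothetical sketch of "grafting": a frozen video encoder's frame features
# are projected into a generative text model's embedding space, then mixed
# with transcript features by a learned gate (a stand-in for consensus fusion).
import torch
import torch.nn as nn


class GraftedHeadlineModel(nn.Module):
    def __init__(self, d_video=1024, d_model=768, vocab_size=32000):
        super().__init__()
        # Stand-ins for pre-trained components (in practice these would be
        # loaded from a video-LLM and a generative pre-trained LLM).
        self.video_encoder = nn.Sequential(nn.Linear(d_video, d_video), nn.GELU())
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

        # Graft point: map video features into the text model's hidden space.
        self.video_proj = nn.Linear(d_video, d_model)
        # Modality-balance gate: weighs visual context against transcript tokens.
        self.gate = nn.Linear(2 * d_model, 1)

        # Keep the grafted video encoder frozen, as one might when grafting.
        for p in self.video_encoder.parameters():
            p.requires_grad = False

    def fuse(self, vid, txt):
        # Align lengths by mean-pooling frame features and broadcasting over
        # transcript tokens; a real model would use cross-attention instead.
        vid_pooled = self.video_proj(vid).mean(dim=1, keepdim=True)
        vid_ctx = vid_pooled.expand(-1, txt.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([vid_ctx, txt], dim=-1)))
        return g * vid_ctx + (1 - g) * txt  # modality-balanced memory

    def forward(self, frames, transcript_ids, headline_ids):
        vid = self.video_encoder(frames)                    # (B, F, d_video)
        txt = self.text_encoder(self.text_embed(transcript_ids))
        memory = self.fuse(vid, txt)                        # (B, T, d_model)
        dec = self.decoder(self.text_embed(headline_ids), memory)
        return self.lm_head(dec)                            # token logits


if __name__ == "__main__":
    model = GraftedHeadlineModel()
    frames = torch.randn(2, 16, 1024)                       # 16 frame features
    transcript = torch.randint(0, 32000, (2, 40))
    headline = torch.randint(0, 32000, (2, 12))
    print(model(frames, transcript, headline).shape)        # [2, 12, 32000]
```

In this sketch the gate plays the role the abstract assigns to modality balance: it decides, per transcript position, how much visual context to blend in before the decoder generates the headline.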