ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation (2308.00400v2)

Published 1 Aug 2023 in cs.CL and cs.MM

Abstract: Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.
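
The text-image matching step described in the abstract maps images and texts into one vector space. A common way to train such a mapping is a symmetric contrastive loss over a batch of paired embeddings, sketched below in plain Python. This is an illustration of the general technique only and is not taken from the ZRIGF code: the function names, the temperature value of 0.07, and the use of plain lists as embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_matching_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over paired text and image embeddings.

    text_emb[i] and image_emb[i] are assumed to describe the same example.
    The temperature default is an assumed value, not one from the paper.
    """
    texts = [l2_normalize(t) for t in text_emb]
    images = [l2_normalize(v) for v in image_emb]
    # logits[i][j]: scaled cosine similarity between text i and image j
    logits = [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in images]
        for t in texts
    ]
    # Transpose to score each image against every text
    logits_t = [list(col) for col in zip(*logits)]
    text_to_image = cross_entropy_diagonal(logits)
    image_to_text = cross_entropy_diagonal(logits_t)
    return 0.5 * (text_to_image + image_to_text)
```

When each text embedding is closest to its own image embedding, the loss approaches zero; when the pairs are mismatched, it grows, which is what pushes the two modalities toward a shared space during training.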

Authors (5)

Bo Zhang
Jian Wang
Hui Ma
Bo Xu
Hongfei Lin

GitHub

zhangbo-nlp/ZRIGF: [ACM MM 2023] ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation
