An Essay on VD-BERT: A Unified Vision and Dialog Transformer with BERT
The paper introduces VD-BERT, a framework that unifies vision and dialog tasks in a single Transformer architecture built on pretrained BERT. VD-BERT targets the Visual Dialog (VisDial) challenge, in which an AI agent must answer a series of questions grounded in both the image content and the dialog history. Unlike single-turn Visual Question Answering (VQA), VisDial requires the agent to sustain a coherent exchange over multiple conversational turns, demanding a more sophisticated integration of vision and dialog.
Key Contributions and Architecture
VD-BERT distinguishes itself by adopting a single-stream Transformer encoder that models the interactions between the image and the multi-turn dialog in a unified way, supporting both answer ranking and answer generation within the same architecture. This integration relies on bidirectional attention, which allows every entity (image regions, text fragments, etc.) to act as both an information seeker and an information provider.
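To make the single-stream design concrete, here is a minimal sketch (not the authors' code) of how projected image-region features and dialog tokens can be concatenated into one sequence and processed by a shared stack of bidirectional self-attention layers. The dimensions, the region-feature size, and the toy random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden_dim, num_regions, vocab_size = 768, 36, 30522

# Visual stream: project detector region features (assumed 2048-d here) into
# the Transformer's embedding space so they can share a sequence with text.
region_feats = torch.randn(1, num_regions, 2048)   # toy stand-in for detected image regions
visual_proj = nn.Linear(2048, hidden_dim)
visual_tokens = visual_proj(region_feats)          # (1, 36, 768)

# Textual stream: dialog history + current question + candidate answer,
# faked here as random token ids.
text_ids = torch.randint(0, vocab_size, (1, 64))
word_emb = nn.Embedding(vocab_size, hidden_dim)
text_tokens = word_emb(text_ids)                   # (1, 64, 768)

# Single-stream encoder: one stack of bidirectional self-attention layers over
# the concatenated visual + textual tokens, so every image region can attend
# to every dialog token and vice versa.
encoder_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

sequence = torch.cat([visual_tokens, text_tokens], dim=1)   # (1, 36 + 64, 768)
fused = encoder(sequence)                                    # cross-modal contextualized states
print(fused.shape)                                           # torch.Size([1, 100, 768])
```

Because all positions attend to one another, an image region can shape how a dialog token is encoded and vice versa, which is what the single-stream, bidirectional design offers over keeping the two modalities in separate streams.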
A significant aspect of VD-BERT’s architecture is its visually grounded training objectives. These are Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), adapted to incorporate visual features, and they effectively drive the fusion of visual and dialog content. Notably, VD-BERT achieves state-of-the-art results without pretraining on external vision-language datasets, underscoring the efficacy of its architecture.
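A hedged sketch of the two objectives follows, reusing the fused encoder states from the previous snippet; the masking rate, head shapes, and the assumption that the first text position acts as [CLS] are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, vocab_size, num_regions, text_len = 768, 30522, 36, 64
fused = torch.randn(1, num_regions + text_len, hidden_dim)  # stand-in for encoder output
text_ids = torch.randint(0, vocab_size, (1, text_len))

# Visually grounded MLM: mask ~15% of text positions and predict the original
# ids from their fused (image-conditioned) representations.
mlm_head = nn.Linear(hidden_dim, vocab_size)
mask = torch.rand(1, text_len) < 0.15
mask[0, -1] = True                                 # ensure at least one masked position in this toy example
text_states = fused[:, num_regions:, :]            # text portion of the sequence
mlm_logits = mlm_head(text_states[mask])           # (num_masked, vocab)
mlm_loss = F.cross_entropy(mlm_logits, text_ids[mask])

# NSP repurposed for answer ranking: a binary head on the [CLS]-like state
# decides whether the appended candidate answer fits this image + dialog context.
nsp_head = nn.Linear(hidden_dim, 2)
cls_state = fused[:, num_regions, :]               # first text position, assumed to act as [CLS]
nsp_logits = nsp_head(cls_state)                   # (1, 2)
nsp_loss = F.cross_entropy(nsp_logits, torch.tensor([1]))  # label 1 = correct answer candidate

loss = mlm_loss + nsp_loss
print(float(loss))
```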
VD-BERT’s approach to adapting BERT for a multimodal task demonstrates how pretrained language models can be extended to complex vision-language problems through relatively straightforward modifications. This contributes to the ongoing discourse on the flexibility and adaptability of Transformer-based models across AI domains.
Experimental Results
The experimental results underscore VD-BERT’s strong performance, establishing new benchmarks on visual dialog tasks. It performs robustly in both discriminative and generative settings, showing its versatility across evaluation metrics such as Recall@K, MRR, and Mean Rank. In particular, VD-BERT attains top scores on the Visual Dialog leaderboard, surpassing many preceding models on NDCG and other ranking-related metrics.
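For readers unfamiliar with these metrics, the sketch below computes them for a single example with toy scores over the standard 100 answer candidates; the scores, the ground-truth index, and the dense relevance values are made up, while the metric definitions follow their usual form.

```python
import numpy as np

scores = np.random.rand(100)     # toy model scores for the 100 answer candidates
gt_index = 7                     # index of the ground-truth answer (illustrative)

# Rank of the ground truth: 1 + number of candidates scored strictly higher.
rank = 1 + int((scores > scores[gt_index]).sum())

recall_at = {k: float(rank <= k) for k in (1, 5, 10)}   # Recall@K for this example
mrr = 1.0 / rank                                        # reciprocal rank (averaged over data -> MRR)
mean_rank = rank                                        # averaged over data -> Mean Rank

# NDCG uses dense relevance scores over the candidates (VisDial v1.0 provides
# them on a subset); random stand-ins here. K is the number of relevant candidates.
relevance = np.random.rand(100)
order = np.argsort(-scores)
k = int((relevance > 0).sum())
discounts = np.log2(np.arange(2, k + 2))
dcg = (relevance[order][:k] / discounts).sum()
ideal_dcg = (np.sort(relevance)[::-1][:k] / discounts).sum()
ndcg = dcg / ideal_dcg

print(recall_at, mrr, mean_rank, round(float(ndcg), 3))
```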
This success is attributed to VD-BERT’s training method, particularly its visually grounded MLM and NSP objectives, which let a single encoder support both the discriminative (ranking) and generative dialog settings without a separate decoder.
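One way to picture how a single encoder covers both settings, in the spirit of UniLM-style masking, is the self-attention mask sketched below: context positions (image, history, question) attend bidirectionally, while candidate-answer positions attend only to the context and to earlier answer tokens, so answer likelihoods can be scored or decoded left to right without adding a decoder stack. The sequence lengths are arbitrary toy values, and the snippet is an illustration rather than the paper's implementation.

```python
import torch

ctx_len, ans_len = 80, 20        # toy lengths: image + history + question vs. answer
total = ctx_len + ans_len

# allowed[i, j] == True means position i may attend to position j.
allowed = torch.zeros(total, total, dtype=torch.bool)
allowed[:ctx_len, :ctx_len] = True                           # bidirectional within the context
allowed[ctx_len:, :ctx_len] = True                           # answer tokens see the full context
ans = torch.arange(ans_len)
allowed[ctx_len:, ctx_len:] = ans[:, None] >= ans[None, :]   # causal (left-to-right) within the answer

# Discriminative setting: keep attention fully bidirectional and rank candidates
# by their NSP scores. Generative setting: apply `allowed` so the same encoder
# predicts answer tokens autoregressively, with no decoder stack needed.
print(allowed.shape, allowed[ctx_len, ctx_len:ctx_len + 3].tolist())   # first answer token sees only itself
```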
Implications and Future Directions
The practical implications of VD-BERT are significant. Its ability to model detailed interactions between an image and the dialog history could enhance AI systems in customer service, human-computer interaction, and education, where contextual understanding of visual and textual inputs is crucial. Theoretically, the paper reinforces the potential of extending pretrained language models beyond purely linguistic tasks to multimodal applications.
Future research could explore integrating larger-scale pretrained models and more diverse datasets to further generalize VD-BERT’s capabilities. Expanding this unified framework to other vision-language tasks, like video dialog or interactive storytelling, may also offer promising avenues for advancing AI comprehension and reasoning.
In conclusion, VD-BERT exemplifies the potential of Transformer-based architectures to push the boundaries of AI’s ability to engage in complex multimodal dialog tasks. Its approach to vision-dialog integration marks a significant stride in AI research, paving the way for further exploration in this dynamic field.