LXMERT: Learning Cross-Modality Encoder Representations from Transformers
The paper "LXMERT: Learning Cross-Modality Encoder Representations from Transformers," authored by Hao Tan and Mohit Bansal, presents a robust framework for vision-and-language reasoning tasks. The proposed LXMERT model leverages Transformer architecture to effectively learn the intricate relationships between visual concepts and language semantics. This is achieved through a sophisticated design comprising three Transformer-based encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
Abstract
LXMERT is designed to facilitate enhanced vision-and-language understanding by connecting visual and linguistic data. The framework undergoes extensive pre-training on image-and-sentence pairs across five tasks: masked language modeling, masked object prediction via feature regression, masked object prediction via detected-label classification, cross-modality matching, and image question answering. This multi-task pre-training enables LXMERT to capture both intra-modality and cross-modality relationships, leading to state-of-the-art results on visual question answering (VQA and GQA) datasets. The generalizability of LXMERT is further validated through a large gain on the challenging NLVR2 visual-reasoning dataset.
Model Architecture
LXMERT's architecture comprises hierarchical layers to process and integrate visual and textual information. Each image is decomposed into objects and each sentence into words, forming the fundamental units for processing.
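As a rough illustration of this layout, the sketch below wires together three encoder stacks in PyTorch. The layer counts, hidden size, head count, and the use of standard `nn.TransformerEncoder` modules for all three stacks are simplifying assumptions; in particular, the cross-modality stack is shown here only as joint self-attention over the concatenated sequences, whereas the actual model uses dedicated cross-attention layers (a closer sketch appears later in this section).

```python
import torch
import torch.nn as nn

HIDDEN, HEADS = 768, 12  # assumed hidden size and number of attention heads

def make_encoder(num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=HEADS, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# Assumed depths for the three stacks (language, object-relationship, cross-modality).
language_encoder = make_encoder(9)
object_encoder = make_encoder(5)
joint_encoder = make_encoder(5)  # placeholder for the cross-modality stack

words = torch.randn(2, 20, HIDDEN)    # word-level sentence embeddings (batch, words, hidden)
objects = torch.randn(2, 36, HIDDEN)  # object-level image embeddings (batch, objects, hidden)

lang_feats = language_encoder(words)   # modality-specific language features
vis_feats = object_encoder(objects)    # modality-specific visual features
# Placeholder: joint self-attention over the concatenated sequences stands in for
# the real cross-modality encoder, which cross-attends between the two streams.
cross_feats = joint_encoder(torch.cat([lang_feats, vis_feats], dim=1))
```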
Input Embeddings
The input layer generates two sequences of embeddings: word-level sentence embeddings and object-level image embeddings. For sentences, words are tokenized and mapped to word embeddings that are combined with index (position) embeddings; for images, objects are detected by a pre-trained object detector, and each object is represented by combining its feature embedding with a position embedding derived from its bounding-box coordinates.
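A minimal sketch of how these two embedding streams might be constructed is shown below. The hidden size, vocabulary size, 2048-dimensional RoI features, and 4-dimensional normalized box coordinates are assumptions, and averaging the projected feature and position embeddings loosely follows the scheme described in the paper.

```python
import torch
import torch.nn as nn

HIDDEN = 768     # assumed hidden size
VOCAB = 30522    # assumed WordPiece vocabulary size
MAX_LEN = 20     # assumed maximum sentence length
ROI_DIM = 2048   # assumed RoI-feature dimension from the detector
BOX_DIM = 4      # normalized bounding-box coordinates

class WordEmbeddings(nn.Module):
    """Word-level sentence embeddings: token embedding plus index embedding."""
    def __init__(self):
        super().__init__()
        self.token = nn.Embedding(VOCAB, HIDDEN)
        self.index = nn.Embedding(MAX_LEN, HIDDEN)
        self.norm = nn.LayerNorm(HIDDEN)

    def forward(self, token_ids):  # token_ids: (batch, words)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.norm(self.token(token_ids) + self.index(positions))

class ObjectEmbeddings(nn.Module):
    """Object-level image embeddings: a projected RoI feature averaged with a
    projected bounding-box position embedding."""
    def __init__(self):
        super().__init__()
        self.feat_proj = nn.Linear(ROI_DIM, HIDDEN)
        self.pos_proj = nn.Linear(BOX_DIM, HIDDEN)
        self.feat_norm = nn.LayerNorm(HIDDEN)
        self.pos_norm = nn.LayerNorm(HIDDEN)

    def forward(self, roi_feats, boxes):  # (batch, objects, 2048), (batch, objects, 4)
        return (self.feat_norm(self.feat_proj(roi_feats)) +
                self.pos_norm(self.pos_proj(boxes))) / 2

# Example usage with random inputs.
words = WordEmbeddings()(torch.randint(0, VOCAB, (2, 20)))
objects = ObjectEmbeddings()(torch.randn(2, 36, ROI_DIM), torch.rand(2, 36, BOX_DIM))
```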
Encoders
- Single-Modality Encoders: Separate Transformer-based encoders for language and vision are employed to capture modality-specific features.
- Cross-Modality Encoder: This encoder integrates information from both modalities using multi-head attention mechanisms. Self-attention layers capture intra-modality relationships, while cross-attention layers handle cross-modality alignments.
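The following sketch assembles one such cross-modality layer from standard PyTorch attention modules: each stream first cross-attends to the other, then applies self-attention and a feed-forward sub-layer. The class name, dimensions, and exact residual/normalization placement are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    """One illustrative cross-modality layer: bidirectional cross-attention
    between the language and vision streams, then per-modality self-attention
    and feed-forward sub-layers, each with a residual connection and LayerNorm."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.cross_l2v = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_v2l = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.self_lang = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.self_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn_lang = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                      nn.Linear(4 * hidden, hidden))
        self.ffn_vis = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                     nn.Linear(4 * hidden, hidden))
        self.norms = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(6))

    def forward(self, lang, vis):
        # Cross-attention: each stream queries the other, computed from the
        # layer inputs so both directions see the same states.
        lang_x = self.cross_l2v(lang, vis, vis)[0]
        vis_x = self.cross_v2l(vis, lang, lang)[0]
        lang = self.norms[0](lang + lang_x)
        vis = self.norms[1](vis + vis_x)
        # Self-attention within each modality.
        lang = self.norms[2](lang + self.self_lang(lang, lang, lang)[0])
        vis = self.norms[3](vis + self.self_vis(vis, vis, vis)[0])
        # Position-wise feed-forward sub-layers.
        lang = self.norms[4](lang + self.ffn_lang(lang))
        vis = self.norms[5](vis + self.ffn_vis(vis))
        return lang, vis

# Example usage with random language (20 words) and vision (36 objects) inputs.
lang_out, vis_out = CrossModalityLayer()(torch.randn(2, 20, 768), torch.randn(2, 36, 768))
```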
Pre-Training Strategies
LXMERT incorporates five pre-training tasks designed to build a comprehensive understanding of both the visual and linguistic domains:
- Masked Cross-Modality Language Modeling: Randomly masked words in a sentence are predicted, with visual context available to resolve ambiguities.
- Masked Object Prediction: Objects in images are masked and predicted through feature regression and detected-label classification.
- Cross-Modality Matching: A classifier learns to discern whether an image and sentence pair is correctly matched.
- Image Question Answering: The model is tasked with answering questions about images, further enhancing cross-modality comprehension.
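To make the combined training signal concrete, the sketch below attaches placeholder output heads for these objectives and sums their losses. The head names, vocabulary and label sizes, equal loss weights, and the simplified handling of masked positions are all illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

HIDDEN = 768        # assumed hidden size
VOCAB = 30522       # assumed token vocabulary size
NUM_LABELS = 1600   # assumed number of detected-object classes
NUM_ANSWERS = 3129  # assumed answer-vocabulary size
ROI_DIM = 2048      # assumed RoI-feature dimension

# Hypothetical output heads, one per pre-training objective.
lm_head = nn.Linear(HIDDEN, VOCAB)          # masked cross-modality language modeling
feat_head = nn.Linear(HIDDEN, ROI_DIM)      # masked object feature regression
label_head = nn.Linear(HIDDEN, NUM_LABELS)  # masked object detected-label classification
match_head = nn.Linear(HIDDEN, 2)           # cross-modality (image-sentence) matching
qa_head = nn.Linear(HIDDEN, NUM_ANSWERS)    # image question answering

def pretraining_loss(lang_out, vis_out, cls_out, targets):
    """Sum the five objectives with equal weights (an assumption). For brevity the
    regression and label-classification terms are computed over all objects and the
    QA term over every example; in practice they would apply only to masked objects
    and to matched question pairs, respectively."""
    ce = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks positions to ignore
    loss = ce(lm_head(lang_out).flatten(0, 1), targets["masked_tokens"].flatten())
    loss = loss + nn.MSELoss()(feat_head(vis_out), targets["roi_features"])
    loss = loss + ce(label_head(vis_out).flatten(0, 1), targets["detected_labels"].flatten())
    loss = loss + ce(match_head(cls_out), targets["is_matched"])
    loss = loss + ce(qa_head(cls_out), targets["answer"])
    return loss
```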
Empirical Results
LXMERT demonstrates considerable empirical success across multiple datasets:
- VQA and GQA Datasets: Achieves state-of-the-art accuracies of 72.5% and 60.3%, respectively. These results highlight the robustness of the framework in handling diverse visual question answering scenarios.
- NLVR2 Dataset: Improves the previous best accuracy by 22% absolute, underlining the model's ability to generalize effectively to complex visual reasoning tasks.
Analysis
The paper includes detailed ablation studies, comparing LXMERT against various configurations and pre-training setups. Key findings include:
- LXMERT's multi-task pre-training significantly outperforms models pre-trained with language-only tasks.
- Integrating positional embeddings and relational understanding of visual objects critically enhances performance.
- Vision-related pre-training tasks like masked object prediction contribute substantially to the model's capability.
Future Work and Implications
The research suggests that further improvements could be pursued by incorporating additional pre-training tasks to better capture noun-verb relationships in cross-modality contexts. Moreover, there is potential to explore more nuanced integration techniques within the cross-modality encoder to further refine feature alignment between modalities.
Conclusion
LXMERT establishes itself as an effective framework for vision-and-language reasoning, leveraging a detailed multi-encoder architecture and rigorous multi-task pre-training approach. Its success across multiple challenging datasets underscores its potential as a foundational model for future advancements in AI-driven visual and linguistic integration tasks.