Overview of Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
The paper "Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers" presents a novel approach in the domain of vision-and-language representation learning. The authors introduce Pixel-BERT, an end-to-end framework utilizing a CNN-based visual encoder integrated with multi-modal transformers to enhance the alignment of image pixels with textual semantics. This architecture directly addresses the limitations of region-based visual features typically extracted via object detection models for joint visual-textual tasks.
Conceptual Framework
Pixel-BERT leverages the strengths of BERT-like architectures for processing text while moving beyond traditional region-based features. A deep convolutional neural network (CNN) extracts pixel-level features directly from the image, without depending on bounding-box annotations, thereby bypassing the task-specific restrictions imposed by object detection models. The flattened CNN feature map is treated as a sequence of visual tokens and fed, together with the text embeddings, into a multi-modal transformer that learns joint attention over both modalities.
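To make this pipeline concrete, the minimal sketch below outlines how pixel tokens and text tokens can be combined; the ResNet-50 backbone, 768-dimensional hidden size, and standard PyTorch transformer encoder are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch (not the authors' code): a CNN backbone produces a
# spatial feature map whose flattened positions serve as "pixel" tokens,
# which are concatenated with text token embeddings and passed through a
# transformer encoder. Module choices and dimensions are assumptions.
import torch
import torch.nn as nn
import torchvision.models as tvm


class PixelTextEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12):
        super().__init__()
        # CNN visual encoder: ResNet-50 trunk without its classification head.
        backbone = tvm.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 2048, H/32, W/32)
        self.visual_proj = nn.Linear(2048, hidden)       # project pixel features to hidden size
        self.visual_type = nn.Parameter(torch.zeros(1, 1, hidden))  # visual segment embedding

        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(512, hidden)         # text position embeddings
        self.text_type = nn.Parameter(torch.zeros(1, 1, hidden))    # text segment embedding

        enc_layer = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, images, token_ids):
        # Pixel features: flatten the spatial grid into a token sequence.
        fmap = self.cnn(images)                          # (B, 2048, h, w)
        pix = fmap.flatten(2).transpose(1, 2)            # (B, h*w, 2048)
        pix = self.visual_proj(pix) + self.visual_type   # (B, h*w, hidden)

        # Text features: token + position + segment embeddings.
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        txt = self.token_emb(token_ids) + self.pos_emb(pos) + self.text_type

        # Joint text-pixel sequence through the multi-modal transformer.
        return self.transformer(torch.cat([txt, pix], dim=1))
```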
Methodological Innovations
A significant contribution of Pixel-BERT is its use of pixel-level features in conjunction with multi-modal transformers. This allows dense visual and language embeddings to be learned jointly, narrowing the semantic gap left by prior models that rely on pre-defined region features. To make the visual representation more robust, the authors introduce a random pixel sampling mechanism that keeps only a subset of pixel features for each image during the pre-training phase, which also helps prevent overfitting.
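The snippet below gives a minimal sketch of such a sampling step; the function name and the default of 100 sampled positions per image are illustrative assumptions, and the operation applies only during pre-training, as noted above.

```python
# Minimal sketch of random pixel sampling during pre-training: keep only a
# random subset of the flattened pixel-feature tokens for each image in the
# batch. The default sample size is an assumption for illustration.
import torch


def random_pixel_sample(pixel_feats: torch.Tensor, num_keep: int = 100) -> torch.Tensor:
    """pixel_feats: (B, N, D) flattened CNN feature map; returns (B, num_keep, D)."""
    B, N, D = pixel_feats.shape
    num_keep = min(num_keep, N)
    # Draw an independent random subset of spatial positions for each image.
    idx = torch.stack(
        [torch.randperm(N, device=pixel_feats.device)[:num_keep] for _ in range(B)]
    )
    return torch.gather(pixel_feats, 1, idx.unsqueeze(-1).expand(-1, -1, D))
```

Because self-attention scales quadratically with sequence length, dropping spatial positions in this way also keeps the joint text-pixel sequence short during pre-training.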
The model is pre-trained on the large-scale Visual Genome and MS-COCO datasets using a combination of self-supervised tasks: Masked Language Modeling (MLM), which predicts masked text tokens from the joint contextual embeddings, and Image-Text Matching (ITM), which classifies whether an image and a sentence form a matching pair.
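To illustrate how these two objectives can be combined into a single pre-training loss, a hedged sketch follows; the head layouts, the use of the first output position for ITM, and the ignore-index convention for unmasked tokens are assumptions for illustration, not the authors' exact recipe.

```python
# Hedged sketch of the two pre-training objectives described above, applied
# to the joint transformer outputs. Head names and conventions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PretrainHeads(nn.Module):
    def __init__(self, hidden=768, vocab_size=30522):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)  # predict masked text tokens
        self.itm_head = nn.Linear(hidden, 2)           # matched vs. mismatched pair

    def forward(self, joint_feats, mlm_labels, itm_labels, text_len):
        # Masked Language Modeling: cross-entropy over the text positions only;
        # unmasked positions carry the label -100 and are ignored.
        text_out = joint_feats[:, :text_len]
        mlm_loss = F.cross_entropy(
            self.mlm_head(text_out).flatten(0, 1), mlm_labels.flatten(), ignore_index=-100
        )
        # Image-Text Matching: binary classification from the first token's output.
        itm_loss = F.cross_entropy(self.itm_head(joint_feats[:, 0]), itm_labels)
        return mlm_loss + itm_loss
```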
Experimental Results
Extensive experimental evaluations demonstrate that Pixel-BERT matches or surpasses existing models across a variety of vision-and-language tasks. For instance, the model achieves a 2.17-point improvement on the Visual Question Answering (VQA) task over prior state-of-the-art results. Downstream tasks such as image-text retrieval and natural language for visual reasoning (NLVR2) also benefit from Pixel-BERT's pixel-aligned learning approach, with the model achieving state-of-the-art results.
Implications and Potential Developments
By operating directly on pixels, Pixel-BERT can represent visual semantics that fall outside the fixed categories of a detection model, information that region-based features tend to discard. Circumventing these semantic limitations potentially enables better generalization across diverse visual and linguistic contexts.
Future work could explore the scalability of Pixel-BERT by pre-training on larger datasets, such as Conceptual Captions, to refine its alignment capabilities. Additionally, integrating masked visual prediction tasks into the framework could further strengthen cross-modal understanding, improving downstream performance and robustness.
To conclude, Pixel-BERT introduces a promising direction for vision-and-language models by re-examining how visual features are embedded. Through pixel-level alignment and comprehensive pre-training strategies, Pixel-BERT contributes significantly to the ongoing improvement of multi-modal representations in artificial intelligence.