Overview of "TransVG: End-to-End Visual Grounding with Transformers"
The paper "TransVG: End-to-End Visual Grounding with Transformers" presents a novel transformer-based approach to the visual grounding problem. Visual grounding involves locating an image region corresponding to a text query, a task commonly associated with referring expression comprehension and phrase localization. This paper introduces TransVG, a concise yet effective framework utilizing transformers to perform end-to-end visual grounding. The proposed method addresses limitations in traditional multi-modal fusion techniques by employing transformers to establish multi-modal correspondences, thereby eschewing the need for complex fusion mechanisms typically seen in existing two-stage and one-stage frameworks.
The authors observe that conventional approaches, which rely on manually designed modules such as modular attention networks and dynamic graphs, are prone to overfitting and permit only restricted interaction between visual and linguistic context because of their rigid structures. This limitation motivates TransVG, which formulates visual grounding as a direct coordinate regression problem. This shift away from candidate-based prediction (via region proposals or anchor boxes) yields a more straightforward and efficient model.
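To make the "direct regression" formulation concrete, the sketch below shows a typical objective for regressing a single box per image, assuming a smooth L1 term combined with a GIoU term (a common pairing for box regression; the paper's exact loss terms and weighting may differ):

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def grounding_loss(pred_boxes, gt_boxes):
    """Direct box-regression objective: smooth L1 + GIoU on one predicted box
    per image. Boxes are (x1, y1, x2, y2), normalized to [0, 1]; the 1:1 loss
    weighting here is an illustrative assumption, not the paper's setting."""
    l1 = F.smooth_l1_loss(pred_boxes, gt_boxes)
    # generalized_box_iou returns a pairwise matrix; the diagonal pairs each
    # prediction with its own ground-truth box.
    giou = generalized_box_iou(pred_boxes, gt_boxes).diagonal().mean()
    return l1 + (1.0 - giou)
```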
TransVG integrates a visual branch, a linguistic branch, and a visual-linguistic fusion module. The visual branch applies a convolutional backbone (ResNet) followed by a visual transformer to produce visual embeddings, while the linguistic branch encodes the text query with a pre-trained BERT model. The embeddings from both branches are fused by a stack of transformer encoder layers in which intra- and inter-modal relations are established through self-attention. A learnable [REG] token is added to the joint token sequence, and the coordinates of the referred object are regressed directly from its output embedding, simplifying the grounding process.
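The sketch below illustrates this three-part layout in PyTorch. The class name, hidden width, layer counts, and the omission of positional embeddings and padding masks are simplifying assumptions for illustration; the learnable [REG] token used for box regression follows the paper's description.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class TransVGSketch(nn.Module):
    """Simplified sketch of the TransVG pipeline; dimensions and layer counts
    are illustrative, not the paper's exact configuration."""

    def __init__(self, d_model=256, fusion_layers=6):
        super().__init__()
        # Visual branch: CNN backbone followed by a transformer encoder.
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=6)
        # Linguistic branch: pre-trained BERT projected to the shared width.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(768, d_model)
        # Visual-linguistic fusion: self-attention over the joint sequence,
        # with a learnable [REG] token for box regression.
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=fusion_layers)
        # Box head: regress normalized box coordinates from the [REG] output.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, images, input_ids, attention_mask):
        # Visual tokens: flatten the spatial grid of the projected feature map.
        feat = self.input_proj(self.backbone(images))        # (B, d, H, W)
        vis_tokens = self.visual_encoder(feat.flatten(2).transpose(1, 2))
        # Linguistic tokens from BERT.
        txt = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        txt_tokens = self.text_proj(txt)                      # (B, L, d)
        # Fuse [REG] + visual + linguistic tokens with self-attention.
        reg = self.reg_token.expand(images.size(0), -1, -1)
        fused = self.fusion(torch.cat([reg, vis_tokens, txt_tokens], dim=1))
        # Normalized box coordinates from the [REG] token's output state.
        return self.box_head(fused[:, 0])
```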
Empirical Evaluation
The empirical evaluation on ReferItGame, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg demonstrates clear gains over prior work. Notably, TransVG achieves a top-1 accuracy of 70.73% on the ReferItGame test set and 79.10% on the Flickr30K Entities test set, improving on existing state-of-the-art methods. These results support the central claim that complex fusion modules can be replaced by transformer encoder layers, yielding a more flexible, scalable, and effective solution to visual grounding.
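The reported top-1 accuracy follows the standard grounding protocol, in which a prediction counts as correct when its IoU with the ground-truth box exceeds 0.5. A minimal sketch of that metric:

```python
from torchvision.ops import box_iou

def top1_accuracy(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Fraction of predictions whose IoU with the ground-truth box exceeds
    the threshold (0.5 by convention). Boxes are (x1, y1, x2, y2), shape (N, 4)."""
    ious = box_iou(pred_boxes, gt_boxes).diagonal()
    return (ious > iou_threshold).float().mean().item()
```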
Theoretical and Practical Implications
The implications of this research are manifold, both theoretically and practically. On a theoretical level, the paper underscores the potential of transformers to handle multi-modal context reasoning in a homogeneous manner, without the need for predefined fusion structures. This points towards a paradigm shift where the emphasis is on flexible attention-based architectures over rigid predefined mechanisms. Practically, TransVG's simplified architecture can enhance computational efficiency and model versatility, potentially leading to broader applicability in real-world interfaces where seamless integration of visual and linguistic inputs is critical.
Future Directions
Building upon this work, future research may experiment with different transformer configurations, such as adjusting the number of fusion layers or exploring more lightweight transformer variants. Scaling the approach to larger and more diverse datasets would further test its robustness across visual grounding scenarios. Finally, integrating the framework with broader vision-language tasks, such as dynamic scene understanding, virtual assistance, and robotics, could contribute to more comprehensive multi-modal AI systems.
In conclusion, TransVG represents an important step in employing transformers for visual grounding, achieving superior performance while keeping the architecture simple. Its results encourage continued exploration of transformer-based models for multi-modal AI challenges.