Exploring External Knowledge for Accurate modeling of Visual and Language Problems

Published 27 Jan 2023 in cs.CV and cs.CL | (2302.08901v1)

Abstract: The interest in AI and its applications has seen unprecedented growth in the last few years. The success can be partly attributed to the advancements of deep neural networks made in the sub-fields of AI such as Computer Vision (CV) and NLP. The promising research area that this dissertation focuses on is visual and language understanding which involves many challenging tasks, i.e., classification, detection, segmentation, machine translation and captioning, etc. The state-of-the-art methods for solving these problems usually involves only two parts: source data and target labels, which is rather insufficient especially when the dataset is small. Meanwhile, many external tools or sources can provide extra useful information (external knowledge) that can help improve the performance of these methods. For example, a detection model has been applied to provide better object features than state-of-the-art ResNet for image captioning models. Inspired by this observation, we developed a methodology that we can first extract external knowledge and then integrate it with the original models. The external knowledge has to be extracted from the dataset, or can directly come from external, e.g., grammar rules or scene graphs. We apply this methodology to different AI tasks, including machine translation and image captioning and improve the original state-of-the-art models by a large margin.