
K-LITE: Learning Transferable Visual Models with External Knowledge (2204.09222v2)

Published 20 Apr 2022 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, due to the broad concept coverage achieved via large-scale data collection process. Alternatively, we argue that learning with external knowledge is a promising way which leverages a much more structured source of supervision and offers sample efficiency. We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts. In evaluation, the text is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is available at https://github.com/microsoft/klite.

K-LITE: Leveraging External Knowledge for Enhanced Transferability in Visual Models

This paper introduces K-LITE, a method for learning visual models that leverage external knowledge to improve transferability across computer vision tasks. Traditional supervised learning approaches are limited by their dependence on a fixed set of concepts, often yielding models that excel at a specific task but transfer poorly to novel datasets covering different concepts. To address these limitations, the authors propose a strategy that enriches visual models with structured external knowledge, improving both zero-shot and few-shot learning capabilities.

Key Contributions

  1. Knowledge Augmentation Strategy: K-LITE enhances both image classification and object detection models by integrating external knowledge from sources such as WordNet and Wiktionary. During training, entities in the text supervision are enriched with knowledge components; during evaluation, the same knowledge-augmented text is matched against the learned image representations for zero-shot or few-shot tasks.
  2. Task-Level Transfer Learning: The paper focuses on improving task-level transfer learning rather than class-level, demonstrating substantial improvements in transferring learned models to new datasets with unseen categories.
  3. Empirical Validation: The paper presents extensive empirical results, benchmarking K-LITE on 20 image classification datasets and 13 object detection datasets. The results indicate that external knowledge significantly enhances model transferability, enabling efficient learning from fewer pre-training samples than baseline models require.
  4. Modularized Architecture: To address potential inconsistencies between training and evaluation conditions due to incomplete knowledge bases, a modular approach using adapters is introduced. This ensures that models can toggle between knowledge-augmented and traditional modes, enhancing adaptability to varying downstream tasks.
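
The enrichment step described in the first contribution can be illustrated with a minimal sketch. Note that the `KNOWLEDGE` table, the `enrich` function, and the prompt template below are hypothetical stand-ins; the actual method performs WordNet and Wiktionary lookups over entities extracted from the text supervision.

```python
# Toy stand-in for a WordNet/Wiktionary knowledge base (hypothetical entries).
KNOWLEDGE = {
    "tench": "a freshwater fish of the carp family",
    "abacus": "a calculating device with beads that slide on rods",
}

def enrich(class_name: str, template: str = "a photo of a {}.") -> str:
    """Append the external definition of a concept to its text prompt,
    falling back to the plain prompt when no knowledge is available."""
    prompt = template.format(class_name)
    definition = KNOWLEDGE.get(class_name)
    if definition:
        prompt += f" {class_name}: {definition}"
    return prompt
```

The same enrichment is applied at both training and evaluation time, which is what lets a rare concept name be grounded in more common vocabulary from its definition.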

Insights and Implications

  • Conceptual Overlap and Transfer Performance: An essential finding is that external knowledge bridges the gap between pre-training and evaluation datasets by increasing conceptual overlap. Because knowledge sources describe concepts in broad, commonly understood terms, concepts that are rare or unseen during training can still benefit from them during evaluation.
  • Sample Efficiency: Integrating external knowledge improves not only performance but also sample efficiency. K-LITE models demonstrate competitive performance while using only a fraction of the training data required by prior models such as UniCL.
  • Challenges and Future Directions: The research identifies areas for future exploration, including improving the coverage and quality of external knowledge sources and better aligning these with specific tasks. Addressing knowledge sparsity and improving task-specific explanations remain open challenges.

In summary, K-LITE advances the state of the art in visual model transferability by strategically incorporating structured external knowledge, achieving improved zero-shot and few-shot performance. The paper underscores the promise of enriching language-based visual supervision with external semantic structures, setting the stage for future developments in AI that seek to merge data efficiency with model adaptability.
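
The zero-shot transfer described above reduces, at inference time, to matching an image embedding against per-class text embeddings built from knowledge-augmented prompts. A minimal sketch of that matching step, assuming precomputed embeddings (the actual encoders and training objective follow the paper's contrastive setup):

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, class_text_embs: np.ndarray) -> int:
    """Return the index of the class whose text embedding (e.g. built from a
    knowledge-augmented prompt) is most similar to the image embedding,
    measured by cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```

Because classification is just nearest-text matching, swapping in a new dataset only requires encoding its (augmented) class descriptions, with no retraining.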

Authors (14)
  1. Sheng Shen (68 papers)
  2. Chunyuan Li (122 papers)
  3. Xiaowei Hu (54 papers)
  4. Jianwei Yang (93 papers)
  5. Yujia Xie (29 papers)
  6. Pengchuan Zhang (58 papers)
  7. Zhe Gan (135 papers)
  8. Lijuan Wang (133 papers)
  9. Lu Yuan (130 papers)
  10. Ce Liu (51 papers)
  11. Kurt Keutzer (200 papers)
  12. Trevor Darrell (324 papers)
  13. Anna Rohrbach (53 papers)
  14. Jianfeng Gao (344 papers)
Citations (78)