Multi-method Integration with Confidence-based Weighting for Zero-shot Image Classification (2405.02155v1)
Abstract: This paper introduces a novel framework for zero-shot learning (ZSL), i.e., recognizing new categories that are unseen during training, by using a multi-model and multi-alignment integration method. Specifically, we propose three strategies to improve ZSL performance: 1) Utilizing the extensive knowledge of ChatGPT and the powerful image generation capabilities of DALL-E to create reference images that can precisely describe unseen categories and classification boundaries, thereby alleviating the information bottleneck issue; 2) Integrating the results of text-image alignment and image-image alignment from CLIP, along with the image-image alignment results from DINO, to achieve more accurate predictions; 3) Introducing an adaptive weighting mechanism based on confidence levels to aggregate the outcomes from the different prediction methods. Experimental results on multiple datasets, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our model significantly improves classification performance compared to single-model approaches, achieving AUROC scores above 96% across all test datasets and surpassing 99% on CIFAR-10.
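The third strategy, confidence-based aggregation of several predictors, can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything below is an assumption made for illustration: confidence is taken to be each method's maximum softmax probability, weights are the normalized confidences, and the similarity scores are made-up numbers standing in for CLIP text-image, CLIP image-image, and DINO image-image alignment outputs. It is not the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_fusion(prob_lists):
    """Fuse per-method class probabilities with confidence-based weights.

    ASSUMPTION: confidence = max class probability of each method, and
    weights are the confidences normalized to sum to 1. The paper's actual
    weighting rule may differ.
    """
    confidences = [max(p) for p in prob_lists]
    total_conf = sum(confidences)
    weights = [c / total_conf for c in confidences]
    n_classes = len(prob_lists[0])
    fused = [
        sum(w * p[k] for w, p in zip(weights, prob_lists))
        for k in range(n_classes)
    ]
    return fused, weights


# Hypothetical similarity scores for one test image over three classes.
clip_text_image = softmax([2.0, 1.0, 0.5])    # stands in for CLIP text-image alignment
clip_image_image = softmax([0.2, 0.3, 0.1])   # stands in for CLIP image-image alignment
dino_image_image = softmax([3.0, 0.5, 0.2])   # stands in for DINO image-image alignment

fused, weights = confidence_weighted_fusion(
    [clip_text_image, clip_image_image, dino_image_image]
)
predicted = max(range(len(fused)), key=fused.__getitem__)
print("weights:", [round(w, 3) for w in weights])
print("fused:", [round(p, 3) for p in fused])
print("predicted class index:", predicted)
```

In this toy setup the near-uniform (low-confidence) predictor receives the smallest weight, so it cannot easily override the two methods that agree with high confidence.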