Diverse and Tailored Image Generation for Zero-shot Multi-label Classification (2404.03144v1)
Abstract: Zero-shot multi-label classification has recently garnered considerable attention for its ability to predict unseen labels without human annotations. Nevertheless, prevailing approaches often use seen classes as imperfect proxies for unseen ones, resulting in suboptimal performance. Drawing inspiration from the success of text-to-image generation models in producing realistic images, we propose an innovative solution: generating synthetic data to construct a training set explicitly tailored for proxyless training on unseen labels. Our approach introduces a novel image generation framework that produces multi-label synthetic images of unseen classes for classifier training. To enhance the diversity of the generated images, we leverage a pre-trained large language model (LLM) to generate diverse prompts. Employing the pre-trained multi-modal CLIP model as a discriminator, we assess whether the generated images accurately represent the target classes, enabling automatic filtering of inaccurately generated images and preserving classifier accuracy. To refine text prompts for more precise and effective multi-label object generation, we introduce a CLIP score-based discriminative loss to fine-tune the text encoder of the diffusion model. Additionally, to enhance visual features on the target task while preserving the generalization of the original features and mitigating the catastrophic forgetting that results from fine-tuning the entire visual encoder, we propose a feature fusion module inspired by transformer attention mechanisms. This module helps capture global dependencies among multiple objects more effectively. Extensive experimental results validate the effectiveness of our approach, demonstrating significant improvements over state-of-the-art methods.
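The CLIP-based filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cosine-similarity threshold, and the toy embeddings standing in for real CLIP image/text features are all assumptions for the sake of the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two sets of vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def filter_images(image_embs, class_embs, targets, threshold=0.25):
    """Keep a generated image only if every one of its intended target
    classes scores above `threshold` under a CLIP-style similarity
    (hypothetical stand-in for the paper's CLIP discriminator)."""
    sims = cosine_sim(image_embs, class_embs)   # (num_images, num_classes)
    return [i for i, labels in enumerate(targets)
            if all(sims[i, c] >= threshold for c in labels)]

# Toy embeddings in place of real CLIP features:
class_embs = np.eye(3)                          # 3 class "text" embeddings
image_embs = np.array([[1.0, 1.0, 0.0],         # depicts classes 0 and 1
                       [1.0, 0.0, 0.0]])        # depicts class 0 only
targets = [[0, 1], [0, 1]]                      # both images *intended* to show 0 and 1
kept = filter_images(image_embs, class_embs, targets)
print(kept)  # image 1 is filtered out: it fails to depict class 1
```

In a real pipeline the embeddings would come from CLIP's image and text encoders, so that images whose CLIP score for any target class falls below the threshold are discarded before classifier training.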
- Kaixin Zhang
- Zhixiang Yuan
- Tao Huang