TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification (2312.14149v4)
Abstract: The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing approaches often suffer from coarse alignment, e.g., the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features, requiring no data formats beyond image-text pairs. Concretely, given an image and its paired text, we parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. Notably, the parsing pipeline is fully automatic and thus scales well. With these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experiments on a broad suite of semantic segmentation datasets demonstrate an average 5.2% improvement of our framework over existing alternatives. Furthermore, visualization results indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects. Project page: https://qinying-liu.github.io/Tag-Align.
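The combined objective described above — a standard image-text contrastive loss complemented by a multi-label classification loss over parsed tags — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of a binary cross-entropy tag loss, the temperature values, and the weighting `lam` are all assumptions for the sketch, which presumes CLIP-style image/text embeddings and a fixed tag vocabulary with one embedding per tag.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style InfoNCE: matched image-text pairs are positives,
    all other pairs in the batch are negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def multi_tag_loss(img_emb, tag_emb, tag_targets, temperature=0.07):
    """Multi-label classification over a tag vocabulary.
    tag_targets is a binary (batch, num_tags) matrix marking which
    objects/attributes were parsed from each image's caption."""
    img_emb = F.normalize(img_emb, dim=-1)
    tag_emb = F.normalize(tag_emb, dim=-1)
    logits = img_emb @ tag_emb.t() / temperature
    return F.binary_cross_entropy_with_logits(logits, tag_targets)

def combined_objective(img_emb, txt_emb, tag_emb, tag_targets, lam=1.0):
    """Contrastive loss plus weighted multi-tag classification loss."""
    return (contrastive_loss(img_emb, txt_emb)
            + lam * multi_tag_loss(img_emb, tag_emb, tag_targets))
```

In this sketch the tag supervision is free: the binary `tag_targets` matrix comes from automatically parsing nouns and adjectives out of the caption, so no annotation beyond the original image-text pairs is needed.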