TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification (2312.14149v4)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually face the problem of coarse alignment, e.g., the vision encoder struggles in localizing an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs. Concretely, given an image and its paired text, we manage to parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. It is noteworthy that the parsing pipeline is fully automatic and thus enjoys good scalability. With these parsed semantics as supervision signals, we can complement the commonly used image-text contrastive loss with the multi-tag classification loss. Extensive experimental results on a broad suite of semantic segmentation datasets substantiate the average 5.2% improvement of our framework over existing alternatives. Furthermore, the visualization results indicate that attribute supervision makes vision-language models accurately localize attribute-specified objects. Project page can be found at https://qinying-liu.github.io/Tag-Align.

Summary

  • The paper introduces TagAlign, which enhances vision-language model alignment by integrating a multi-tag classification loss with standard image-text contrastive training.
  • It employs an automated text parsing method using large language models to extract key object and attribute tags from descriptive texts.
  • Extensive experiments show that TagAlign improves semantic segmentation performance by 3.65% compared to baseline models.

Understanding TagAlign: A Simple Approach to Precise Vision-Language Model Alignment

In the field of AI, vision-language models like CLIP have become remarkably adept at interpreting images and text. Yet these models often align features only coarsely, for instance failing to focus on an object described with specific attributes. To address this, a new framework called TagAlign offers a simple solution that requires no data beyond ordinary image-text pairs.

Tag Parsing with LLMs

At the core of TagAlign is an automated text parsing procedure that employs a large language model (LLM). Given an image and its descriptive caption, the LLM identifies the objects (e.g., cat) and attributes (e.g., black) mentioned in the text, elements that are highly likely to appear in the image. Because the parsing is fully automatic, it yields high-quality tags and scales easily to large image-text corpora.
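
As a rough illustration of what such a parsing step could produce, the sketch below extracts candidate object (noun) and attribute (adjective) tags from a caption using NLTK part-of-speech tagging. The paper's pipeline relies on an LLM, so the tokenizer, tag heuristics, and resource names here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative caption parser: pulls candidate object (noun) and
# attribute (adjective) tags from an image caption. The paper uses an
# LLM-based parser; this NLTK sketch only approximates that behavior.
import nltk

# One-time downloads of the tokenizer and POS-tagger resources
# (resource names may differ across NLTK versions).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def parse_tags(caption: str):
    """Return (objects, attributes) guessed from a caption."""
    tokens = nltk.word_tokenize(caption.lower())
    tagged = nltk.pos_tag(tokens)
    objects = [w for w, pos in tagged if pos.startswith("NN")]     # nouns -> objects
    attributes = [w for w, pos in tagged if pos.startswith("JJ")]  # adjectives -> attributes
    return objects, attributes

print(parse_tags("a black cat sleeping on a wooden chair"))
# e.g. (['cat', 'chair'], ['black', 'wooden'])
```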

Multi-Tag Classification

TagAlign uses the parsed tags as an additional source of supervision. By adding a multi-tag classification loss to training alongside the usual image-text contrastive loss, the model learns to accurately localize the objects and attributes specified in the text, aligning image and text features at a finer granularity.
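
To make the combined objective concrete, here is a minimal PyTorch sketch of how a multi-tag classification term might be added on top of a CLIP-style contrastive loss. The tensor names, the use of binary cross-entropy for the tag term, the shared temperature, and the equal loss weighting are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def tagalign_style_loss(img_emb, txt_emb, tag_emb, tag_targets, temperature=0.07):
    """Contrastive image-text loss plus a multi-tag classification term.

    img_emb:     (B, D) image embeddings
    txt_emb:     (B, D) caption embeddings
    tag_emb:     (T, D) text embeddings of the tag vocabulary (objects/attributes)
    tag_targets: (B, T) multi-hot labels, 1 where a tag was parsed from the caption
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    tag_emb = F.normalize(tag_emb, dim=-1)

    # Standard symmetric InfoNCE over the batch (image-text contrastive loss).
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # Multi-tag classification: treat image-to-tag similarities as logits
    # and supervise them with the parsed multi-hot tag labels.
    tag_logits = img_emb @ tag_emb.t() / temperature
    tag_loss = F.binary_cross_entropy_with_logits(tag_logits, tag_targets.float())

    # Equal weighting of the two terms is an arbitrary choice for this sketch.
    return contrastive + tag_loss
```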

Impressive Experimental Results

Trained only on the Conceptual 12M dataset, TagAlign improves average performance across a broad suite of semantic segmentation benchmarks by 3.65% over existing methods, and it remains superior on benchmarks both with and without a background class.

Visualizing TagAlign's Efficacy

Weighted similarity maps make it possible to see which image regions TagAlign attends to for a given text description. Whereas the CLIP baseline sometimes highlights the background, TagAlign concentrates on the specified objects, illustrating the improved precision of the method.
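
As an illustration of how such a map could be produced, the sketch below computes a per-patch cosine similarity between CLIP-style patch features and a text query, then upsamples it to image resolution. The feature shapes, bilinear upsampling, and min-max normalization are assumptions for this sketch, not the paper's exact visualization procedure.

```python
import torch
import torch.nn.functional as F

def similarity_map(patch_feats, text_emb, grid_hw, image_hw):
    """Cosine similarity between each image patch and a text query.

    patch_feats: (N, D) per-patch features from the vision encoder
    text_emb:    (D,)   embedding of a query such as "black cat"
    grid_hw:     (h, w) patch grid size, with h * w == N
    image_hw:    (H, W) output resolution for overlaying on the image
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    sim = patch_feats @ text_emb                       # (N,) per-patch similarity
    sim = sim.reshape(1, 1, *grid_hw)                  # arrange as a 2D map
    sim = F.interpolate(sim, size=image_hw, mode="bilinear", align_corners=False)
    sim = (sim - sim.min()) / (sim.max() - sim.min() + 1e-6)  # min-max normalize
    return sim[0, 0]                                   # (H, W) heat map in [0, 1]
```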

The Big Picture

TagAlign marks a meaningful step toward more precise alignment in vision-language models. It leverages advances in language processing to extract semantics from text and uses them to guide the visual encoder toward the described elements. That such fine-grained relations can be learned from image-text pairs alone speaks to the effectiveness of the design and to the potential of systems that learn more from less. By incorporating attribute-level supervision, TagAlign extends the capabilities of vision-language embedding models and paves the way for more robust systems across a wide range of real-world applications.
