Open-Set Image Tagging with Multi-Grained Text Supervision (2310.15200v2)

Published 23 Oct 2023 in cs.CV

Abstract: In this paper, we introduce the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. Previous approaches (e.g., CLIP) primarily utilize global text supervision paired with images, leading to sub-optimal performance in recognizing multiple individual semantic tags. In contrast, RAM++ seamlessly integrates individual tag supervision with global text supervision, all within a unified alignment framework. This integration not only ensures efficient recognition of predefined tag categories, but also enhances generalization capabilities for diverse open-set categories. Furthermore, RAM++ employs LLMs to convert semantically constrained tag supervision into more expansive tag description supervision, thereby enriching the scope of open-set visual description concepts. Comprehensive evaluations on various image recognition benchmarks demonstrate RAM++ exceeds existing state-of-the-art (SOTA) open-set image tagging models on most aspects. Specifically, for predefined commonly used tag categories, RAM++ showcases 10.2 mAP and 15.4 mAP enhancements over CLIP on OpenImages and ImageNet. For open-set categories beyond predefined, RAM++ records improvements of 5.0 mAP and 6.4 mAP over CLIP and RAM respectively on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark. Code, datasets and pre-trained models are available at https://github.com/xinyu1205/recognize-anything.

Authors (9)

Xinyu Huang
Yi-Jie Huang
Youcai Zhang
Weiwei Tian
Rui Feng
Yuejie Zhang
Yanchun Xie
Yaqian Li
Lei Zhang

Summary

The paper introduces the Recognize Anything Plus Model (RAM++), an open-set image tagging model trained with multi-grained text supervision. It addresses a limitation of prior models such as CLIP, which rely mainly on global text supervision paired with images and therefore perform sub-optimally when recognizing multiple individual tags. By combining individual tag supervision with global text supervision in a single alignment framework, RAM++ improves recognition of predefined tag categories and generalizes better to diverse open-set categories.

Key Contributions

Unified Alignment Framework: RAM++ integrates image-tag-text triplets within a unified alignment framework. This involves leveraging image-text and image-tag alignments concurrently through a shared alignment decoder. Such an integration is pivotal for improving tagging accuracy on both predefined and open-set categories.
LLM-Based Tag Description: The model utilizes LLMs to expand tag supervision into descriptive tag supervision. This transformation enhances the model's capability to perceive a broader scope of visual concepts, critical for open-set recognition.
State-of-the-Art Performance: RAM++ outperforms existing open-set image tagging models on most of the evaluated benchmarks. For predefined common categories, it surpasses CLIP by 10.2 mAP on OpenImages and 15.4 mAP on ImageNet. For open-set categories on OpenImages, it improves on CLIP by 5.0 mAP and on RAM by 6.4 mAP. For human-object interaction phrases, it reports improvements of 7.8 mAP and 4.7 mAP on the HICO benchmark.
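The results above are all reported in mAP, the mean over categories of per-category average precision. A minimal sketch of that metric follows, assuming the common non-interpolated definition of average precision; the paper's exact evaluation protocol may differ, and the function names here are illustrative rather than taken from the RAM++ code.

```python
def average_precision(scores, labels):
    """AP for one category: mean of the precision at each positive's rank."""
    # Rank images from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_matrix, label_matrix):
    """score_matrix[i][c] is the score of image i for category c."""
    num_categories = len(score_matrix[0])
    aps = []
    for c in range(num_categories):
        scores = [row[c] for row in score_matrix]
        labels = [row[c] for row in label_matrix]
        # Skip categories with no positive image (assumption: AP is undefined there).
        if any(labels):
            aps.append(average_precision(scores, labels))
    return sum(aps) / len(aps) if aps else 0.0
```

Because mAP averages over categories rather than over images, rare categories count as much as frequent ones, which is why it is a natural fit for multi-label tagging.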

Technical Innovations

Multi-Grained Text Supervision: RAM++ trains on both global text supervision and individual tag supervision, which helps when an image contains several distinct concepts that a single global caption does not fully cover.
Efficient Alignment Decoder: The alignment decoder, shared between image-text and image-tag alignment, is designed to keep recognition efficient as the number of categories grows.
Automatic Re-weighting Mechanism: When a tag has multiple LLM-generated descriptions, this mechanism combines them by weighting each description according to its relevance to the image features.
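The re-weighting idea can be illustrated with a short sketch. It assumes that relevance is measured by cosine similarity between an image feature and each description embedding, that the similarities are turned into weights with a softmax, and that the temperature is a fixed constant; these choices and all names below are assumptions for illustration, not the paper's confirmed implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def softmax(values):
    # Subtract the maximum for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_descriptions(image_feature, description_embeddings, temperature=0.07):
    """Combine several description embeddings of one tag into a single embedding.

    Each description is weighted by its similarity to the image feature.
    The temperature value is a hypothetical default.
    """
    img = l2_normalize(image_feature)
    descs = [l2_normalize(d) for d in description_embeddings]
    sims = [dot(img, d) / temperature for d in descs]
    weights = softmax(sims)
    tag_embedding = [
        sum(w * d[i] for w, d in zip(weights, descs)) for i in range(len(img))
    ]
    return weights, tag_embedding


weights, tag_embedding = reweight_descriptions(
    image_feature=[1.0, 0.2],
    description_embeddings=[[0.9, 0.1], [0.1, 0.9]],
)
```

In this toy call the first description points in nearly the same direction as the image feature, so it receives most of the weight and dominates the combined tag embedding.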

Implications and Future Directions

RAM++ is most relevant to applications that need robust open-set recognition, where the categories of interest are not fixed in advance. Its use of LLM knowledge during training, rather than only at inference, may encourage further research on models that combine visual and textual supervision more closely.

Looking forward, scaling up the training data could improve RAM++ further, particularly on rare categories. The trade-off between alignment efficiency and recognition performance also remains an open question for open-set recognition.

Overall, RAM++ offers an effective approach to open-set image tagging based on multi-grained text supervision, and its combination of a unified alignment framework with LLM-generated tag descriptions gives later image tagging and recognition models a concrete design to build on.

GitHub

GitHub - xinyu1205/recognize-anything: Open-source and strong foundation image recognition models. (2,846 stars)