ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter (2305.07490v6)
Abstract: The success of large language models (LLMs) has inspired an emerging research field of multimodal learning. A grand challenge in exploiting LLMs for multimodal learning, however, is their size: pre-trained LLMs typically contain billions of parameters. To tackle this challenge, models such as MiniGPT-4 and LLaVA fine-tune pre-trained models while updating only a small number of parameters. Despite their promising performance, these models remain limited in their understanding of artistic imagery. To facilitate better artistic understanding, we propose ArtGPT-4, a pioneering large vision-language model tailored to address the limitations of existing models in artistic comprehension. The key innovation of ArtGPT-4 lies in its treatment of the sophisticated challenge of artistic image comprehension, setting it apart from models that overlook fine details in favor of broader themes. Specifically, instead of fine-tuning the whole LLM as in existing methods, ArtGPT-4 integrates specialized adapter layers into the LLM, enabling the model to parse and interpret complex visual tokens more efficiently and effectively. ArtGPT-4 is also highly efficient to train: on a single Tesla A100, training completes in a mere 2 hours on an image-text pair dataset of approximately 0.52M entries. Moreover, ArtGPT-4 achieves state-of-the-art performance on the ArtEmis and ArtEmis-v2.0 datasets, as well as on the benchmarks established in this work, lagging behind professional artists' descriptions by a negligible 0.15 points on a 6-point scale. This outstanding performance shows that ArtGPT-4 can describe images with artistic understanding and convey the emotions they inspire, mirroring human interpretation. The code and the pre-trained model are available at https://github.com/DLYuanGod/ArtGPT-4.
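To make the adapter idea concrete, the sketch below shows one common way such layers are wired into a frozen transformer: a residual bottleneck adapter appended to each block, with only the adapter weights left trainable. This is a minimal PyTorch illustration of the general recipe, not ArtGPT-4's exact architecture; `BottleneckAdapter`, `make_parameter_efficient`, the bottleneck width of 64, the zero initialization, and the `llm.transformer.h` attribute path are assumptions made for the example.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project.

    The up-projection is zero-initialized so the adapter starts as an
    identity map and cannot initially disturb the frozen LLM's behavior.
    """

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen block's output intact.
        return hidden + self.up(self.act(self.down(hidden)))


def make_parameter_efficient(llm: nn.Module, blocks, d_model: int):
    """Freeze every LLM weight; attach one trainable adapter per block."""
    for p in llm.parameters():
        p.requires_grad = False  # the billions of LLM weights stay frozen
    adapters = nn.ModuleList(BottleneckAdapter(d_model) for _ in blocks)
    for block, adapter in zip(blocks, adapters):
        original_forward = block.forward

        def forward(x, *args, _f=original_forward, _ad=adapter, **kwargs):
            out = _f(x, *args, **kwargs)
            # Transformer blocks often return tuples (hidden, attn, ...);
            # apply the adapter to the hidden states only.
            if isinstance(out, tuple):
                return (_ad(out[0]),) + out[1:]
            return _ad(out)

        block.forward = forward
    return adapters  # only these parameters go to the optimizer


# Hypothetical usage, assuming a GPT-style module layout:
# adapters = make_parameter_efficient(llm, llm.transformer.h, d_model=4096)
# optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
```

Because only `adapters.parameters()` are handed to the optimizer, gradient and optimizer state scale with the small bottleneck layers rather than the full LLM, which is consistent with the short training time the abstract reports.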
- ArtEmis: Affective language for visual art. CoRR, abs/2101.07396, 2021.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274, 2022.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- VisualGPT: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18030–18040, 2022a.
- AdaptFormer: Adapting vision transformers for scalable visual recognition. arXiv preprint arXiv:2205.13535, 2022b.
- MultiModal-GPT: A vision and language model for dialogue with humans, 2023.
- Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.
- VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, volume 8, pp. 216–225, 2014.
- Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, pp. 709–727. Springer, 2022.
- ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pp. 5583–5594. PMLR, 2021.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597, 2021.
- M6: A Chinese multimodal pretrainer, 2021.
- Visual instruction tuning, 2023.
- Steven Loria et al. TextBlob documentation. Release 0.15, 2(8):269, 2018.
- It is okay to not be okay: Overcoming emotional bias in affective image captioning by contrastive data collection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022. Accessed: May 3, 2023.
- OpenAI. GPT-4 technical report, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- MAR: Masked autoencoders for efficient action recognition. IEEE Transactions on Multimedia, 2023.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- Hierarchical text-conditional image generation with CLIP latents, 2022.
- Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
- LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- Improved ArtGAN for conditional synthesis of natural image and artwork. IEEE Transactions on Image Processing, 28(1):394–409, 2019. doi: 10.1109/TIP.2018.2866698. URL https://doi.org/10.1109/TIP.2018.2866698.
- Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021.
- GIT: A generative image-to-text transformer for vision and language. Technical report, Microsoft, May 2022. URL https://www.microsoft.com/en-us/research/publication/git-a-generative-image-to-text-transformer-for-vision-and-language/.
- AIM: Adapting image models for efficient video action recognition. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CIoSZ_HKHS7.
- mPLUG-Owl: Modularization empowers large language models with multimodality, 2023.
- BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1–9, 2022.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.