Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval (2309.12110v1)

Published 21 Sep 2023 in cs.CV

Abstract: Recent advances in multimodal image pretraining show that visual models trained with semantically dense textual supervision tend to generalize better than those trained with categorical attributes or through unsupervised techniques. In this work we investigate how the recent CLIP model can be applied to several tasks in the artwork domain. We perform exhaustive experiments on the NoisyArt dataset, a collection of artwork images crawled from public resources on the web. On this dataset CLIP achieves impressive results on (zero-shot) classification and promising results on both artwork-to-artwork and description-to-artwork retrieval.
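The abstract describes using CLIP's shared image-text embedding space for zero-shot artwork classification and for description-to-artwork retrieval. Below is a minimal sketch of that idea using the openai/CLIP package; the image paths, class descriptions, and text query are placeholders for illustration, not the authors' NoisyArt pipeline.

```python
# Sketch: CLIP zero-shot classification and description-to-artwork retrieval.
# Assumes the openai/CLIP package (https://github.com/openai/CLIP) and PyTorch.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical per-artwork textual descriptions (NoisyArt supplies metadata
# crawled from the web; these two strings are placeholders).
class_descriptions = [
    "a photo of the painting 'Mona Lisa' by Leonardo da Vinci",
    "a photo of the painting 'The Starry Night' by Vincent van Gogh",
]

with torch.no_grad():
    # Encode the class descriptions once and L2-normalize them.
    text_tokens = clip.tokenize(class_descriptions, truncate=True).to(device)
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Encode a query image (placeholder path) and L2-normalize it.
    image = preprocess(Image.open("artwork_query.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Zero-shot classification: cosine similarity between the query image
    # and every class description, softmaxed into class probabilities.
    logits = 100.0 * image_features @ text_features.T
    probs = logits.softmax(dim=-1).cpu()
    print("predicted class:", class_descriptions[int(probs.argmax())])

    # Description-to-artwork retrieval: rank a gallery of artwork images by
    # similarity to a free-text description in the same embedding space.
    gallery_paths = ["artwork_0.jpg", "artwork_1.jpg"]  # placeholder gallery
    gallery = torch.cat([
        model.encode_image(preprocess(Image.open(p)).unsqueeze(0).to(device))
        for p in gallery_paths
    ])
    gallery /= gallery.norm(dim=-1, keepdim=True)
    query = model.encode_text(
        clip.tokenize(["a portrait of a woman with an enigmatic smile"]).to(device)
    )
    query /= query.norm(dim=-1, keepdim=True)
    ranking = (query @ gallery.T).argsort(descending=True)
    print("retrieved order:", [gallery_paths[i] for i in ranking[0].tolist()])
```

Artwork-to-artwork retrieval follows the same pattern, with an image embedding used as the query against the gallery embeddings instead of a text embedding.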
