
Image-Text Pre-Training for Logo Recognition (2309.10206v1)

Published 18 Sep 2023 in cs.CV

Abstract: Open-set logo recognition is commonly solved by first detecting possible logo regions and then matching the detected parts against an ever-evolving dataset of cropped logo images. The matching model, a metric learning problem, is especially challenging for logo recognition due to the mixture of text and symbols in logos. We propose two novel contributions to improve the matching model's performance: (a) using image-text paired samples for pre-training, and (b) an improved metric learning loss function. A standard paradigm of fine-tuning ImageNet pre-trained models fails to discover the text sensitivity necessary to solve the matching problem effectively. This work demonstrates the importance of pre-training on image-text pairs, which significantly improves the performance of a visual embedder trained for the logo retrieval task, especially for more text-dominant classes. We construct a composite public logo dataset combining LogoDet3K, OpenLogo, and FlickrLogos-47, which we call OpenLogoDet3K47. We show that the same vision backbone pre-trained on image-text data, when fine-tuned on OpenLogoDet3K47, achieves $98.6\%$ recall@1, significantly improving performance over pre-training on ImageNet1K ($97.6\%$). We generalize the ProxyNCA++ loss function to propose ProxyNCAHN++, which incorporates class-specific hard negative images. The proposed method sets a new state of the art on the five public logo datasets considered, with a $3.5\%$ zero-shot recall@1 improvement on the LogoDet3K test set, $4\%$ on OpenLogo, $6.5\%$ on FlickrLogos-47, $6.2\%$ on Logos In The Wild, and $0.6\%$ on BelgaLogo.
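The abstract describes extending ProxyNCA++ with class-specific hard negatives (ProxyNCAHN++). The paper's exact formulation is not reproduced here, so the following is only a minimal PyTorch sketch of the general idea, assuming a standard proxy-based softmax loss; the class name `ProxyNCAHardNegSketch`, the `hard_neg_proxies` argument, and the temperature value are illustrative assumptions, not the authors' definition.

```python
import torch
import torch.nn.functional as F


class ProxyNCAHardNegSketch(torch.nn.Module):
    """Sketch of a ProxyNCA++-style proxy loss extended with class-specific
    hard negatives (an approximation of the ProxyNCAHN++ idea, not the
    paper's exact formulation)."""

    def __init__(self, num_classes: int, embed_dim: int, temperature: float = 0.1):
        super().__init__()
        # One learnable proxy vector per logo class.
        self.proxies = torch.nn.Parameter(torch.randn(num_classes, embed_dim) * 0.01)
        self.temperature = temperature

    def forward(self, embeddings, labels, hard_neg_proxies=None):
        # L2-normalise embeddings and proxies, as in ProxyNCA++.
        x = F.normalize(embeddings, dim=-1)
        p = F.normalize(self.proxies, dim=-1)

        # Squared Euclidean distances to every class proxy become logits.
        logits = -torch.cdist(x, p) ** 2 / self.temperature

        if hard_neg_proxies is not None:
            # Append embeddings of class-specific hard negative images
            # (e.g. visually similar logos of other brands) as extra
            # negative "proxies" that only enlarge the softmax denominator.
            hn = F.normalize(hard_neg_proxies, dim=-1)
            hn_logits = -torch.cdist(x, hn) ** 2 / self.temperature
            logits = torch.cat([logits, hn_logits], dim=1)

        # Softmax over all proxies, with the true-class proxy as the target.
        return F.cross_entropy(logits, labels)


# Usage sketch: 8 embeddings of dim 512, 100 classes, 20 hard negative images.
if __name__ == "__main__":
    loss_fn = ProxyNCAHardNegSketch(num_classes=100, embed_dim=512)
    emb = torch.randn(8, 512)
    labels = torch.randint(0, 100, (8,))
    hard_negs = torch.randn(20, 512)
    print(loss_fn(emb, labels, hard_negs))
```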
