
A Generative Approach for Wikipedia-Scale Visual Entity Recognition (2403.02041v2)

Published 4 Mar 2024 in cs.CV

Abstract: In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which given an input image learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
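To make the generative idea concrete, below is a minimal, self-contained sketch of constrained auto-regressive decoding over discrete entity codes, in the spirit of the GER framework described above. It is not the authors' implementation: the ENTITY_CODES table, the next_token_logprobs placeholder, and the beam-search routine are all hypothetical stand-ins, assuming each Wikipedia entity is identified by a short token sequence and that decoding is restricted to valid codes via a prefix trie.

```python
# Hedged sketch (not the paper's code): auto-regressively decode a discrete entity
# "code" for a query image, constraining each step to prefixes of real entity codes.

from collections import defaultdict

# Hypothetical mapping from discrete codes to Wikipedia entities (placeholder data).
ENTITY_CODES = {
    (3, 1, 4): "Q146 (house cat)",
    (3, 1, 7): "Q144 (dog)",
    (5, 9, 2): "Q5113 (bird)",
}

def build_trie(codes):
    """Map each code prefix to the set of tokens that can legally follow it."""
    trie = defaultdict(set)
    for code in codes:
        for i in range(len(code)):
            trie[code[:i]].add(code[i])
    return trie

def next_token_logprobs(image, prefix):
    """Placeholder for an image-conditioned decoder p(token | image, prefix).
    A real system would run a transformer over image features; here we return dummy scores."""
    return {t: -float(t) for t in range(10)}

def constrained_beam_search(image, trie, beam_size=2, max_len=3):
    """Beam search that only extends prefixes of valid entity codes."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            allowed = trie.get(prefix, set())  # tokens that keep the code valid
            logps = next_token_logprobs(image, prefix)
            for tok in allowed:
                candidates.append((prefix + (tok,), score + logps[tok]))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
    return beams

trie = build_trie(ENTITY_CODES)
for code, score in constrained_beam_search(image=None, trie=trie):
    print(code, ENTITY_CODES.get(code, "<partial>"), f"logprob={score:.1f}")
```

The trie constraint is one plausible way to guarantee that the decoder only emits codes corresponding to existing entities; the abstract's contrast with dual-encoder k-NN search is that recognition here reduces to a short, discriminative decoding pass rather than a nearest-neighbour lookup over millions of embeddings.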

Authors (4)
  1. Mathilde Caron (25 papers)
  2. Ahmet Iscen (29 papers)
  3. Alireza Fathi (31 papers)
  4. Cordelia Schmid (206 papers)
Citations (4)