
A Generative Approach for Wikipedia-Scale Visual Entity Recognition (2403.02041v2)

Published 4 Mar 2024 in cs.CV

Abstract: In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which given an input image learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
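To make the generative idea concrete, below is a minimal, self-contained sketch of constrained auto-regressive decoding over discrete entity codes, in the spirit of the GER framework described above. It is not the authors' implementation: the ENTITY_CODES table, the next_token_logprobs placeholder, and the beam-search routine are all hypothetical stand-ins, assuming each Wikipedia entity is identified by a short token sequence and that decoding is restricted to valid codes via a prefix trie.

```python
# Hedged sketch (not the paper's code): auto-regressively decode a discrete entity
# "code" for a query image, constraining each step to prefixes of real entity codes.

from collections import defaultdict

# Hypothetical mapping from discrete codes to Wikipedia entities (placeholder data).
ENTITY_CODES = {
    (3, 1, 4): "Q146 (house cat)",
    (3, 1, 7): "Q144 (dog)",
    (5, 9, 2): "Q5113 (bird)",
}

def build_trie(codes):
    """Map each code prefix to the set of tokens that can legally follow it."""
    trie = defaultdict(set)
    for code in codes:
        for i in range(len(code)):
            trie[code[:i]].add(code[i])
    return trie

def next_token_logprobs(image, prefix):
    """Placeholder for an image-conditioned decoder p(token | image, prefix).
    A real system would run a transformer over image features; here we return dummy scores."""
    return {t: -float(t) for t in range(10)}

def constrained_beam_search(image, trie, beam_size=2, max_len=3):
    """Beam search that only extends prefixes of valid entity codes."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            allowed = trie.get(prefix, set())  # tokens that keep the code valid
            logps = next_token_logprobs(image, prefix)
            for tok in allowed:
                candidates.append((prefix + (tok,), score + logps[tok]))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_size]
    return beams

trie = build_trie(ENTITY_CODES)
for code, score in constrained_beam_search(image=None, trie=trie):
    print(code, ENTITY_CODES.get(code, "<partial>"), f"logprob={score:.1f}")
```

The trie constraint is one plausible way to guarantee that the decoder only emits codes corresponding to existing entities; the abstract's contrast with dual-encoder k-NN search is that recognition here reduces to a short, discriminative decoding pass rather than a nearest-neighbour lookup over millions of embeddings.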

Authors (4)
  1. Mathilde Caron (25 papers)
  2. Ahmet Iscen (29 papers)
  3. Alireza Fathi (31 papers)
  4. Cordelia Schmid (206 papers)
Citations (4)