
Multilingual Autoregressive Entity Linking (2103.12528v1)

Published 23 Mar 2021 in cs.CL, cs.AI, and stat.ML

Abstract: We present mGENRE, a sequence-to-sequence system for the Multilingual Entity Linking (MEL) problem -- the task of resolving language-specific mentions to a multilingual Knowledge Base (KB). For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token in an autoregressive fashion. The autoregressive formulation allows us to effectively cross-encode mention string and entity names to capture more interactions than the standard dot product between mention and entity vectors. It also enables fast search within a large KB even for mentions that do not appear in mention tables and with no need for large-scale vector indices. While prior MEL works use a single representation for each entity, we match against entity names of as many languages as possible, which allows exploiting language connections between source input and target name. Moreover, in a zero-shot setting on languages with no training data at all, mGENRE treats the target language as a latent variable that is marginalized at prediction time. This leads to over 50% improvements in average accuracy. We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks where mGENRE establishes new state-of-the-art results. Code and pre-trained models at https://github.com/facebookresearch/GENRE.

Insights on "Multilingual Autoregressive Entity Linking"

The paper presents mGENRE, a sequence-to-sequence model for the Multilingual Entity Linking (MEL) task: resolving entity mentions in texts across different languages to a multilingual Knowledge Base (KB) such as Wikidata. Unlike traditional bi-encoder methods, mGENRE predicts the target entity's name token-by-token in an autoregressive fashion. This formulation cross-encodes the mention string and entity name, capturing richer interactions than the standard dot product between mention and entity vectors.
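The ranking idea behind this formulation can be illustrated with a toy sketch: a candidate entity name is scored by summing the log-probabilities of its tokens generated left-to-right, and the highest-scoring name wins. The probability table below is purely illustrative (a real system would use a multilingual encoder-decoder such as mBART); only the scoring logic reflects the autoregressive approach.

```python
import math

# Toy stand-in for a seq2seq decoder: maps a generated prefix to a
# distribution over next tokens, conditioned (implicitly) on the mention.
# The numbers are made up for illustration only.
def next_token_logprobs(mention, prefix):
    table = {
        (): {"Paris": math.log(0.6), "Parma": math.log(0.4)},
        ("Paris",): {"(city)": math.log(0.7), "(myth)": math.log(0.3)},
        ("Parma",): {"(city)": math.log(1.0)},
    }
    return table[tuple(prefix)]

def score_name(mention, name_tokens):
    """Log-probability of generating an entity name token-by-token,
    as an autoregressive decoder would when emitting it left-to-right."""
    total, prefix = 0.0, []
    for tok in name_tokens:
        total += next_token_logprobs(mention, prefix)[tok]
        prefix.append(tok)
    return total

candidates = [["Paris", "(city)"], ["Paris", "(myth)"], ["Parma", "(city)"]]
best = max(candidates, key=lambda n: score_name("paris est belle", n))
# "Paris (city)" ranks highest: log(0.6) + log(0.7) beats both alternatives
```

Because the mention conditions every decoding step, the score reflects fine-grained interactions between mention and name rather than a single fixed entity vector.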

Key Contributions

mGENRE's methodology is distinct in several ways:

  1. Autoregressive Sequence-to-Sequence Model: The model cross-encodes mention strings and candidate entity names, capturing intricate interactions that dot-product scoring overlooks.
  2. Multilingual KB Representation: By matching against entity names in as many languages as possible, mGENRE can exploit language connections between source input and target name, which is particularly beneficial in zero-shot scenarios where no training data exists for a language.
  3. Marginalization Objective: mGENRE treats the target language as a latent variable and marginalizes over it at prediction time. This yields over 50% improvement in average accuracy in zero-shot settings.
  4. Efficiency and Storage: The model searches the KB with constrained decoding rather than the large-scale dense vector indices traditional systems require, maintaining a compact memory footprint (~2.2 GB for ~89M entity names).
  5. Comprehensive Evaluation: The paper evaluates mGENRE on three popular MEL benchmarks (Mewsli-9, TR2016-hard, and TAC-KBP2015), establishing new state-of-the-art results with significant micro- and macro-average accuracy gains across languages.
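The efficiency point rests on constrained decoding: valid entity names are stored in a prefix trie, and at each step the decoder may only emit tokens that extend some name in the KB. A minimal sketch of such a trie (the token granularity and end marker here are simplified assumptions; real systems operate over subword tokens):

```python
class PrefixTrie:
    """Prefix trie over tokenized entity names. During constrained
    decoding, the decoder may only emit tokens returned by allowed_next,
    so every completed sequence is guaranteed to be a real KB entry --
    no dense vector index or nearest-neighbour search is needed."""
    END = "</s>"  # marks a complete name

    def __init__(self, names):
        self.root = {}
        for tokens in names:
            node = self.root
            for tok in list(tokens) + [self.END]:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix):
        """Tokens that may legally follow the generated prefix."""
        node = self.root
        for tok in prefix:
            node = node.get(tok, {})
        return set(node)

names = [["New", "York"], ["New", "York", "Times"], ["Newark"]]
trie = PrefixTrie(names)
trie.allowed_next([])               # {"New", "Newark"}
trie.allowed_next(["New", "York"])  # {"Times", "</s>"} -- stop or continue
```

Because the trie stores only token sequences, its footprint grows with the text of the names rather than with dense embedding dimensions, which is what keeps ~89M names in a few gigabytes.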

Implications and Future Directions

The autoregressive formulation makes mGENRE notably flexible for entity linking across languages. Because it links by predicting entity names, it can adapt to unseen languages and entities through lexical overlap, transliteration, or translation. This adaptability matters for applications in multilingual contexts, such as global customer service platforms, international news aggregation, and cross-lingual information retrieval.
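The zero-shot behaviour described above hinges on marginalizing the target language: an entity's score combines the probabilities of generating its name in every available language, via a log-sum-exp. A small sketch under assumed, made-up per-language scores:

```python
import math

def marginalized_score(lang_logprobs):
    """log P(entity | mention) = log sum over languages l of
    P(name_l, l | mention): the target language is treated as a
    latent variable and summed out at prediction time.
    Computed stably via the log-sum-exp trick."""
    m = max(lang_logprobs.values())
    return m + math.log(sum(math.exp(v - m) for v in lang_logprobs.values()))

# Hypothetical scores for generating one entity's name in each language:
scores = {"en": math.log(0.2), "fr": math.log(0.3), "de": math.log(0.1)}
marginalized_score(scores)  # = log(0.2 + 0.3 + 0.1) = log(0.6)
```

Summing rather than taking the best single language lets evidence from several related languages accumulate, which is why the gains are largest for languages with no training data.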

Future developments may include refining mGENRE's strategies for rare entity mentions and further optimizing its performance on high-frequency entities. Additionally, handling different dialects and extending the model to incorporate detailed entity descriptions could provide further enhancements.

Overall, mGENRE embodies a shift toward more adaptive, language-agnostic models, setting a foundation for future research in multilingual natural language processing. By addressing both theoretical and practical limitations of prior methods, it represents a significant stride toward more nuanced entity linking in a globally interconnected digital landscape.

Authors (10)
  1. Nicola De Cao
  2. Ledell Wu
  3. Kashyap Popat
  4. Mikel Artetxe
  5. Naman Goyal
  6. Mikhail Plekhanov
  7. Luke Zettlemoyer
  8. Nicola Cancedda
  9. Sebastian Riedel
  10. Fabio Petroni
Citations (83)