Text Mention Embedding (TME) Overview

Updated 21 May 2026
  • Text Mention Embedding (TME) is a semi-parametric approach that embeds and indexes entity mentions as dense vectors to integrate large-scale textual knowledge into Transformer models.
  • It constructs a global memory table from approximately 150 million Wikipedia-linked spans, allowing new entity mentions to be appended without retraining.
  • TME relies on a dedicated MemoryAttention mechanism within TOME architectures and reports strong results on knowledge-intensive tasks such as FEVER and TriviaQA.

Text Mention Embedding (TME) is a semi-parametric approach to integrating large-scale textual knowledge into deep language models, particularly Transformers, by representing and indexing every entity mention in a corpus as a dense vector in a memory table. This method supports reasoning over disparate sources of entity-linked information and yields strong performance on knowledge-intensive NLP tasks. The technique is characterized by the construction of a global mention memory, specialized attention layers for memory retrieval, and a hybrid training regime that combines masked language modeling with entity-centric losses (Jong et al., 2021).

1. Mention Extraction and Embedding Construction

TME begins with mention identification in a large raw corpus. Standard Named Entity Recognition (NER) and Wikipedia-linking procedures are applied to extract entity mentions, retaining only those "grounded" to a Wikipedia page. Each mention is marked by inserting two special tokens, $[E_{start}]$ and $[E_{end}]$, around the mention span. For a passage $x_1,\ldots,x_T$ and a mention covering $x_s,\ldots,x_e$, the rewritten sequence is:

$$\ldots, [E_{start}], x_s,\ldots,x_e, [E_{end}], \ldots$$

These markers expand each mention span, with their contextualized token vectors ($H_s$, $H_e$) serving as the feature basis for embedding construction.
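The marking step can be illustrated with a minimal sketch. The function name `mark_mentions`, the literal marker strings, and the toy sentence are assumptions made for illustration; spans are taken to be inclusive `(s, e)` token indices that do not overlap.

```python
def mark_mentions(tokens, spans, start_tok="[E_start]", end_tok="[E_end]"):
    """Insert marker tokens around each mention span.

    `spans` holds inclusive (s, e) token indices and is assumed
    to contain no overlapping mentions.
    """
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    marked = []
    for i, tok in enumerate(tokens):
        if i in starts:
            marked.append(start_tok)  # opens the mention
        marked.append(tok)
        if i in ends:
            marked.append(end_tok)    # closes the mention
    return marked


tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw", "."]
spans = [(0, 1), (5, 5)]  # two hypothetical linked mentions
marked = mark_mentions(tokens, spans)
print(marked)
```

In a real pipeline the spans would come from the NER and entity-linking stage, and the markers would be added to the tokenizer vocabulary as special tokens.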

A BERT-Base-style Transformer (hidden size $d=768$) processes the marked sequence, yielding $H \in \mathbb{R}^{T \times d}$. Each mention $m=(s,e)$ is projected via two distinct span-projection layers to obtain a key embedding $k_m$ and a value embedding $v_m$ as:

$$k_m = W_K [H_s; H_e], \qquad v_m = W_V [H_s; H_e]$$

Here, $[\cdot\,;\cdot]$ denotes concatenation, and $W_K$ and $W_V$ are independent learnable weights. For auxiliary coreference pretraining, an additional span-projection $W_C$ produces a representation $c_m = W_C [H_s; H_e]$.
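As a rough illustration of the span projection, the sketch below concatenates the hidden vectors at the two marker positions and multiplies by a weight matrix. It assumes the projection is a plain linear map over the concatenation; the helper names, the toy dimensions, and the weight values are all hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def span_projection(W, H, s, e):
    """Project the concatenation [H_s; H_e] with weight matrix W."""
    return matvec(W, H[s] + H[e])  # list '+' concatenates the two vectors


# Toy hidden states: T = 3 positions, d = 2 (the text uses d = 768).
H = [[1, 0], [0, 1], [2, 2]]

# Hypothetical weights with shapes (d_K x 2d) and (d_V x 2d).
W_K = [[1, 0, 0, 0], [0, 0, 0, 1]]
W_V = [[1, 1, 1, 1]]

s, e = 0, 2  # marker positions of one mention
k_m = span_projection(W_K, H, s, e)  # key embedding
v_m = span_projection(W_V, H, s, e)  # value embedding
print(k_m, v_m)
```

Keeping the key and value weights separate lets the key space be shaped for retrieval while the value space carries the content that is read back into the model.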

2. Global Mention Memory Structure

The mention encoder is run across approximately 150 million Wikipedia-linked spans, producing global memory tables:

  • $\mathrm{MemKey} \in \mathbb{R}^{N \times d_K}$, the key embeddings
  • $\mathrm{MemValue} \in \mathbb{R}^{N \times d_V}$, the value embeddings
  • $\mathrm{MemEnt}$, a length-$N$ index mapping memory rows to Wikipedia entity IDs

Here, $N \approx 150$ million, and $d_K$ and $d_V$ are the key and value dimensions. The memory is semi-parametric: new mentions can be added by simply appending their keys and values after encoding, with no need for retraining. These static tables are never altered during downstream pretraining or fine-tuning; augmentation with new mentions requires only a forward pass through the mention encoder.
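The append-only behaviour of the memory can be sketched with three parallel lists. The class name `MentionMemory`, its methods, and the example entity IDs are assumptions for illustration; a memory at the scale described here would live in a sharded array store behind a nearest-neighbour index rather than in Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class MentionMemory:
    """Toy mention memory: row i of each list describes one mention."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    entity_ids: list = field(default_factory=list)

    def append(self, key, value, entity_id):
        """Add one encoded mention; existing rows are left untouched."""
        self.keys.append(list(key))
        self.values.append(list(value))
        self.entity_ids.append(entity_id)
        return len(self.keys) - 1  # row index of the new mention

    def __len__(self):
        return len(self.keys)


memory = MentionMemory()
memory.append([0.1, 0.9], [1.0, 0.0], "Marie_Curie")
memory.append([0.8, 0.2], [0.0, 1.0], "Warsaw")
before = [list(k) for k in memory.keys]

# A new mention is encoded by the frozen mention encoder, then appended.
new_row = memory.append([0.2, 0.7], [0.9, 0.1], "Marie_Curie")
print(new_row, len(memory))
```

No model weights change in this step, which is what makes the update cheap compared with retraining.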

3. Memory Attention and Retrieval Mechanism

Within the TOME (Transformer Over Mention Embeddings) architecture, a dedicated MemoryAttention module is interleaved with standard self-attention blocks. For each input mention $m=(s,e)$, a query $q_m$ is produced via a new linear projection $W_Q$:

$$q_m = W_Q [H_s; H_e]$$

To render retrieval tractable over $N$ candidates, approximate nearest neighbor search (ANNS) is used to find the top-$K$ closest memory keys to $q_m$, where $\mathrm{MemKey}_i$ denotes row $i$ of the memory key table:

$$\mathcal{N}_K(q_m) = \operatorname{TopK}_{i \in \{1,\ldots,N\}} \left( q_m^\top \mathrm{MemKey}_i \right)$$

Attention weights are computed locally over $\mathcal{N}_K(q_m)$:

$$\alpha_i = \frac{\exp\left(q_m^\top \mathrm{MemKey}_i\right)}{\sum_{j \in \mathcal{N}_K(q_m)} \exp\left(q_m^\top \mathrm{MemKey}_j\right)}, \qquad i \in \mathcal{N}_K(q_m)$$

The retrieved vector $r_m$ for mention $m$ is a weighted sum of the corresponding memory values $\mathrm{MemValue}_i$:

$$r_m = \sum_{i \in \mathcal{N}_K(q_m)} \alpha_i \, \mathrm{MemValue}_i$$

This vector is projected into the model dimension by a matrix $W_U$ and injected as a residual at the mention's start position:

$$\tilde{H}_s = H_s + W_U \, r_m$$

Only the start marker position is updated in this manner. The output $\tilde{H}$ is then supplied to a standard Transformer block, completing the MemoryAttention layer.
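The retrieval and update steps for a single mention can be sketched as follows. This is a minimal illustration under stated assumptions: exact top-K search by inner product stands in for the approximate nearest neighbor search, the query is assumed to be already projected, and the function name, argument names, and toy values are hypothetical.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_attention(h_s, q, mem_keys, mem_values, W_U, top_k):
    """Return the updated start-position vector for one mention."""
    scores = [dot(q, key) for key in mem_keys]
    # Exact top-K over all rows; a stand-in for ANNS in this sketch.
    top = heapq.nlargest(top_k, range(len(mem_keys)), key=scores.__getitem__)

    # Softmax restricted to the retrieved rows (shifted for stability).
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    alphas = [x / total for x in exps]

    # Weighted sum of the retrieved memory values.
    d_v = len(mem_values[0])
    retrieved = [
        sum(a * mem_values[i][j] for a, i in zip(alphas, top))
        for j in range(d_v)
    ]

    # Project to the model dimension and add as a residual.
    update = [dot(row, retrieved) for row in W_U]
    return [h + u for h, u in zip(h_s, update)], top, alphas


mem_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
mem_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
W_U = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability
new_h_s, top, alphas = memory_attention(
    [0.0, 0.0], [2.0, 0.0], mem_keys, mem_values, W_U, top_k=2
)
print(new_h_s, top, alphas)
```

Restricting the softmax to the retrieved rows keeps the cost per mention proportional to K rather than to the full memory size.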

4. End-to-End TOME Model Integration

TOME architectures integrate memory-aware reasoning into standard Transformer pipelines via "TOMEBlocks": each block consists of a MemoryAttention layer followed by several Transformer layers. The input sequence $x$ is first processed by an initial Transformer stack, then passed through $B$ TOMEBlocks:

$$H^{(0)} = \mathrm{InitialTransformerBlock}(x)$$

For $b = 1, \ldots, B$:

$$H^{(b)} = \mathrm{TransformerBlock}_b\left(\mathrm{MemoryAttention}_b\left(H^{(b-1)}\right)\right)$$

Two variants are described:

  • tome-1: $B = 1$ TOMEBlock
  • tome-2: $B = 2$ TOMEBlocks, with the number of Transformer layers per block chosen so that the total depth is comparable to BERT-Base

The final representation $H^{(B)}$ serves downstream tasks (e.g., claim verification, QA).
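The control flow of the stack is simple enough to sketch with placeholder callables. The function `run_tome` and the arithmetic stand-ins below are hypothetical; they only show the order in which the initial stack, MemoryAttention, and the Transformer layers are applied.

```python
def run_tome(x, initial_block, tome_blocks):
    """Apply the initial stack, then each (MemoryAttention, Transformer) pair."""
    h = initial_block(x)
    for memory_attention, transformer_block in tome_blocks:
        h = transformer_block(memory_attention(h))
    return h


# Arithmetic stand-ins so the data flow is easy to trace by hand.
def initial(x):
    return x + 1


def mem_attn(h):
    return h * 2


def transformer(h):
    return h + 3


one_block = run_tome(0, initial, [(mem_attn, transformer)])
two_blocks = run_tome(0, initial, [(mem_attn, transformer)] * 2)
print(one_block, two_blocks)
```

With two blocks, the second MemoryAttention layer queries the memory using representations that already contain what the first retrieval returned.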

5. Training Regime and Objectives

Mention Encoder Pretraining ("batch-tome"):

  • Masked LM loss: 20% of entity mention tokens masked; 10% of non-entity tokens.
  • Entity-coreference contrastive loss: For each batch, positives are other mentions with the same Wikipedia ID; negatives are all other mentions. With $c_m$ the coreference representation of mention $m$ and $\mathrm{Ent}(m)$ its Wikipedia ID, the loss for mention $m$ is:

$$\mathcal{L}_{\mathrm{coref}}(m) = -\log \frac{\sum_{m' \neq m} \exp\left(c_m^\top c_{m'}\right) \mathbb{1}\left[\mathrm{Ent}(m') = \mathrm{Ent}(m)\right]}{\sum_{m' \neq m} \exp\left(c_m^\top c_{m'}\right)}$$

Joint optimization weights the MLM loss by 0.85 and the coreference loss by 0.15 and uses AdamW; the learning rate and the number of training steps are as reported in Jong et al. (2021).
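A minimal sketch of the in-batch coreference loss follows, assuming inner-product similarity between coreference representations and skipping mentions that have no positive in the batch. The function name, the toy vectors, and the placeholder MLM loss value are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def coref_loss(reps, entity_ids, m):
    """Contrastive loss for mention m against the other mentions in a batch.

    Returns None when the batch holds no other mention of the same entity.
    """
    others = [j for j in range(len(reps)) if j != m]
    scores = {j: dot(reps[m], reps[j]) for j in others}
    peak = max(scores.values())  # shift for numerical stability
    denom = sum(math.exp(s - peak) for s in scores.values())
    numer = sum(
        math.exp(scores[j] - peak)
        for j in others
        if entity_ids[j] == entity_ids[m]
    )
    if numer == 0.0:
        return None
    return -math.log(numer / denom)


reps = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
entity_ids = ["A", "A", "B"]
loss_0 = coref_loss(reps, entity_ids, 0)
loss_2 = coref_loss(reps, entity_ids, 2)

mlm_loss = 2.0  # placeholder value, not a reported number
joint = 0.85 * mlm_loss + 0.15 * loss_0  # weights from the text
print(loss_0, loss_2, joint)
```

The loss falls as same-entity mentions move closer together than different-entity ones, which is the property the memory keys need for retrieval to work.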

Global TOME Pretraining:

  • Weighted sum: 85% MLM, 15% entity-prediction over mention memory.
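  • Entity-prediction loss: For mention $m$ with query $q_m$, gold entity $\mathrm{Ent}(m)$, and retrieved set $\mathcal{N}_{K'}(q_m)$ of memory rows (with key $\mathrm{MemKey}_i$ and entity ID $\mathrm{MemEnt}_i$ for row $i$), compute:

$$p_i = \frac{\exp\left(q_m^\top \mathrm{MemKey}_i\right)}{\sum_{j \in \mathcal{N}_{K'}(q_m)} \exp\left(q_m^\top \mathrm{MemKey}_j\right)}$$

$$\mathcal{L}_{\mathrm{EP}}(m) = -\log \sum_{i \in \mathcal{N}_{K'}(q_m)} p_i \, \mathbb{1}\left[\mathrm{MemEnt}_i = \mathrm{Ent}(m)\right]$$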

Memory entries from the same passage are ignored to prevent information leakage.
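The entity-prediction loss and the same-passage exclusion can be sketched together. Several choices here are assumptions rather than details from the source: same-passage rows are filtered out before retrieval, exact top-K replaces approximate search, and the passage IDs, function name, and toy values are hypothetical.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def entity_prediction_loss(q, mem_keys, mem_entities, mem_passages,
                           gold_entity, passage_id, top_k):
    """Negative log-probability mass on retrieved rows with the gold entity."""
    # Drop rows from the query's own passage to avoid information leakage.
    candidates = [i for i in range(len(mem_keys))
                  if mem_passages[i] != passage_id]
    top = heapq.nlargest(top_k, candidates,
                         key=lambda i: dot(q, mem_keys[i]))
    scores = [dot(q, mem_keys[i]) for i in top]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    gold_mass = sum(x for x, i in zip(exps, top)
                    if mem_entities[i] == gold_entity)
    if gold_mass == 0.0:
        return None  # gold entity was not retrieved
    return -math.log(gold_mass / sum(exps))


mem_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mem_entities = ["A", "A", "B"]
mem_passages = [0, 1, 2]
ep_loss = entity_prediction_loss(
    [1.0, 0.0], mem_keys, mem_entities, mem_passages,
    gold_entity="A", passage_id=0, top_k=2,
)
print(ep_loss)
```

Without the filter, row 0 (the mention's own passage) would match the query trivially and the loss would carry little training signal.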

Hyperparameters:

  • Model hidden size: $d = 768$; the key, value, and coreference embedding dimensions are set separately (see Jong et al., 2021)
  • Memory size: $N \approx 150$ million
  • Separate top-$K$ retrieval sizes are used for MemoryAttention and for entity-prediction loss retrieval
  • Optimizer: AdamW, with weight decay and gradient-norm clipping

6. Empirical Results and Analysis

TME (with TOME) outperforms the REALM and EaE baselines on five of the six knowledge-intensive benchmarks below, trailing REALM only on TriviaQA. Key empirical results (accuracy reported):

| Task | REALM | EaE | tome-2 |
| --- | --- | --- | --- |
| HoVer (test) | 66.1% | 66.6% | 73.1% |
| FEVER (test) | 67.1% | 63.6% | 68.1% |
| FM2 (dev) | 65.8% | 63.5% | 68.4% |
| TriviaQA (test) | 67.1% | 53.4% | 65.8% |
| ComplexWebQA (dev) | 46.7% | 42.7% | 47.7% |
| EntityQuestions (acc) | 59.0% | 32.5% | 66.0% |
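To make the comparison concrete, the short script below computes the margin of tome-2 over the stronger of the two baselines on each task, using only the accuracies listed above. The variable names are illustrative.

```python
# Accuracies (%) copied from the table: (REALM, EaE, tome-2).
results = {
    "HoVer (test)": (66.1, 66.6, 73.1),
    "FEVER (test)": (67.1, 63.6, 68.1),
    "FM2 (dev)": (65.8, 63.5, 68.4),
    "TriviaQA (test)": (67.1, 53.4, 65.8),
    "ComplexWebQA (dev)": (46.7, 42.7, 47.7),
    "EntityQuestions (acc)": (59.0, 32.5, 66.0),
}

# Margin in percentage points over the best baseline for each task.
margins = {
    task: round(tome2 - max(realm, eae), 1)
    for task, (realm, eae, tome2) in results.items()
}
wins = [task for task, margin in margins.items() if margin > 0]

for task, margin in margins.items():
    print(f"{task}: {margin:+.1f}")
print(f"tome-2 leads on {len(wins)} of {len(results)} tasks")
```

The largest margins are on HoVer and EntityQuestions, while TriviaQA is the one task where REALM stays ahead.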

Ablations show that removing mention coreference pretraining dramatically degrades downstream performance, whereas the entity-prediction loss in TOME has a more modest effect. Gains scale smoothly with memory size up to the full 150M-span scale. In "zero-shot" tests, where entities withheld during training are added to the memory at evaluation time, accuracy does not degrade, which supports the semi-parametric, modular nature of the memory component.

7. Comparison and Theoretical Significance

TME combines scalable fact storage and rapid entity-centric updating with Transformer-based models. Because the memory is a plug-and-play table, the knowledge base is updated by appending new mentions, with no retraining required. Unlike purely parametric memory methods, this lets the memory grow dynamically, supporting continual learning and adaptation to new entities. The interleaving of MemoryAttention with standard attention enables models to synthesize local and global factual information in a unified, end-to-end trainable structure (Jong et al., 2021).
