Text Mention Embedding (TME) Overview
- Text Mention Embedding (TME) is a semi-parametric approach that embeds and indexes entity mentions as dense vectors to integrate large-scale textual knowledge into Transformer models.
- It constructs a global memory table from approximately 150 million Wikipedia-linked spans, allowing new entity mentions to be appended without retraining.
- TME relies on a dedicated MemoryAttention mechanism within the TOME architecture and reports strong results on knowledge-intensive tasks such as FEVER and TriviaQA.
Text Mention Embedding (TME) is a semi-parametric approach to integrating large-scale textual knowledge into deep language models, particularly Transformers, by representing and indexing every entity mention in a corpus as a dense vector in a memory table. This method supports reasoning over disparate sources of entity-linked information and achieves state-of-the-art performance on several knowledge-intensive NLP tasks. The technique is characterized by the construction of a global mention memory, specialized attention layers for memory retrieval, and a hybrid training regime that combines masked language modeling with entity-centric losses (Jong et al., 2021).
1. Mention Extraction and Embedding Construction
TME begins by identifying mentions in a large raw corpus. Standard Named Entity Recognition (NER) and Wikipedia-linking procedures extract entity mentions, and only mentions "grounded" to a Wikipedia page are retained. Each mention is marked by inserting two special tokens, $[E_{\text{start}}]$ and $[E_{\text{end}}]$, around the mention span. For a passage $x = (x_1, \dots, x_n)$ and a mention covering $x_s, \dots, x_e$, the rewritten sequence is:

$$x_1, \dots, x_{s-1}, [E_{\text{start}}], x_s, \dots, x_e, [E_{\text{end}}], x_{e+1}, \dots, x_n$$

These markers delimit each mention span, and their contextualized token vectors $(H_{\text{start}}, H_{\text{end}})$ serve as the feature basis for embedding construction.
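The marking step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the marker strings, the function name `mark_mentions`, and the inclusive `(start, end)` span convention are all assumptions made for the example.

```python
# Hypothetical marker strings; the real vocabulary entries may differ.
E_START = "[E_start]"
E_END = "[E_end]"


def mark_mentions(tokens, spans):
    """Insert marker tokens around each (start, end) span; end is inclusive.

    Returns the new token list and, for each span, the positions of its
    start and end markers in that new list.
    """
    out, marker_positions = [], []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end < start or end >= len(tokens):
            raise ValueError("spans must be in range and non-overlapping")
        out.extend(tokens[cursor:start])      # text before the mention
        start_pos = len(out)
        out.append(E_START)
        out.extend(tokens[start:end + 1])     # the mention itself
        end_pos = len(out)
        out.append(E_END)
        marker_positions.append((start_pos, end_pos))
        cursor = end + 1
    out.extend(tokens[cursor:])               # text after the last mention
    return out, marker_positions


tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw", "."]
marked, positions = mark_mentions(tokens, [(0, 1), (5, 5)])
```

The returned marker positions are what a downstream encoder would use to look up the two contextualized vectors for each mention.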
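A BERT-Base-style Transformer (hidden size $d$) processes the marked sequence, yielding contextualized token representations $H$. Each mention is projected via two distinct span-projection layers to obtain a key embedding $k_m$ and a value embedding $v_m$:

$$k_m = W_K\,[H_{\text{start}}; H_{\text{end}}], \qquad v_m = W_V\,[H_{\text{start}}; H_{\text{end}}]$$

Here, $H_{\text{start}}$ and $H_{\text{end}}$ are the rows of $H$ at the mention's two marker positions, $[\cdot\,;\cdot]$ denotes concatenation, and $W_K$ and $W_V$ are independent learnable weights. For auxiliary coreference pretraining, an additional span-projection $W_C$ produces a representation $c_m = W_C\,[H_{\text{start}}; H_{\text{end}}]$.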
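As a rough sketch of the span projection, the snippet below concatenates the two marker vectors and multiplies by a weight matrix. The dimensions are toy values chosen for readability, the random matrix `H` stands in for real encoder output, and the helper names are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def span_project(W, H, start_pos, end_pos):
    """Project the concatenated start/end marker vectors with weight W."""
    span_features = H[start_pos] + H[end_pos]  # list concatenation, length 2*d
    return matvec(W, span_features)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, d_key, d_value = 8, 4, 6               # toy sizes, not the paper's values
H = random_matrix(11, d, rng)             # stand-in for encoder output
W_K = random_matrix(d_key, 2 * d, rng)    # key projection
W_V = random_matrix(d_value, 2 * d, rng)  # value projection (independent of W_K)

key = span_project(W_K, H, 0, 3)
value = span_project(W_V, H, 0, 3)
```

Keys and values use separate weights, so the key space can be shaped for retrieval while the value space carries the content that is later mixed into the model.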
2. Global Mention Memory Structure
The mention encoder is run across approximately 150 million Wikipedia-linked spans, producing global memory tables:
- $\text{MemKey} \in \mathbb{R}^{N \times d_K}$, the key table
- $\text{MemValue} \in \mathbb{R}^{N \times d_V}$, the value table
- $\text{MemEnt} \in \mathbb{N}^{N}$, mapping memory rows to Wikipedia entity IDs
Here, $N \approx 150$ million. The memory is semi-parametric: new mentions can be added by appending their keys and values after encoding, with no need for retraining. The tables stay fixed during downstream pretraining and fine-tuning; adding new mentions requires only a forward pass through the mention encoder.
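The append-only behaviour of the memory can be illustrated with a small container. The class name `MentionMemory`, its methods, and the integer entity IDs are hypothetical; a real system at this scale would use sharded arrays rather than Python lists.

```python
class MentionMemory:
    """Append-only tables of mention keys, values, and entity IDs."""

    def __init__(self, d_key, d_value):
        self.d_key, self.d_value = d_key, d_value
        self.keys, self.values, self.entity_ids = [], [], []

    def append(self, key, value, entity_id):
        """Add one encoded mention and return its row index."""
        if len(key) != self.d_key or len(value) != self.d_value:
            raise ValueError("key/value dimension mismatch")
        self.keys.append(list(key))
        self.values.append(list(value))
        self.entity_ids.append(entity_id)
        return len(self.keys) - 1

    def __len__(self):
        return len(self.keys)


memory = MentionMemory(d_key=2, d_value=3)
memory.append([1.0, 0.0], [0.1, 0.2, 0.3], entity_id=101)
memory.append([0.0, 1.0], [0.4, 0.5, 0.6], entity_id=202)

# A mention of a new entity is added later; existing rows are untouched
# and no model weights change.
new_row = memory.append([0.5, 0.5], [0.7, 0.8, 0.9], entity_id=303)
```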
3. Memory Attention and Retrieval Mechanism
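Within the TOME (Transformer Over Mention Embeddings) architecture, a dedicated MemoryAttention module is interleaved with standard self-attention blocks. For each input mention $m$, a query $q_m$ is produced via a new linear projection $W_Q$ applied to the mention's marker vectors:

$$q_m = W_Q\,[H_{\text{start}}; H_{\text{end}}]$$

To make retrieval tractable over $N$ candidates, approximate nearest neighbor search (ANNS) finds the top-$k$ memory keys closest to $q_m$:

$$\mathcal{K}_m = \operatorname{top-}k_{\,i \in \{1, \dots, N\}} \; q_m^\top \text{MemKey}_i$$

where $\text{MemKey}_i$ and $\text{MemValue}_i$ denote row $i$ of the key and value tables. Attention weights are computed locally over $\mathcal{K}_m$:

$$\alpha_{m,i} = \frac{\exp\left(q_m^\top \text{MemKey}_i\right)}{\sum_{j \in \mathcal{K}_m} \exp\left(q_m^\top \text{MemKey}_j\right)}, \qquad i \in \mathcal{K}_m$$

The retrieved vector $r_m$ for mention $m$ is a weighted sum of the corresponding $\text{MemValue}_i$:

$$r_m = \sum_{i \in \mathcal{K}_m} \alpha_{m,i}\, \text{MemValue}_i$$

This vector is projected into the model dimension and injected as a residual at the mention's start position:

$$H'_{\text{start}} = H_{\text{start}} + W_U\, r_m$$

Only the start marker position is updated in this manner. The output $H'$ is then supplied to a standard Transformer block, completing the MemoryAttention layer.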
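The retrieval and residual update can be sketched as follows. Note the substitution: this example uses exact brute-force top-k search in place of ANNS, which is only practical for a toy memory. Function names and the matrix `W_U` are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, mem_keys, mem_values, k):
    """Exact top-k search, then a softmax-weighted sum of the matching values."""
    scores = [dot(query, key) for key in mem_keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in top)   # subtract max for stability
    exps = [math.exp(scores[i] - max_score) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]       # softmax over the top-k only
    d_value = len(mem_values[0])
    retrieved = [sum(w * mem_values[i][j] for w, i in zip(weights, top))
                 for j in range(d_value)]
    return top, weights, retrieved


def inject(h_start, W_U, retrieved):
    """Residual update of the start-marker vector: h + W_U r."""
    update = [dot(row, retrieved) for row in W_U]
    return [h + u for h, u in zip(h_start, update)]


mem_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
mem_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
query = [1.0, 0.0]

top, weights, retrieved = retrieve(query, mem_keys, mem_values, k=2)
W_U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # maps d_value=2 to d=3
h_new = inject([0.5, 0.5, 0.5], W_U, retrieved)
```

Because the softmax runs over the retrieved rows only, memory entries outside the top-k contribute nothing to the update, which is what keeps the cost independent of the full memory size once the neighbours are found.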
4. End-to-End TOME Model Integration
TOME architectures integrate memory-aware reasoning into standard Transformer pipelines via "TOMEBlocks": each block consists of a MemoryAttention layer followed by several Transformer layers. The input sequence is first processed by an initial Transformer stack, then passed through $B$ TOMEBlocks:

$$H^{(0)} = \text{InitialTransformerBlock}(\text{TokenEmbedding}(x))$$

For $b = 1, \dots, B$:

$$H^{(b)} = \text{TOMEBlock}_b\left(H^{(b-1)}\right)$$
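The block structure is plain function composition, which the following sketch makes explicit. The layers here are toy stand-ins that only record the order in which they run; the two Transformer layers per block are an arbitrary choice for the example, not the configuration of either published variant.

```python
def make_tome_block(memory_attention, transformer_layers):
    """One TOMEBlock: memory attention, then a list of Transformer layers."""
    def block(hidden):
        hidden = memory_attention(hidden)
        for layer in transformer_layers:
            hidden = layer(hidden)
        return hidden
    return block


def tome_forward(token_embeddings, initial_block, tome_blocks):
    """Apply the initial Transformer stack, then each TOMEBlock in order."""
    hidden = initial_block(token_embeddings)
    for block in tome_blocks:
        hidden = block(hidden)
    return hidden


# Toy stand-ins that leave the input unchanged and log their name.
trace = []


def tagged(name):
    def layer(hidden):
        trace.append(name)
        return hidden
    return layer


blocks = [
    make_tome_block(tagged(f"mem{b}"), [tagged(f"tf{b}.{i}") for i in range(2)])
    for b in (1, 2)
]
output = tome_forward([[0.0]], tagged("init"), blocks)
```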
Two variants are described:
- tome-1: 1 TOMEBlock with 2 Transformer layers
- tome-2: 3 TOMEBlocks with 4 Transformer layers each (total depth 5 BERT-Base)
The final representation $H^{(B)}$, the output of the last TOMEBlock, serves downstream tasks (e.g., claim verification, QA).
5. Training Regime and Objectives
Mention Encoder Pretraining ("batch-tome"):
- Masked LM loss: 20% of entity mention tokens masked; 10% of non-entity tokens.
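- Entity-coreference contrastive loss: For each batch, positives are other mentions with the same Wikipedia ID; negatives are all other mentions. With $c_m$ the coreference representation of mention $m$ and $\text{Ent}(m)$ its Wikipedia ID, the loss for mention $m$ is:

$$\mathcal{L}_{\text{coref}}(m) = -\log \frac{\sum_{m' \neq m} \exp\left(c_m^\top c_{m'}\right)\, \mathbb{1}\left[\text{Ent}(m') = \text{Ent}(m)\right]}{\sum_{m' \neq m} \exp\left(c_m^\top c_{m'}\right)}$$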
Joint optimization uses weights: 0.85 for MLM, 0.15 for coreference, with AdamW (learning rate 9), for 0M steps.
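A minimal sketch of the coreference objective described above, for a single anchor mention in a batch. It assumes dot-product similarity and returns `None` when the batch holds no other mention of the same entity; both choices are assumptions of this example rather than details confirmed by the source.

```python
import math


def coref_loss(reprs, entity_ids, m):
    """-log(mass on same-entity mentions / mass on all other mentions)."""
    scores = {
        j: sum(a * b for a, b in zip(reprs[m], reprs[j]))
        for j in range(len(reprs)) if j != m
    }
    positives = [j for j in scores if entity_ids[j] == entity_ids[m]]
    if not positives:
        return None  # no positive in the batch, so no loss for this mention
    max_s = max(scores.values())              # subtract max for stability
    denom = sum(math.exp(s - max_s) for s in scores.values())
    numer = sum(math.exp(scores[j] - max_s) for j in positives)
    return -math.log(numer / denom)


# Mentions 0 and 1 share an entity; mention 2 refers to a different one.
reprs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
entity_ids = [7, 7, 9]
loss = coref_loss(reprs, entity_ids, m=0)
```

In this toy batch the anchor scores 1 against its positive and 0 against the negative, so the loss equals log(1 + e^-1).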
Global TOME Pretraining:
- Weighted sum: 85% MLM, 15% entity-prediction over mention memory.
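- Entity-prediction loss: For mention $m$ with query $q_m$, retrieved set $\mathcal{K}_m$, and linked entity $\text{Ent}(m)$, compute:

$$s_{m,i} = q_m^\top \text{MemKey}_i, \qquad i \in \mathcal{K}_m$$

$$\mathcal{L}_{\text{EP}}(m) = -\log \frac{\sum_{i \in \mathcal{K}_m} \exp(s_{m,i})\, \mathbb{1}\left[\text{MemEnt}_i = \text{Ent}(m)\right]}{\sum_{i \in \mathcal{K}_m} \exp(s_{m,i})}$$

where $\text{MemKey}_i$ is row $i$ of the key table and $\text{MemEnt}_i$ is the entity ID of that row.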
Memory entries from the same passage are ignored to prevent information leakage.
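The same-passage exclusion is easy to get wrong, so here is a small sketch of an entity-prediction loss that filters the memory before retrieval. It uses exact top-k search as a stand-in for ANNS, and the passage-ID bookkeeping (`mem_passages`) is a hypothetical structure introduced for the example.

```python
import math


def entity_prediction_loss(query, gold_entity, passage_id,
                           mem_keys, mem_entities, mem_passages, k):
    """Loss over the top-k retrieved rows, skipping same-passage rows."""
    # Drop rows from the query's own passage to avoid trivial matches.
    candidates = [i for i in range(len(mem_keys))
                  if mem_passages[i] != passage_id]
    scores = {i: sum(q * x for q, x in zip(query, mem_keys[i]))
              for i in candidates}
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    gold = [i for i in top if mem_entities[i] == gold_entity]
    if not gold:
        return None  # gold entity not retrieved, so no loss in this sketch
    max_s = max(scores[i] for i in top)
    denom = sum(math.exp(scores[i] - max_s) for i in top)
    numer = sum(math.exp(scores[i] - max_s) for i in gold)
    return -math.log(numer / denom)


mem_keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mem_entities = [7, 7, 9, 9]
mem_passages = ["p1", "p2", "p3", "p4"]

# The query comes from passage p1, so row 0 (its own mention) is excluded.
loss_masked = entity_prediction_loss([1.0, 0.0], 7, "p1",
                                     mem_keys, mem_entities, mem_passages, k=2)
# Without the exclusion, the same-passage row dominates retrieval.
loss_leaky = entity_prediction_loss([1.0, 0.0], 7, "p0",
                                    mem_keys, mem_entities, mem_passages, k=2)
```

In the toy data the unmasked loss collapses to zero because the model can match its own passage, while the masked version yields log 2 and still carries a training signal.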
Hyperparameters:
- Model hidden size: 4; 5; 6; 7
- Memory size: $N \approx 150$ million mentions
- 9 for MemoryAttention; 0 for entity-prediction loss retrieval
- Optimizer: AdamW, weight decay 1, gradient clipping norm 2
6. Empirical Results and Analysis
With TOME, TME performs strongly on several knowledge-intensive benchmarks. In the results below (accuracy), tome-2 leads both baselines on five of the six tasks, while REALM remains ahead on TriviaQA:
| Task | REALM | EaE | tome-2 |
|---|---|---|---|
| HoVer (test) | 66.1% | 66.6% | 73.1% |
| FEVER (test) | 67.1% | 63.6% | 68.1% |
| FM2 (dev) | 65.8% | 63.5% | 68.4% |
| TriviaQA (test) | 67.1% | 53.4% | 65.8% |
| ComplexWebQA (dev) | 46.7% | 42.7% | 47.7% |
| EntityQuestions (acc) | 59.0% | 32.5% | 66.0% |
Ablations show that removing mention coreference pretraining sharply degrades downstream performance, whereas the entity-prediction loss in TOME has a more modest effect. Gains scale smoothly with memory size up to the full 150M-span scale. In "zero-shot" tests, where entities withheld during training are added to the memory at evaluation time, accuracy does not degrade, which supports the semi-parametric, modular nature of the memory component.
7. Comparison and Theoretical Significance
TME combines scalable fact storage and rapid entity-centric updating with Transformer-based models. Unlike purely parametric memory methods, its semi-parametric table can be expanded by appending new mentions without retraining, which supports continual learning and adaptation to new entities. Interleaving MemoryAttention with standard attention lets the model synthesize local and global factual information in a unified, end-to-end trainable structure (Jong et al., 2021).