
Zero-Shot Entity Typing (MZET)

Updated 19 December 2025
  • Zero-shot entity typing (MZET) is a task that assigns semantic types to entity mentions without any labeled examples during training by leveraging type semantics and external knowledge bases.
  • MZET approaches employ methods such as memory-augmented embedding, prompt-based inference, and NLI-assisted calibration to map entities and type labels into a shared semantic space.
  • Evaluation of MZET utilizes metrics like micro-F1 and macro-F1 on benchmarks (e.g., BBN, FIGER, OntoNotes) to highlight the need for effective calibration and handling of unseen type hierarchies.

Zero-shot entity typing (often abbreviated MZET, after "Memory-augmented Zero-shot Entity Typing") is the task of assigning semantic types to entity mentions in text without access to any labeled examples of the target types during model training. Unlike conventional entity typing, where models are trained on a fixed label set with direct supervision, MZET and related approaches address the generalization challenge posed by fine-grained, rapidly evolving, or domain-specific type ontologies. They do so by transferring knowledge from seen types to previously unseen (zero-shot) types, relying on type semantics, type hierarchies, external knowledge bases, or prompt-based use of pretrained language models. Zero-shot entity typing has become crucial for robust information extraction, knowledge base population, cross-domain adaptation, and open-domain question answering.

1. Formal Definitions and Problem Setting

Zero-shot entity typing is defined as follows. Let the training set consist of entity mentions and their types drawn from a set of seen types $\mathcal{T}_\mathrm{seen}$. At inference, the goal is to predict types for mentions that belong to $\mathcal{T}_\mathrm{unseen}$, where $\mathcal{T}_\mathrm{unseen} \cap \mathcal{T}_\mathrm{seen} = \varnothing$. The input to the model is a text context $S$ and a marked entity mention $e \in S$; the output is a type $t \in \mathcal{T}_\mathrm{unseen}$ (in single-label settings) or a subset of types (in multi-label settings). Zero-shot protocols may also involve hierarchical ontologies, complex type definitions (Boolean expressions), and multi-factor evaluation across accuracy, micro-F1, and macro-F1 scores (Zhang et al., 2020, Komarlu et al., 2023, Feng et al., 2023, Zhou et al., 2019).
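
To make the protocol concrete, here is a minimal sketch of the problem setting in Python; all names and the example ontology are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TypedMention:
    context: str          # the sentence S
    mention: str          # the entity mention e, appearing in S
    gold_types: set[str]  # gold labels (a subset in multi-label settings)

# Hypothetical seen/unseen split; the zero-shot protocol requires
# the two sets to be disjoint.
T_SEEN = {"/person", "/person/artist", "/organization"}
T_UNSEEN = {"/person/athlete", "/location/city"}
assert T_SEEN.isdisjoint(T_UNSEEN)

def predict_types(context: str, mention: str) -> set[str]:
    """A zero-shot typer must return types from T_UNSEEN even though
    training labels were drawn only from T_SEEN."""
    raise NotImplementedError
```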

MZET systems generalize to unseen types by leveraging explicit type semantics (e.g., textual descriptions, hierarchies), cross-type transfer (via prototypes or memory modules), or by mapping context/entity and type into a shared embedding or probabilistic space.

2. Methodological Approaches

Zero-shot entity typing spans a range of paradigms:

A. Memory-Augmented Embedding Models

The “MZET” framework embeds both mentions and type labels (seen/unseen) into a shared semantic space. Mention representation integrates character-level LSTM, word-level embedding, and context encoding (BERT+BiLSTM+attention). Type representation combines BERT-derived semantic embeddings with a hierarchical encoding (adjacency vector over the ontology). The core innovation is a memory network indexed by seen-type embeddings, through which the mention’s representation “attends” to the most related seen types, producing a summary vector. Unseen types are projected into this space via their semantic/hierarchical similarity to seen types, enabling inference for new labels. Association between mention and unseen type is given by:

$$y_j = \sigma\!\left( \sum_{i=1}^{D_s} R_{ij}\, a_{m,i} \right)$$

where $R_{ij}$ is the softmax-normalized similarity between seen type $i$ and unseen type $j$, and $a_{m,i}$ is the mention's attention weight over the memory slot for seen type $i$ (Zhang et al., 2020).
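
A minimal NumPy sketch of this scoring step; the column-wise normalization axis and all variable names are assumptions for illustration.

```python
import numpy as np

def unseen_type_scores(a_m: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """Score unseen types from a mention's attention over seen types.

    a_m : shape (D_s,); attention weights of the mention over the
          seen-type memory slots (the summary described above).
    sim : shape (D_s, D_u); raw similarities between seen-type and
          unseen-type embeddings (e.g., dot products).
    Returns y with shape (D_u,): one sigmoid score per unseen type.
    """
    # R[i, j]: similarity of seen type i to unseen type j, softmax-
    # normalized over seen types (stabilized by subtracting the max).
    e = np.exp(sim - sim.max(axis=0, keepdims=True))
    R = e / e.sum(axis=0, keepdims=True)
    logits = R.T @ a_m  # sum_i R_ij * a_{m,i}, for each unseen type j
    return 1.0 / (1.0 + np.exp(-logits))
```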

B. Sequence-to-Sequence and Calibration-Based Typing

CASENT recasts ultra-fine entity typing as an autoregressive seq2seq problem where each mention/context is encoded as a string and fine-grained types are generated as sequences from a large vocabulary. Constrained beam search is employed at decoding, restricting generation to valid type prefixes using a prefix trie. The raw sequence log-probabilities are calibrated via a multi-label variant of Platt scaling that corrects for unconditional type biases and frequency effects:

$$\widehat{p}(t \mid e) = \sigma\!\left( w^{(1)}_{\phi(t)} x_1 + w^{(2)}_{\phi(t)} x_2 + b_{\phi(t)} \right)$$

where $x_1 = \log p_\theta(t \mid e)$ and $x_2 = \log p_\theta(t \mid \varnothing)$ capture context and model bias, respectively; calibration parameters are shared within type-frequency buckets, with $\phi(t)$ mapping type $t$ to its bucket (Feng et al., 2023).
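
A sketch of this calibration step, fitting one two-feature Platt-style logistic calibrator per frequency bucket; this is an illustration with scikit-learn, not the authors' implementation, and all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bucket_calibrators(x1, x2, labels, buckets):
    """Fit w1*x1 + w2*x2 + b per type-frequency bucket.

    x1, x2 : log p(t|e) and log p(t|empty context) for candidate types
             surfaced by beam search (fit only on beam outputs, as
             described above).
    labels : 1 if the candidate type is gold for its mention, else 0.
    buckets: bucket index phi(t) of each candidate's type.
    """
    calibrators = {}
    for b in np.unique(buckets):
        m = buckets == b
        feats = np.stack([x1[m], x2[m]], axis=1)
        calibrators[b] = LogisticRegression().fit(feats, labels[m])
    return calibrators

def calibrated_prob(calibrators, bucket, x1, x2):
    """Calibrated probability that type t applies to mention e."""
    return calibrators[bucket].predict_proba([[x1, x2]])[0, 1]
```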

C. Prompt-Learning with Pre-trained LMs

Prompt-based methods convert the typing task into cloze-style masked language modeling or next-word prediction. For example, for entity mention $e$ and candidate type $t_i$, the input might be "$e$ is a $t_i$." or "In this sentence, $e$ is a [MASK]." The probability the LM assigns to the completed phrase determines type compatibility (Epure et al., 2021, Ding et al., 2021). In the zero-shot regime, a self-supervised distribution-level loss can enforce that identical mentions in different contexts induce consistent output distributions over type verbalizers, promoting clustering without labeled data via a Jensen–Shannon loss (Ding et al., 2021).
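
A minimal cloze-scoring sketch with a Hugging Face masked LM; the checkpoint, template, and single-token verbalizers are assumptions for illustration.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

# Hypothetical single-token verbalizers; multi-token type names would
# need multi-mask or sequence scoring instead.
VERBALIZERS = {"/person": "person", "/location": "location"}

def type_scores(sentence: str, mention: str) -> dict:
    prompt = f"{sentence} In this sentence, {mention} is a {tok.mask_token}."
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the single [MASK] token in the input.
    mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero().item()
    probs = logits[0, mask_pos].softmax(dim=-1)
    return {t: probs[tok.convert_tokens_to_ids(w)].item()
            for t, w in VERBALIZERS.items()}
```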

D. Ontology-Guided and NLI-Assisted Inference

Ontology-guided approaches such as OntoType couple hierarchical label sets with multiple weak sources (PLM predictions via Hearst-pattern prompts, syntactic headwords) and use iterative coarse-to-fine type selection. At each level, a pretrained NLI model computes the likelihood that the sentence entails a type-specific hypothesis ("In this sentence, $e$ is a $t$."), and the overall score combines NLI confidence with PLM-based candidate generation (Komarlu et al., 2023).
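
A sketch of the entailment-scoring step using the generic zero-shot-classification pipeline from Hugging Face; the checkpoint shown stands in for whatever NLI model a system like OntoType would actually use.

```python
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def entailment_scores(sentence: str, mention: str, candidate_types: list):
    """Score hypotheses of the form 'In this sentence, {mention} is a {t}.'"""
    out = nli(sentence,
              candidate_labels=candidate_types,
              hypothesis_template=f"In this sentence, {mention} is a {{}}.",
              multi_label=True)
    return dict(zip(out["labels"], out["scores"]))
```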

E. Grounding-Based Type Inference

ZOE grounds mentions to sets of candidate Wikipedia entries using contextual ELMo similarity, surface-form frequencies, and ESA-based retrieval. Each target type is defined via a Boolean formula over KB primitives (e.g., Freebase types). Type assignment consists of scoring compatibility between the mention's candidate concepts and these symbolic formulas, with no retraining required for new taxonomies (Zhou et al., 2019).
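
A toy illustration of symbolic type assignment; the formulas and primitive names are invented, and real systems score graded compatibility rather than the binary match shown here.

```python
# Hypothetical Boolean type definitions over Freebase-style primitives.
TYPE_FORMULAS = {
    "/person/coach": lambda p: "/people/person" in p and "/sports/coach" in p,
    "/location/city": lambda p: "/location/citytown" in p,
}

def assign_types(candidate_concepts: list) -> set:
    """Each grounded candidate carries a set of KB primitive types;
    a target type fires if any candidate satisfies its formula."""
    return {t for t, formula in TYPE_FORMULAS.items()
            if any(formula(prims) for prims in candidate_concepts)}

# Example: a mention grounded to one entry whose KB record carries
# these primitive types.
print(assign_types([{"/location/location", "/location/citytown"}]))
# -> {'/location/city'}
```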

3. Calibration and Confidence Estimation

In multi-label and ultra-fine settings, calibration of raw scores or probabilities is critical. CASENT demonstrates that sequence log-probabilities from seq2seq models are poor confidence proxies: they reflect both type length and model-internal biases. Calibration involves explicitly modeling the unconditional type distribution (by feeding an empty context), sharing calibration weights by frequency, and restricting the fit to beam outputs. Ablations reveal that omitting the bias term or not sharing calibration parameters leads to substantially higher expected calibration error (ECE) and overfitting (Feng et al., 2023). Prompt-based models may use calibration techniques such as null prompts and content-free prefix scoring to normalize LM outputs (Epure et al., 2021). OntoType integrates scores from both NLI-derived entailment probabilities and candidate-set voting, balancing semantic and syntactic cues (Komarlu et al., 2023).
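
A sketch of content-free (null-prompt) calibration in the spirit described above: each type's prompt probability is divided by its probability under a content-free input, then renormalized. Names and the placeholder choice are hypothetical.

```python
import numpy as np

def contextual_calibration(type_probs: np.ndarray,
                           null_probs: np.ndarray) -> np.ndarray:
    """Normalize prompt scores by a content-free baseline.

    type_probs : LM probabilities of each type verbalizer for the
                 real mention/context.
    null_probs : the same probabilities with a content-free mention
                 (e.g., 'N/A') substituted into the template.
    """
    scores = type_probs / np.clip(null_probs, 1e-12, None)
    return scores / scores.sum()
```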

4. Empirical Results, Datasets, and Evaluation Protocols

Zero-shot entity typing is evaluated on fine-grained benchmarks such as BBN (57/93 types), OntoNotes (86 types), FIGER (113 types), Wiki datasets, and specialized domains (e.g., biomedical JNLPBA, BC5CDR; reviews-based MIT-restaurant/movie). Key metrics include strict accuracy, micro-F1, and macro-F1; in some cases, single-label accuracy is used (argmax over scores).
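
For reference, a minimal implementation of the two headline metrics for multi-label typing; here macro-F1 is computed example-wise (averaged over mentions), one common convention in this literature.

```python
def micro_macro_f1(gold: list, pred: list):
    """gold/pred: parallel lists of type sets, one pair per mention."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    micro = 2 * tp / denom if denom else 1.0
    per_example = [2 * len(g & p) / (len(g) + len(p)) if (g or p) else 1.0
                   for g, p in zip(gold, pred)]
    macro = sum(per_example) / len(per_example)
    return micro, macro
```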

State-of-the-art summary (partial):

| Model | Dataset | Macro-F1 | Micro-F1 | Notes |
|---|---|---|---|---|
| MZET (Zhang et al., 2020) | BBN (zero-shot) | 30.1 | 31.6 | Outperforms ProtoZET and DZET on zero-shot level-2 types |
| CASENT (Feng et al., 2023) | Multi-domain | n/a | n/a | Avg. 69.4% zero-shot accuracy |
| OntoType (Komarlu et al., 2023) | OntoNotes | 81.5 | 73.4 | Outperforms ZOE and ChatGPT zero-shot |
| ZOE (Zhou et al., 2019) | FIGER | 74.8 | 71.3 | KB-based zero-shot; strong fine-typing results |
| PLET(S) (Ding et al., 2021) | Few-NERD | 47.98 | 47.98 | Self-supervised masked LM, zero-shot |

Ablation studies on MZET confirm that removing either the memory module or the word- and character-level modules reduces accuracy by more than 2.5%. In CASENT, small beams or omitting type-bias corrections degrade F1 and worsen calibration (higher ECE). Prompt-based models are highly sensitive to template wording and rare-word exposure (Ding et al., 2021, Feng et al., 2023, Epure et al., 2021).

5. Architectural and Practical Considerations

Key ingredients across MZET systems include:

  • Mention/context encoding: Joint use of semantic (e.g., BERT, ELMo), character, word, and contextual signals.
  • Type encoding: Hierarchical and/or semantic representations (e.g., BERT over type names; adjacency vectors).
  • Type transfer: Mechanisms for mapping unseen types into the representational space of seen types (e.g., memory networks, similarity projections, or hypothesis-based scoring).
  • Zero-shot transfer: Either via distributional matching, entailment-based selection, probabilistic calibration, or KB-based symbolic logic.
  • Calibration: Explicit corrections for model bias, type frequency, or decoding artifacts are paramount, especially in high-cardinality output spaces (Feng et al., 2023).

Prompting-based systems highlight the impact of lexical choice, length bias, and word exposure: type synonyms must be carefully selected for unbiased scoring (Epure et al., 2021).

Grounding and symbolic approaches offer strong cross-domain robustness without assuming that unseen type descriptions align with vector space geometry (Zhou et al., 2019). However, they depend on coverage and granularity of external KBs.

6. Limitations, Best Practices, and Future Directions

Noted challenges include:

  • Long-tail and rare types: Both distributional and symbolic systems struggle when target types are underrepresented or missing from training ontologies or KBs.
  • Type definition mismatch: Symbolic and prompt-based methods depend on accurate linguistic/KB mappings; errors in ontology or synonym selection propagate downstream (Komarlu et al., 2023, Epure et al., 2021).
  • Calibration instability: Overfitting of per-type parameters or insufficient beam diversity leads to high test–dev calibration gaps (Feng et al., 2023).
  • Out-of-vocabulary or foreign mentions: Models generally fail when entity strings are anomalous relative to their pretraining data (Epure et al., 2021, Ding et al., 2021).

Best practices:

  • Share calibration weights within frequency buckets (CASENT) to combat overfitting (Feng et al., 2023).
  • Choose type synonyms that are well supported in the underlying LM’s vocabulary (Epure et al., 2021).
  • For distributional prompt-learning, optimize clustering on mention contexts and negative sampling to avoid trivial solutions (Ding et al., 2021).
  • Refine ontologies and label hierarchies to match fine-grained distinctions present in modern corpora (Komarlu et al., 2023).
  • For symbolic systems, combine KB coverage with semantic embeddings for hybrid inference (Zhou et al., 2019).

Future directions involve automatic prompt discovery, hierarchical loss functions, richer negative sampling, joint entity-linking-plus-typing with confidence calibration, and the incremental integration of KB and PLM-based knowledge for robust, taxonomy-agnostic zero-shot typing (Ding et al., 2021, Feng et al., 2023, Komarlu et al., 2023, Zhou et al., 2019).

