
BERT for Coreference Resolution: Baselines and Analysis (1908.09091v4)

Published 24 Aug 2019 in cs.CL

Abstract: We apply BERT to coreference resolution, achieving strong improvements on the OntoNotes (+3.9 F1) and GAP (+11.5 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities (e.g., President and CEO). However, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing. Our code and models are publicly available.

Authors (4)
  1. Mandar Joshi (24 papers)
  2. Omer Levy (70 papers)
  3. Daniel S. Weld (55 papers)
  4. Luke Zettlemoyer (225 papers)
Citations (312)

Summary

  • The paper demonstrates that BERT-large significantly improves coreference resolution, with absolute gains of 3.9 F1 on OntoNotes and 11.5 F1 on GAP over previous baselines.
  • It evaluates extensions of the c2f-coref architecture and finds that overlapping context does not offer advantages over independent processing.
  • The analysis highlights challenges in long-document context and calls for new pretraining strategies to improve document-level understanding.

BERT for Coreference Resolution: Baselines and Analysis

The paper "BERT for Coreference Resolution: Baselines and Analysis" by Joshi et al. explores the application of the BERT model to the task of coreference resolution, exhibiting significant improvements on established benchmarks such as OntoNotes and GAP. The paper presents a thorough experimental analysis comparing the effectiveness of BERT-based models with previous state-of-the-art models including those based on ELMo and rule-based approaches for coreference tasks.

Key Findings and Contributions

The primary contribution of the paper is the application of BERT, particularly BERT-large, to coreference resolution, which yields substantial performance gains. Specifically, BERT-large surpasses the strong ELMo-based c2f-coref baseline by 3.9 and 11.5 absolute F1 points on the OntoNotes and GAP benchmarks, respectively. These gains are notably larger than those observed for BERT-base, indicating a substantial benefit from BERT-large's richer contextual embeddings and greater transformer capacity.

Two extensions to the c2f-coref architecture are evaluated: the independent and overlap variants. The independent variant processes non-overlapping segments as separate instances, while the overlap variant uses overlapping segments to artificially extend context beyond the 512-token limit imposed by BERT's architecture. The analysis reveals that the overlap variant provides no additional benefit over the independent variant, suggesting that the model struggles to exploit context beyond a single segment; a sketch of the two segmentation strategies follows.
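
For concreteness, here is a minimal sketch of how a long document might be split under the two variants. The function names and the 256-token stride are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of the two segmentation strategies, assuming a 512-token
# encoder limit; the 50%-overlap stride is an illustrative assumption,
# not the authors' exact setting.
from typing import List

def independent_segments(tokens: List[str], max_len: int = 512) -> List[List[str]]:
    """Non-overlapping segments, each encoded by BERT as a separate instance."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def overlapping_segments(tokens: List[str], max_len: int = 512,
                         stride: int = 256) -> List[List[str]]:
    """Overlapping segments; representations of tokens in the shared region
    would later be combined so each token sees context from both neighbours."""
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return segments

# Toy document of 1,000 tokens: 2 independent segments vs. 3 overlapping ones.
doc = [f"tok{i}" for i in range(1000)]
print(len(independent_segments(doc)), len(overlapping_segments(doc)))  # 2 3
```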

Analytical Insights

The paper provides an insightful qualitative analysis of BERT's coreference predictions, highlighting the advantage of BERT-large in distinguishing between related but distinct entities, such as similar geographical names or professional titles. Despite these advancements, the model still exhibits challenges in scenarios involving document-level context comprehension and conversational pronoun resolution. The authors suggest such deficiencies may stem from BERT's pretraining regimen, which prioritizes shorter sequences. This analysis underscores the necessity for further innovations in pretraining strategies that can better capture document-wide information.

Implications for Future Research

The results of this paper suggest meaningful directions for future research in coreference resolution and pretrained language models. One implication is the potential for improved pretraining techniques that specifically target document-level context encoding. The research also points to the importance of efficiently handling the substantial memory requirements of large models like BERT-large, particularly for span representations.
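
To illustrate the memory concern, the following is a rough back-of-envelope sketch, assuming the common c2f-coref span representation (start embedding, end embedding, attended head, plus a small width feature); the maximum span width and feature size used here are illustrative assumptions.

```python
# Back-of-envelope estimate of memory for candidate span representations,
# assuming spans are represented as [x_start; x_end; x_head; width_feature]
# as in c2f-coref; max_span_width and feature_size are illustrative values.
def span_memory_mb(num_tokens: int, max_span_width: int = 30,
                   hidden_size: int = 1024, feature_size: int = 20,
                   bytes_per_float: int = 4) -> float:
    num_spans = num_tokens * max_span_width      # every start position, widths 1..30
    span_dim = 3 * hidden_size + feature_size    # three hidden vectors + width embedding
    return num_spans * span_dim * bytes_per_float / 1e6

# A long document of ~4,000 subword tokens with BERT-large (1024-d hidden states):
print(f"{span_memory_mb(4000):.0f} MB")  # ~1484 MB of span states before pruning
```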

Future research might explore combining BERT with other neural architectures such as memory-augmented networks or sparse attention mechanisms to better handle long-range dependencies. Furthermore, advancements in model efficiency and memory optimization could facilitate the deployment of large pre-trained models in coreference resolution tasks on broader datasets.

Conclusion

This paper presents a comprehensive exploration of BERT's application to coreference resolution, detailing its strengths and limitations within this domain. By advancing the performance metrics on key benchmarks, the research affirms BERT's capability as a foundational architecture for natural language understanding tasks. It establishes groundwork for ensuing methodological improvements and sets a high baseline for future studies aiming to leverage pretrained transformers in coreference resolution. This work is consequential for researchers looking to enhance model architectures for intricate linguistic tasks that require nuanced understanding of text interactions over extended contexts.