
Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis (2201.07281v2)

Published 18 Jan 2022 in cs.CL

Abstract: Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. To date, there is no complete training corpus for both NER and syntactic analysis (e.g., part of speech tagging, dependency parsing) of tweets. While there are some publicly available annotated NLP datasets of tweets, they are only designed for individual tasks. In this study, we aim to create Tweebank-NER, an English NER corpus based on Tweebank V2 (TB2), train state-of-the-art (SOTA) Tweet NLP models on TB2, and release an NLP pipeline called Twitter-Stanza. We annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We train the Stanza pipeline on TB2 and compare with alternative NLP frameworks (e.g., FLAIR, spaCy) and transformer-based models. The Stanza tokenizer and lemmatizer achieve SOTA performance on TB2, while the Stanza NER tagger, part-of-speech (POS) tagger, and dependency parser achieve competitive performance against non-transformer models. The transformer-based models establish a strong baseline in Tweebank-NER and achieve the new SOTA performance in POS tagging and dependency parsing on TB2. We release the dataset and make both the Stanza pipeline and BERTweet-based models available "off-the-shelf" for use in future Tweet NLP research. Our source code, data, and pre-trained models are available at: \url{https://github.com/social-machines/TweebankNLP}.

Citations (20)

Summary

  • The paper presents Tweebank-NER, a rigorously annotated Twitter corpus that enables multi-task training for NER, POS tagging, and dependency parsing.
  • It employs a comprehensive NLP pipeline using Stanza and transformer-based models like BERTweet to achieve state-of-the-art tokenization, lemmatization, and POS tagging.
  • Empirical results demonstrate that integrating domain-specific data with advanced models significantly improves social media text analysis despite inherent challenges.

Insights into "Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis"

The paper discusses the creation and use of Tweebank-NER, an English Named Entity Recognition (NER) corpus derived from the Tweebank V2 dataset and tailored to the analysis of Twitter data. This work advances NLP methodology for social media, where text is typically short, noisy, and colloquial, posing challenges distinct from more formal text sources.

Contributions and Methodology

The paper primarily focuses on two domains: (1) the development of a comprehensive NER annotation for the Tweebank V2 corpus and (2) the construction of robust NLP models optimized for Twitter texts.

  1. Corpus Expansion and Annotation:
    • The paper describes the annotation of Tweebank V2 with named entities using Amazon Mechanical Turk, following a rigorous annotation scheme. The authors report a satisfactory inter-annotator agreement, indicative of the high quality of annotations.
    • This newly annotated corpus, referred to as Tweebank-NER, fills a notable gap by enabling the concurrent training of multi-task models on syntactic parsing, POS tagging, and NER, thereby enhancing the models' domain adaptability for Twitter data.
  2. Development of NLP Models:
    • The authors leverage the Stanza framework to build a comprehensive NLP pipeline, Twitter-Stanza, which achieves state-of-the-art performance in tokenization and lemmatization while remaining competitive with non-transformer models on the other tasks.
    • The transformer-based models, particularly those built on BERTweet, establish a strong baseline on the Tweebank-NER dataset and set new state-of-the-art results in POS tagging and dependency parsing on TB2.
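The NER annotations described above can be serialized for sequence taggers by encoding entity spans as per-token BIO tags (B- opens an entity, I- continues it, O is outside). A minimal sketch of that conversion, with an illustrative tweet and label inventory (the exact labels and example are assumptions, not drawn from the paper's data):

```python
def spans_to_bio(tokens, spans):
    """Convert token-level entity spans to BIO tags.

    tokens: list of token strings
    spans:  list of (start, end, label) tuples; end is exclusive
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # continuation tokens
    return tags

# Hypothetical tokenized tweet with two annotated entities
tokens = ["Visiting", "New", "York", "with", "@NASA", "!"]
spans = [(1, 3, "LOC"), (4, 5, "ORG")]
print(spans_to_bio(tokens, spans))
# ['O', 'B-LOC', 'I-LOC', 'O', 'B-ORG', 'O']
```

Because the BIO tags align one-to-one with the tokens, the same tokenization can feed the POS tagger, dependency parser, and NER tagger, which is what makes the concurrent multi-task training on TB2 possible.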

Evaluation and Findings

The paper undertakes an extensive evaluation of the NLP models on the newly annotated dataset and compares their performance with existing frameworks such as spaCy and FLAIR. Key observations include:

  • The integration of transformer-based models demonstrates marked improvements in POS tagging and NER, attributed to their ability to leverage large-scale pre-trained representations.
  • Stanza-based models outperform other non-transformer frameworks in tokenization and lemmatization accuracy, underscoring the efficacy of its ensemble approach combining dictionary lookup with seq2seq lemmatization strategies.
  • The empirical results reveal that the combination of Tweebank V2 and UD_English-EWT training datasets slightly diminishes performance in more complex tasks like NER and dependency parsing for transformer models, highlighting the importance of domain-specific data alignment.
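NER comparisons like those above are conventionally scored with entity-level precision, recall, and F1, where a prediction counts only if both the span boundaries and the label match exactly. A minimal sketch of that metric (the example spans are illustrative, not the paper's results):

```python
def entity_f1(gold, pred):
    """Micro-averaged entity-level precision/recall/F1.

    gold, pred: iterables of (start, end, label) spans;
    a predicted span is correct only on an exact match.
    """
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)           # exact span+label matches
    prec = tp / len(pred_set) if pred_set else 0.0
    rec = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# One entity matched exactly, one mislabeled: P = R = F1 = 0.5
gold = [(1, 3, "LOC"), (4, 5, "ORG")]
pred = [(1, 3, "LOC"), (4, 5, "PER")]
print(entity_f1(gold, pred))
# (0.5, 0.5, 0.5)
```

The strictness of the exact-match criterion is part of why noisy, colloquial tweets are hard: a tokenization or boundary error anywhere in an entity zeroes out that prediction.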

Implications and Future Prospects

The release of the dataset and models, notably the off-the-shelf Twitter-Stanza and BERTweet-based tools on platforms such as the Hugging Face Hub, represents a valuable resource for the research community. These contributions are set to facilitate further research and practical application in the field of social media analysis.

The paper provides a foundation upon which future research can build by emphasizing (1) the integration of world and domain knowledge into current NER systems to address prediction challenges with contextually ambiguous entities and (2) the exploration of domain adaptation strategies for improved cross-corpus performance.

In summary, the work represents a meaningful advancement in the adaptation of NLP tools to manage the distinct demands of Twitter data, laying the groundwork for enriched social media data analysis and comprehension.