
Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages (1605.05894v2)

Published 19 May 2016 in cs.CL, cs.CY, and cs.SI

Abstract: Microblogging platforms such as Twitter provide active communication channels during mass convergence and emergency events such as earthquakes and typhoons. During the sudden onset of a crisis, affected people post useful information on Twitter that can be used for situational awareness and other humanitarian disaster response efforts, if processed in a timely and effective manner. Processing social media information poses multiple challenges, including parsing noisy, brief, and informal messages, learning information categories from the incoming stream of messages, and classifying messages into different classes. One of the basic necessities of many of these tasks is the availability of data, in particular human-annotated data. In this paper, we present human-annotated Twitter corpora collected during 19 different crises that took place between 2013 and 2015. To demonstrate the utility of the annotations, we train machine learning classifiers. Moreover, we publish the largest word2vec word embeddings to date trained on crisis-related tweets, covering 52 million tweets. To address the language issues in tweets, we present human-annotated normalized lexical resources for different lexical variations.

Citations (309)

Summary

  • The paper’s main contribution is the creation of a comprehensive annotated Twitter dataset from 19 crises for robust NLP training.
  • It details methodologies for normalizing noisy, crisis-related language and employs classifiers like SVM and Random Forest to improve accuracy.
  • The study shows that accessible, open-source data significantly enhances situational awareness and supports effective disaster response.

Analyzing Human-Annotated Twitter Corpora for Crisis-Related NLP Applications

The paper "Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages" by Muhammad Imran, Prasenjit Mitra, and Carlos Castillo provides an in-depth exploration of using Twitter data as a significant resource for processing crisis-related information. The paper addresses the complex nature of Twitter data and establishes methodologies for enhancing crisis informatics through the development of human-annotated corpora, classifier training, and normalization of informal language.

The primary contribution of this work is the construction of a substantial dataset comprising over 52 million Twitter messages collected from 19 crises between 2013 and 2015. The authors emphasize the critical role of human annotation in categorizing these tweets into relevant classes—such as displaced people, financial needs, and infrastructure damage—that are crucial for disaster response. This annotated dataset forms a foundation for training supervised machine learning models aimed at improving situational awareness and response capacity during crises.

Furthermore, the paper addresses the challenges of Twitter's noisy and informal language by providing human-annotated normalized lexical resources. The authors identified out-of-vocabulary terms prevalent in tweets posted during crises and mapped them to canonical forms to improve the performance of NLP models. This effort is pivotal given the brevity and informal nature of social media texts, which are often riddled with abbreviations, slang, and grammatical inconsistencies.
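As a minimal sketch of how such dictionary-based lexical normalization can work in practice (the mapping entries below are hypothetical examples, not drawn from the paper's actual lexical resources):

```python
# Hypothetical normalization dictionary: maps out-of-vocabulary tokens
# to canonical forms, in the spirit of the paper's annotated resources.
NORMALIZATION_MAP = {
    "plz": "please",
    "ppl": "people",
    "govt": "government",
    "2morrow": "tomorrow",
    "u": "you",
}

def normalize(tweet: str) -> str:
    """Replace known out-of-vocabulary tokens with their canonical forms."""
    return " ".join(NORMALIZATION_MAP.get(tok, tok) for tok in tweet.lower().split())

print(normalize("plz send help ppl trapped"))  # → please send help people trapped
```

A lookup table like this is only a first step; real systems also need to handle spelling variants and context-dependent expansions, which is why human annotation of the variants matters.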

For classification, the authors employ well-established algorithms (Naive Bayes, Random Forests, and Support Vector Machines) to assign crisis-related tweets to multiple categories. Their findings indicate that these classifiers perform well, achieving high area-under-ROC scores across various classes. The paper complements this classification task with the largest word2vec word embeddings trained on crisis-related tweets available to date, providing a valuable resource to the research community.
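The supervised classification setup can be illustrated with a toy multinomial Naive Bayes over bag-of-words features. The tweets, labels, and category names below are invented for illustration and are not taken from the paper's corpora:

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled tweets (illustrative only, not from the paper's data).
TRAIN = [
    ("bridge collapsed near the river", "infrastructure_damage"),
    ("roads blocked and power lines down", "infrastructure_damage"),
    ("families need food and shelter urgently", "urgent_needs"),
    ("we need water and medical supplies", "urgent_needs"),
]

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)  # per-class token counts
        self.class_counts = Counter()            # per-class document counts
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        toks = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_lp = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            lp = math.log(count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in toks:
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label

clf = NaiveBayes().fit(TRAIN)
print(clf.predict("hospital needs water supplies"))  # → urgent_needs
```

In practice the paper's classifiers are trained on far richer feature sets and much larger annotated data; this sketch only shows the shape of the task of mapping a short, informal message to a crisis-relevant category.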

In terms of practical implications, this paper provides instrumental tools for formal crisis response agencies looking to quickly and accurately assess situational information from vast streams of data on social media. Additionally, the authors propose that the publication and open-source sharing of these resources will enable further research and the development of more effective computational methods, contributing to advancements in crisis informatics.

Theoretically, the work presents a comprehensive approach to building machine learning datasets from human-annotated social media data. It also engages with the problem of dialectal and lexical variation, suggesting that datasets drawn from diverse crises can enhance the generalizability of models across different linguistic contexts.

Looking forward, this research opens avenues for further studies focused on improving NLP in short, informal message contexts. Future work could explore deeper machine learning models such as transformers, which may further refine classification and enhance semantic understanding in crisis situations. Additionally, incorporating real-time data annotation frameworks could lead to even more dynamic and responsive crisis management systems.

Overall, this paper presents significant contributions in leveraging social media data for practical applications in crisis management and opens the door for future expansion in the field of natural language processing within crisis informatics.