HumAID: Human-Annotated Disaster Incidents Data from Twitter with Deep Learning Benchmarks (2104.03090v2)

Published 7 Apr 2021 in cs.CL, cs.AI, cs.CY, cs.LG, and cs.SI

Abstract: Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts on creating labeled datasets. However, existing datasets are limited in different aspects (e.g., small size, duplicate content) and are less suitable for supporting more advanced and data-hungry deep learning models. In this paper, we present a new large-scale dataset with ~77K human-labeled tweets, sampled from a pool of ~24 million tweets across 19 disaster events that happened between 2016 and 2019. Moreover, we propose a data collection and sampling pipeline, which is important for social media data sampling for human annotation. We report multiclass classification results using classic and deep learning (fastText and transformer) based models to set the ground for future studies. The dataset and associated resources are publicly available. https://crisisnlp.qcri.org/humaid_dataset.html

Citations (57)

Summary

  • The paper introduces HumAID, a large-scale human-annotated Twitter dataset for disaster events designed to support deep learning research.
  • It details a rigorous multi-step data collection and filtering process ensuring high-quality, context-relevant tweets from major disasters.
  • Benchmark experiments show transformer models outperform classical methods, establishing strong baselines for crisis informatics.

This paper introduces HumAID, a large-scale, human-annotated dataset of Twitter messages related to disaster incidents, along with deep learning benchmark results for classifying these tweets (2104.03090). The motivation stems from the limitations of existing crisis informatics datasets, which are often small, contain duplicates, lack consistent annotation schemes, and are less suitable for training modern data-hungry deep learning models.

Dataset Creation:

  1. Data Collection: Over 24 million tweets were collected using the AIDR system via the Twitter streaming API for 19 major disaster events (hurricanes, earthquakes, wildfires, floods) between 2016 and 2019. Collection used event-specific keywords and hashtags.
  2. Filtering Pipeline: A multi-step filtering process was applied to enrich the dataset with relevant information, particularly from disaster-hit areas:
    • Date-based: Restricted tweets to actual event dates.
    • Location-based: Prioritized geo-tagged tweets, then place fields, and finally user-provided profile locations (resolved using Nominatim/OpenStreetMap) to keep only tweets from curated lists of disaster-hit locations (a geocoding sketch follows this list).
    • Language-based: Kept only English tweets identified by Twitter's metadata.
    • Classifier-based: Used a Random Forest classifier (trained on existing humanitarian data) to remove tweets classified as "not-humanitarian".
    • Word-count-based: Retained tweets with at least three words or hashtags (excluding URLs and numbers).
    • Near-duplicate filtering: Removed exact and near-duplicate tweets based on cosine similarity (threshold > 0.75) of uni- and bi-gram vectors after removing URLs, mentions, etc. (a deduplication sketch follows this list).
  3. Sampling: From the filtered pool (337,082 tweets), 109,612 tweets were randomly sampled for annotation, aiming for a fair distribution across predicted classes.
  4. Annotation:
    • Platform: Amazon Mechanical Turk (AMT).
    • Scheme: Defined 11 categories based on humanitarian aid needs (e.g., "Caution and advice", "Infrastructure and utility damage", "Requests or urgent needs", "Rescue volunteering or donation effort", "Not humanitarian", etc.). Annotators assigned a single primary label per tweet.
    • Quality Control: Implemented a qualification test (>=6/10 correct answers), used gold standard tweets within each HIT (Human Intelligence Task), required 70% accuracy on gold standards for assignment approval, and needed at least 2 out of 3 annotators (66% agreement) to agree on a label for a tweet to be included in the final dataset.
    • Final Dataset: Resulted in HumAID, containing 77,196 annotated tweets with agreed-upon labels.
    • Agreement: Achieved moderate to substantial inter-annotator agreement, with an average Fleiss' kappa of 0.55 and an average Krippendorff's alpha of 0.57 across events (an agreement-computation sketch follows this list).
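
The location filter (step 2) can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the geopy library as a Nominatim client, an illustrative tweet-dictionary layout, and an invented AFFECTED_AREAS set standing in for the curated list of disaster-hit locations.

```python
# Illustrative location-based filter: prefer geo-tags, then the place field,
# then the free-text profile location resolved via Nominatim/OpenStreetMap.
# geopy and the tweet dict layout are assumptions.
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="humaid-filter-demo")  # hypothetical user agent

# Curated disaster-hit areas (illustrative values, not from the paper).
AFFECTED_AREAS = {"Puerto Rico", "Florida", "Texas"}

def tweet_location(tweet):
    """Return a resolved location string, preferring the most reliable field."""
    if tweet.get("coordinates"):                  # 1) geo-tagged tweets
        lat, lon = tweet["coordinates"]           # assumed (lat, lon) ordering
        place = geocoder.reverse((lat, lon), exactly_one=True)
        return place.address if place else None
    if tweet.get("place"):                        # 2) Twitter "place" field
        return tweet["place"]
    if tweet.get("user_location"):                # 3) free-text profile location
        match = geocoder.geocode(tweet["user_location"], exactly_one=True)
        return match.address if match else None
    return None

def in_affected_area(tweet):
    resolved = tweet_location(tweet)
    return resolved is not None and any(area in resolved for area in AFFECTED_AREAS)
```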
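The near-duplicate filter can likewise be sketched. The snippet below follows the description above (uni- and bi-gram count vectors, cosine similarity, 0.75 threshold), but the scikit-learn vectorizer settings and the greedy keep-first strategy are assumptions; at the scale of the original tweet pool the similarity computation would need to be batched or indexed rather than formed as one dense matrix.

```python
# Sketch of near-duplicate removal with uni-/bi-gram vectors and a 0.75
# cosine-similarity threshold; implementation details are assumptions.
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def strip_entities(text):
    """Remove URLs and @-mentions before comparing tweets."""
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return text.lower()

def deduplicate(tweets, threshold=0.75):
    """Keep the first occurrence of each cluster of near-duplicate tweets."""
    cleaned = [strip_entities(t) for t in tweets]
    vectors = CountVectorizer(ngram_range=(1, 2)).fit_transform(cleaned)
    sims = cosine_similarity(vectors)
    keep, kept_idx = [], []
    for i, tweet in enumerate(tweets):
        if all(sims[i, j] <= threshold for j in kept_idx):
            keep.append(tweet)
            kept_idx.append(i)
    return keep
```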
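Finally, the label-aggregation rule (at least 2 of 3 annotators agreeing) and Fleiss' kappa can be reproduced in a few lines; the use of statsmodels and the toy annotation matrix are assumptions, since the paper only reports the resulting agreement scores.

```python
# Sketch of majority-vote label aggregation and Fleiss' kappa; statsmodels
# and the example annotation matrix are assumptions.
from collections import Counter
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def majority_label(labels, min_agreement=2):
    """Return the agreed label if enough annotators chose it, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

# annotations: (n_tweets, 3) integer category codes from three annotators
annotations = np.array([[0, 0, 1],
                        [2, 2, 2],
                        [1, 0, 2]])            # illustrative values
table, _ = aggregate_raters(annotations)        # per-tweet category counts
print(fleiss_kappa(table))                      # chance-corrected agreement
print([majority_label(row) for row in annotations])  # [0, 2, None]
```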

Benchmarking Experiments:

  1. Setup: Conducted multi-class classification experiments at three levels: individual event, event type (e.g., all earthquakes combined), and all data combined. Standard 70%/10%/20% train/dev/test splits were created and made publicly available. Near-duplicates across splits for combined datasets were removed.
  2. Preprocessing: Included removal of stop words, non-ASCII characters, punctuation, numbers, URLs, and hashtag signs (a preprocessing sketch follows this list).
  3. Models:
    • Classical: Random Forest (RF) and Support Vector Machines (SVM) with tf-idf weighted n-gram features (uni-, bi-, and tri-grams); a baseline-and-evaluation sketch follows this list.
    • Deep Learning: fastText (with pre-trained Common Crawl embeddings) and the transformer models BERT, RoBERTa, XLM-RoBERTa, and DistilBERT, fine-tuned with the Hugging Face Transformers library; a fine-tuning sketch also follows this list.
  4. Evaluation: Used weighted average Precision, Recall, and F1-score to account for class imbalance.
  5. Results:
    • Transformer-based models (BERT, RoBERTa, XLM-R, DistilBERT) consistently outperformed classical models (SVM, RF) and fastText across all experimental settings (event, event-type, combined).
    • RoBERTa achieved the highest average F1-score (0.781), slightly outperforming XLM-R (0.777), BERT (0.768), and DistilBERT (0.769).
    • DistilBERT offered performance comparable to BERT with fewer parameters, suggesting it as a practical choice for deployment.
    • SVM generally performed slightly better than RF among classical models.
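
A minimal version of the preprocessing step might look as follows; the NLTK stop-word list and the particular regular expressions are assumptions, as the paper does not specify its exact implementation.

```python
# Sketch of tweet preprocessing: strip URLs, non-ASCII characters, hashtag
# signs, numbers, punctuation, and stop words. NLTK stop words are an assumption.
import re
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))   # requires nltk.download("stopwords")

def preprocess(text):
    text = re.sub(r"http\S+", " ", text)                   # URLs
    text = text.encode("ascii", "ignore").decode()         # non-ASCII characters
    text = text.replace("#", " ")                          # hashtag sign, keep the word
    text = re.sub(r"\d+", " ", text)                       # numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]  # stop words
    return " ".join(tokens)
```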
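The classical baselines and the weighted evaluation can be sketched with scikit-learn. The snippet below pairs a tf-idf uni- to tri-gram vectorizer with a linear SVM and reports weighted precision/recall/F1 as described above; the tiny in-line training data, the LinearSVC choice, and the default hyperparameters are illustrative assumptions.

```python
# Sketch of a classical baseline: tf-idf (uni- to tri-grams) + linear SVM,
# evaluated with weighted precision/recall/F1. Data and settings are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["need water and food at the shelter",          # illustrative examples,
               "thoughts and prayers for everyone affected"]  # not from the dataset
train_labels = ["requests_or_urgent_needs", "sympathy_and_support"]
test_texts = ["urgent need for drinking water"]
test_labels = ["requests_or_urgent_needs"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),   # uni-, bi-, tri-gram tf-idf features
    LinearSVC(),
)
model.fit(train_texts, train_labels)

pred = model.predict(test_texts)
p, r, f1, _ = precision_recall_fscore_support(test_labels, pred, average="weighted",
                                              zero_division=0)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```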
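The transformer baselines can be approximated with the Hugging Face Transformers library. The sketch below fine-tunes roberta-base on a toy two-class subset; the label set, hyperparameters, and the datasets-based input pipeline are assumptions rather than the paper's exact configuration.

```python
# Sketch of fine-tuning a transformer classifier on HumAID-style labels.
# Model choice, hyperparameters, and toy data are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["not_humanitarian", "rescue_volunteering_or_donation_effort"]  # illustrative subset
train = Dataset.from_dict({
    "text": ["stay safe everyone", "donate water and blankets to the shelter"],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="humaid-roberta",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```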

Contributions and Significance:

  • Provides the largest publicly available human-annotated Twitter dataset (HumAID) for crisis informatics, specifically tailored for humanitarian aid categories.
  • Details a rigorous data collection, filtering, and annotation pipeline designed to maximize data quality and relevance.
  • Offers standardized data splits (train/dev/test) to ensure reproducible research and fair comparison of future models.
  • Establishes strong baseline results using both classical and state-of-the-art transformer models, demonstrating the effectiveness of deep learning on this task.
  • The dataset and associated resources (code, splits) are made publicly available to advance research in disaster response and humanitarian aid.

The paper concludes that HumAID addresses key limitations of previous datasets and provides a valuable resource for developing and evaluating more sophisticated models for extracting actionable information from social media during crises.