EcoVerse: An Annotated Twitter Dataset for Eco-Relevance Classification, Environmental Impact Analysis, and Stance Detection (2404.05133v1)

Published 8 Apr 2024 in cs.CL

Abstract: Anthropogenic ecological crisis constitutes a significant challenge that all within the academy must urgently face, including the NLP community. While recent years have seen increasing work revolving around climate-centric discourse, crucial environmental and ecological topics outside of climate change remain largely unaddressed, despite their prominent importance. Mainstream NLP tasks, such as sentiment analysis, dominate the scene, but there remains an untouched space in the literature involving the analysis of environmental impacts of certain events and practices. To address this gap, this paper presents EcoVerse, an annotated English Twitter dataset of 3,023 tweets spanning a wide spectrum of environmental topics. We propose a three-level annotation scheme designed for Eco-Relevance Classification, Stance Detection, and introducing an original approach for Environmental Impact Analysis. We detail the data collection, filtering, and labeling process that led to the creation of the dataset. Remarkable Inter-Annotator Agreement indicates that the annotation scheme produces consistent annotations of high quality. Subsequent classification experiments using BERT-based models, including ClimateBERT, are presented. These yield encouraging results, while also indicating room for a model specifically tailored for environmental texts. The dataset is made freely available to stimulate further research.

Summary

  • The paper introduces EcoVerse, a manually annotated dataset of 3,023 tweets to explore environmental discourse beyond climate change.
  • It details a rigorous annotation scheme covering eco-relevance, environmental impact (positive/neutral/negative), and stance detection with high inter-annotator agreement.
  • Benchmarking with models like RoBERTa and DistilRoBERTa demonstrates competitive performance while revealing potential biases from hashtags.

This paper introduces EcoVerse, a manually annotated dataset of 3,023 English tweets designed to facilitate research on environmental discourse beyond the typical focus on climate change. The dataset supports three NLP tasks: Eco-Relevance Classification, Environmental Impact Analysis, and Stance Detection (2404.05133). The motivation stems from the observation that while NLP work on climate change exists, broader environmental topics and the analysis of ecological impacts (beneficial or harmful) are underexplored.

Dataset Creation:

  1. Data Source: Tweets were collected via the Twitter API, covering the period from January 2019 to June 2023.
  2. Collection Strategy: A mix of sources was used to ensure diversity:
    • General environmental hashtags (#environment).
    • Hashtags potentially indicating skepticism (#climatescam, #ecoterrorism).
    • Mainstream news sources (@telegraph, @nytimes, @business) for potentially non-eco-related content.
    • Environmental organizations and publications (@natgeo, @Sierra_Magazine, @nature_org) for diverse eco-related content.
  3. Cleaning: The collected data was filtered (removing retweets, duplicates, non-English tweets, tweets shorter than 24 words, and tweets consisting only of emojis, hashtags, mentions, or URLs), formatted (removing line breaks, replacing links with "[URL]"), and deduplicated with MinHash LSH at a similarity threshold of 0.2 (a deduplication sketch follows this list).
  4. Sampling: From 21,244 cleaned tweets, 3,023 were selected for annotation, aiming for a balanced representation across the source types ("Environmental organizations", "Likely not eco-related", "Likely eco-related", "Likely skeptical") and time.
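
A minimal near-duplicate removal sketch using the datasketch library is shown below. The word-level tokenization and 128 permutations are assumptions; the summary only specifies MinHash LSH with a 0.2 similarity threshold.

```python
# Near-duplicate removal sketch with MinHash LSH (datasketch).
# Assumptions not stated above: word-level tokens and 128 permutations.
import re

from datasketch import MinHash, MinHashLSH


def normalize(tweet: str) -> str:
    """Apply the formatting steps described above: drop line breaks, mask links."""
    tweet = tweet.replace("\n", " ")
    tweet = re.sub(r"https?://\S+", "[URL]", tweet)
    return tweet.strip()


def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m


def deduplicate(tweets: list[str], threshold: float = 0.2) -> list[str]:
    """Keep a tweet only if no already-kept tweet exceeds the Jaccard similarity threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, tweet in enumerate(map(normalize, tweets)):
        m = minhash(tweet)
        if lsh.query(m):  # at least one near-duplicate was already kept
            continue
        lsh.insert(f"t{i}", m)
        kept.append(tweet)
    return kept
```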

Annotation:

  1. Scheme: A three-level scheme was developed:
    • Level 1: Eco-Relevance: Is the tweet eco-related or not eco-related?
    • Level 2: Environmental Impact Analysis: (Only for eco-related tweets) Does the tweet describe events/behaviors with positive, neutral, or negative environmental impact? This is presented as a novel task. Tweets expressing general opinions without clear events might be left untagged at this level.
    • Level 3: Stance Detection: (Only for eco-related tweets) Is the author's stance towards environmental causes supportive, neutral, or skeptical/opposing?
  2. Process: Two annotators with environmental domain knowledge used Label Studio. An iterative process involving pilot annotation, discussion, guideline refinement, main annotation, and disagreement resolution was followed.
  3. Agreement: High Inter-Annotator Agreement (Cohen's Kappa) was achieved after discussion: 0.94 for Eco-Relevance, 0.82 for Environmental Impact, and 0.86 for Stance Detection.
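
These agreement figures can be recomputed with scikit-learn's `cohen_kappa_score` once both annotators' labels are aligned per tweet. A minimal sketch follows; the file and column names are hypothetical, not the dataset's actual schema.

```python
# Inter-annotator agreement per annotation level with scikit-learn.
# File and column names ("eco_relevance_a1", ...) are hypothetical.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("ecoverse_double_annotations.csv")

for level in ["eco_relevance", "environmental_impact", "stance"]:
    a1, a2 = df[f"{level}_a1"], df[f"{level}_a2"]
    mask = a1.notna() & a2.notna()  # impact and stance are only labeled for eco-related tweets
    print(f"{level}: Cohen's kappa = {cohen_kappa_score(a1[mask], a2[mask]):.2f}")
```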

Dataset Characteristics:

  • The dataset is relatively balanced for Eco-Relevance.
  • Environmental Impact labels show a reasonable distribution, though precise balance varies slightly between annotators.
  • The Stance Detection labels are skewed towards supportive, often because tweets reporting on environmental issues (positive or negative) tend to adopt a concerned or advocacy tone.
  • It covers a wide range of environmental topics identified manually, including Biodiversity, Climate Change, Energy Sources, Deforestation, Policy, Activism, Pollution, Sustainability, etc.

Experiments:

  1. Goal: Establish benchmark performance on the three tasks using the annotated dataset (Annotator II's labels used).
  2. Models: Fine-tuned BERT-based models: BERT-base, RoBERTa-base, DistilRoBERTa-base, and the ClimateBERT variants ClimateBERT_F, ClimateBERT_S, and ClimateBERT_S+D.
  3. Setup: 80/10/10 train/eval/test split with standard hyperparameters (learning rate 3e-5, batch size 16, 10 epochs, AdamW); a fine-tuning sketch follows this list.
  4. Metrics: micro-averaged Accuracy (A_m), Macro Precision (P_M), Macro Recall (R_M), and Macro F1-score (F1_M).
  5. Results:
    • Eco-Relevance: DistilRoBERTa achieved the best accuracy (89.43%).
    • Environmental Impact: ClimateBERT_S performed best (accuracy 78.62%, F1_M 54.67%). Models struggled with the 'neutral' class.
    • Stance Detection: RoBERTa and DistilRoBERTa had the highest accuracy (~81.3%). BERT achieved the best F1_M (95.56%). The ClimateBERT models generally underperformed standard BERT/RoBERTa on this task.
  6. Bias Analysis: Removing the #climatescam hashtag confirmed its strong influence on classifying skeptical/opposing stance tweets, causing a ~4.7% drop in Stance Detection accuracy for the best model.
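
For reference, here is a minimal fine-tuning sketch with Hugging Face Transformers using the reported hyperparameters (learning rate 3e-5, batch size 16, 10 epochs; AdamW is the Trainer's default optimizer). The model choice, file names, column names, and label count are assumptions, not the paper's released code.

```python
# Fine-tuning sketch for one EcoVerse task (binary Eco-Relevance shown here).
# CSV files are assumed to contain "text" and integer "label" columns.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

ds = load_dataset("csv", data_files={"train": "train.csv",
                                     "validation": "eval.csv",
                                     "test": "test.csv"})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds,
                                                  average="macro", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "macro_p": p, "macro_r": r, "macro_f1": f1}

args = TrainingArguments(
    output_dir="ecoverse-eco-relevance",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["validation"],
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate(ds["test"]))
```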

Implementation Considerations:

  • The dataset (available on GitHub: https://github.com/GioSira/EcoVerse.git) can be used to train classifiers for environmental monitoring, public opinion analysis, or tracking specific environmental narratives on social media.
  • The Environmental Impact Analysis task offers a new dimension for NLP applications, potentially useful for identifying real-world events or practices with ecological consequences reported in text.
  • Pre-processing steps such as replacing URLs and user mentions (@username) with placeholder tokens are recommended before model training (see the sketch after this list).
  • The benchmark results suggest that standard models like RoBERTa or DistilRoBERTa perform competitively, and domain-specific models like ClimateBERT may not always offer advantages, especially for broader environmental topics or stance detection.
  • The observed bias related to #climatescam highlights the importance of considering potential shortcut learning when training models on social media data. Techniques to mitigate this, such as removing specific hashtags or using more robust models, might be necessary depending on the application.
  • The relatively small dataset size (3k tweets) means models might benefit from techniques like few-shot learning or transfer learning from larger, more general text corpora.
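
A minimal pre-processing sketch along the lines suggested above; the placeholder tokens and the hashtag-stripping option are illustrative choices, not the paper's exact procedure.

```python
# Mask URLs and user mentions, with optional hashtag removal to probe
# shortcut learning (e.g. the #climatescam effect discussed above).
import re

def preprocess(tweet: str, drop_hashtags: tuple[str, ...] = ()) -> str:
    tweet = re.sub(r"https?://\S+", "[URL]", tweet)   # mask links
    tweet = re.sub(r"@\w+", "[USER]", tweet)          # mask user mentions
    for tag in drop_hashtags:                         # e.g. ("#climatescam",)
        tweet = re.sub(re.escape(tag), "", tweet, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", tweet).strip()

print(preprocess("Wake up @nytimes! https://t.co/xyz #climatescam",
                 drop_hashtags=("#climatescam",)))
# -> "Wake up [USER]! [URL]"
```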

The paper concludes by highlighting EcoVerse as a valuable resource for broadening NLP research into diverse environmental topics and for introducing environmental impact analysis as a task. Future work includes expanding the dataset and developing language models specialized for environmental texts. The authors also tracked the experiments' carbon footprint with CodeCarbon, estimating total emissions of 0.24 kg CO2.
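
Carbon tracking of this kind can be reproduced with the codecarbon package; a minimal sketch of wrapping a training run is shown below (the project name is arbitrary).

```python
# Estimating a run's emissions with CodeCarbon, as the authors report doing.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="ecoverse-finetuning")
tracker.start()
# ... place the fine-tuning run from the earlier sketch here, e.g. trainer.train() ...
emissions_kg = tracker.stop()  # estimated emissions in kg CO2-eq
print(f"Estimated emissions: {emissions_kg:.3f} kg CO2-eq")
```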
