Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19 (2005.06012v4)

Published 2 May 2020 in cs.SI and cs.CL

Abstract: We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.
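The abstract reports both classifiers' performance as F1, the harmonic mean of precision and recall. A minimal sketch of that metric (the labels below are hypothetical, not from the paper's data):

```python
# F1 for a binary classifier: harmonic mean of precision and recall.
def f1_score(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical gold labels (1 = COVID-related tweet) vs. model predictions.
gold = [1, 1, 0, 1, 0, 0, 1, 1]
pred = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(f1_score(gold, pred), 3))
```

Here precision and recall are both 4/5, so F1 is 0.8; the paper's best models reach F1=0.97 (pandemic relevance) and F1=0.92 (misinformation detection).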

Authors (6)
  1. Muhammad Abdul-Mageed (102 papers)
  2. AbdelRahim Elmadany (33 papers)
  3. El Moatez Billah Nagoudi (31 papers)
  4. Dinesh Pabbi (1 paper)
  5. Kunal Verma (4 papers)
  6. Rannie Lin (1 paper)
Citations (19)
