DADIT: A Dataset for Demographic Classification of Italian Twitter Users and a Comparison of Prediction Methods (2403.05700v1)
Abstract: Social scientists increasingly use demographically stratified social media data to study the attitudes, beliefs, and behavior of the general public. To facilitate such analyses, we construct, validate, and publicly release the representative DADIT dataset of 30M tweets from 20k Italian Twitter users, along with their bios and profile pictures. We enrich the user data with high-quality labels for gender, age, and location. DADIT enables us to train and compare the performance of various state-of-the-art models for predicting the gender and age of social media users. In particular, we investigate whether tweets contain valuable information for the task, since popular classifiers like M3 do not leverage them. Our best XLM-based classifier improves upon the commonly used competitor M3 by up to 53% F1. Classifiers profit from including tweets as features, especially for age prediction. We also confirm these findings on a German test set.
Authors:
- Lorenzo Lupo
- Paul Bose
- Mahyar Habibi
- Dirk Hovy
- Carlo Schwarz