DADIT: A Dataset for Demographic Classification of Italian Twitter Users and a Comparison of Prediction Methods (2403.05700v1)

Published 8 Mar 2024 in cs.CL

Abstract: Social scientists increasingly use demographically stratified social media data to study the attitudes, beliefs, and behavior of the general public. To facilitate such analyses, we construct, validate, and release publicly the representative DADIT dataset of 30M tweets of 20k Italian Twitter users, along with their bios and profile pictures. We enrich the user data with high-quality labels for gender, age, and location. DADIT enables us to train and compare the performance of various state-of-the-art models for the prediction of the gender and age of social media users. In particular, we investigate if tweets contain valuable information for the task, since popular classifiers like M3 don't leverage them. Our best XLM-based classifier improves upon the commonly used competitor M3 by up to 53% F1. Especially for age prediction, classifiers profit from including tweets as features. We also confirm these findings on a German test set.

Authors (5)
  1. Lorenzo Lupo
  2. Paul Bose
  3. Mahyar Habibi
  4. Dirk Hovy
  5. Carlo Schwarz

Summary


The paper introduces DADIT, a dataset of about 30 million tweets from 20,000 Italian Twitter users, together with their bios and profile pictures. The user data is enriched with high-quality labels for gender, age, and location, which supports social science research that relies on demographically stratified social media data.

Dataset Overview

DADIT is built for demographic analysis. It includes not only the users' tweets but also their bios and profile pictures. The gender and age labels were manually verified to ensure their accuracy. Gender labels were derived from the full name field, exploiting the fact that Italian first names are usually gender-specific. Age labels were generated with regex patterns that match statements of age or birth year in a user's bio or tweets. The authors validate the dataset as representative of the broader Italian Twitter user base in terms of demographic characteristics.
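The regex-based age labeling described above can be illustrated with a minimal sketch. The two patterns and the function name `extract_birth_year` are hypothetical; the summary does not list the expressions the authors actually used, so treat these as illustrative Italian phrasings only.

```python
import re
from typing import Optional

# Hypothetical patterns, NOT the authors' actual expressions.
# Matches "ho 30 anni" / "30 anni" (a stated age).
AGE_PATTERN = re.compile(r"\b(?:ho\s+)?(\d{1,2})\s+anni\b", re.IGNORECASE)
# Matches "classe 1985" / "nato nel 1985" / "nata nel 1985" (a stated birth year).
BIRTH_YEAR_PATTERN = re.compile(
    r"\b(?:classe|nat[oa]\s+nel)\s+(19\d{2}|20\d{2})\b", re.IGNORECASE
)


def extract_birth_year(text: str, posted_year: int) -> Optional[int]:
    """Return an estimated birth year from a bio or tweet, or None if nothing matches."""
    match = BIRTH_YEAR_PATTERN.search(text)
    if match:
        return int(match.group(1))
    match = AGE_PATTERN.search(text)
    if match:
        age = int(match.group(1))
        # Discard implausible ages such as "da 5 anni su Twitter".
        if 10 <= age <= 99:
            return posted_year - age
    return None


print(extract_birth_year("Classe 1985, tifoso del Milan", 2020))
print(extract_birth_year("Ho 30 anni e amo il mare", 2021))
print(extract_birth_year("Amo la pizza", 2021))
```

Patterns like these still produce false positives (for example "lavoro qui da 20 anni"), which is one reason a manual verification step is needed on top of the automatic matching.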

Methodology for Gender and Age Classification

The paper highlights the performance of various models trained on this dataset for predicting user demographics, with a focus on gender and age. The primary models considered include:

  • M3 Classifier: A commonly used multimodal model that predicts demographics from user bios, profile pictures, and names. It does not use tweets, and it struggles when some of the profile information is missing.
  • CV (Computer Vision) Model: Utilizes profile pictures to infer demographic attributes.
  • XLM (Transformer Model): A classifier based on a fine-tuned twitter-XLM-roberta-base, which makes extensive use of users' bios and tweets for prediction.
  • Flan-T5 and GPT-3.5: Instruction-tuned LLMs tested in zero-shot and few-shot settings to classify demographics from text inputs.
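To make the zero-shot and few-shot settings concrete, here is a minimal sketch of how a classification prompt might be assembled from a bio and tweets. The prompt wording, the function name `build_prompt`, and the example users are all hypothetical; the summary does not reproduce the prompts used in the paper.

```python
from typing import Iterable, List, Tuple

# (bio, tweets, label) for one labeled demonstration; purely illustrative.
Example = Tuple[str, List[str], str]


def build_prompt(bio: str, tweets: List[str], examples: Iterable[Example] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    lines = ["Classify the gender of the Twitter user as 'male' or 'female'.", ""]
    # Few-shot: prepend labeled demonstrations in the same format as the query.
    for ex_bio, ex_tweets, ex_label in examples:
        lines += [
            f"Bio: {ex_bio}",
            f"Tweets: {' | '.join(ex_tweets)}",
            f"Gender: {ex_label}",
            "",
        ]
    # The query user; the model is expected to complete the final line.
    lines += [f"Bio: {bio}", f"Tweets: {' | '.join(tweets)}", "Gender:"]
    return "\n".join(lines)


zero_shot = build_prompt("Mamma di due bimbi", ["Buongiorno a tutti"])
few_shot = build_prompt(
    "Mamma di due bimbi",
    ["Buongiorno a tutti"],
    examples=[("Papà e ingegnere", ["Forza Juve"], "male")],
)
print(few_shot)
```

The only difference between the two settings in this sketch is whether labeled demonstrations precede the query; the model weights are left unchanged in both cases.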

The findings suggest that incorporating tweet data improves model performance, particularly for age classification. The best-performing model was the XLM-based classifier, which improved on the M3 classifier by up to 53% F1 and remained effective when evaluated on a German test set.
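Because the comparison is reported in terms of F1, a short sketch of how an F1-score can be computed over several classes may help. Macro averaging (the unweighted mean of per-class F1) is an assumption here; the summary does not state which averaging the paper uses, and the labels below are toy data, not results from the paper.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one of two "f" users is misclassified as "m".
score = macro_f1(["f", "f", "m", "m"], ["f", "m", "m", "m"])
print(round(score, 4))  # → 0.7333
```

Macro averaging weights every class equally, which matters for age prediction when some age groups are much rarer than others.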

Experimental Results

The experimental results underscore three crucial aspects:

  1. Improvement with Tweet Inclusion: Both gender and age classifiers performed better when tweets were included as features, with the largest gains in age prediction. The fine-tuned XLM model achieved higher F1-scores than M3 across tasks.
  2. Model Robustness: The XLM model trained on Italian data generalized well to the German test set, highlighting the robustness of the multilingual model and the value of demographically annotated datasets like DADIT.
  3. Ensemble Learning: Combining the predictions of the XLM and M3 models yielded further gains in gender classification, but not in age prediction.
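One simple way to combine two classifiers is to average their class probabilities. The sketch below assumes this weighted-average scheme purely for illustration; the summary does not say how the XLM and M3 predictions were combined in the paper, and the probabilities shown are made up.

```python
from typing import Dict, Tuple


def ensemble_predict(
    probs_a: Dict[str, float],
    probs_b: Dict[str, float],
    weight_a: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Weighted average of two models' class probabilities; returns (label, combined)."""
    combined = {
        label: weight_a * probs_a[label] + (1.0 - weight_a) * probs_b[label]
        for label in probs_a
    }
    best_label = max(combined, key=combined.get)
    return best_label, combined


# Hypothetical outputs of a text model and an image/profile model for one user.
text_probs = {"female": 0.55, "male": 0.45}
profile_probs = {"female": 0.20, "male": 0.80}

label_equal, _ = ensemble_predict(text_probs, profile_probs)
label_text_heavy, _ = ensemble_predict(text_probs, profile_probs, weight_a=0.9)
print(label_equal, label_text_heavy)
```

With equal weights the confident profile-based prediction wins; shifting the weight toward the text model flips the decision. In practice the weight would be tuned on held-out data.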

Implications and Future Directions

The construction and release of DADIT have broad implications for computational social science and NLP research. The dataset provides a rich resource for developing models that require robust demographic information. Its potential extends beyond Italy, as demonstrated by the successful application of models trained on Italian data to German Twitter users.

Future research could explore the following avenues:

  • Enhanced Multimodal Approaches: Further improving multimodal models by training vision models on DADIT directly, rather than relying on pre-trained weights.
  • Advanced Ensemble Methods: Developing sophisticated ensemble techniques that synergize text and image models more effectively.
  • Broader Application: Extending demographic prediction models to other languages and cultural contexts using similar datasets.

Conclusion

DADIT fills a gap in the resources needed for demographic analysis on social media and provides evidence that tweet content matters for predicting demographic attributes accurately. The paper shows that a language model fine-tuned on bios and tweets can clearly outperform the commonly used multimodal M3 classifier.

The released dataset and the accompanying findings give researchers in computational social science, NLP, and related fields the means to stratify user data demographically and to build demographic prediction models that integrate text data.