Short text classification with machine learning in the social sciences: The case of climate change on Twitter (2310.04452v1)
Abstract: To analyse large numbers of texts, social science researchers increasingly confront the challenge of text classification. When manual labeling is not feasible and researchers must find automated ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical social science research scenario: a relatively small labeled dataset in which the categories of interest occur infrequently and which is part of a larger unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset of 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate how well the methods automatically classify tweets according to whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring far less training time and fewer computational resources. The results have important implications for the analysis of short texts in social science research.
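To make the lexicon baseline mentioned in the abstract concrete, the following is a minimal sketch of a keyword-lexicon classifier scored with precision, recall, and F1 on an imbalanced toy sample. Everything in it is an assumption made for illustration: the `LEXICON` terms, the helper names, and the eight example tweets are invented and are not the lexicons, code, or data used in the paper. The choice of F1 as the metric is likewise only illustrative.

```python
import re

# Hypothetical keyword lexicon (an assumption, not the paper's lexicon).
LEXICON = {"climate", "warming", "emissions", "carbon"}


def lexicon_predict(text):
    """Return 1 if any lexicon term occurs as a token in the text, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return int(bool(tokens & LEXICON))


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented toy tweets with labels (1 = about climate change, 0 = not).
# The positive class is the minority, mirroring the scenario in the abstract.
TWEETS = [
    ("New report on climate adaptation finance released today", 1),
    ("Rising seas threaten coastal communities worldwide", 1),  # no lexicon term
    ("Our annual meeting on trade policy starts tomorrow", 0),
    ("Join the webinar on vaccine distribution", 0),
    ("The political climate ahead of the summit is tense", 0),  # ambiguous term
    ("Ministers discuss education funding", 0),
    ("Cutting carbon emissions is central to the new strategy", 1),
    ("Applications open for the youth leadership programme", 0),
]

y_true = [label for _, label in TWEETS]
y_pred = [lexicon_predict(text) for text, _ in TWEETS]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

On this toy sample the lexicon misses one relevant tweet that contains no listed term and wrongly flags one tweet that uses "climate" in a political sense. Errors of this kind, caused by ambiguous vocabulary, are what a supervised classifier trained on labeled examples may be able to reduce.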
Authors: Karina Shyrokykh, Maksym Girnyk, Lisa Dellmuth