Surveying the Dead Minds: Historical-Psychological Text Analysis with Contextualized Construct Representation (CCR) for Classical Chinese (2403.00509v1)
Abstract: In this work, we develop a pipeline for historical-psychological text analysis in classical Chinese. Humans have produced texts in various languages for thousands of years; however, most of the computational literature is focused on contemporary languages and corpora. The emerging field of historical psychology relies on computational techniques to extract aspects of psychology from historical corpora using new methods developed in NLP. The present pipeline, called Contextualized Construct Representations (CCR), combines expert knowledge in psychometrics (i.e., psychological surveys) with text representations generated via transformer-based LLMs to measure psychological constructs such as traditionalism, norm strength, and collectivism in classical Chinese corpora. Considering the scarcity of available data, we propose an indirect supervised contrastive learning approach and build the first Chinese historical psychology corpus (C-HI-PSY) to fine-tune pre-trained models. We evaluate the pipeline to demonstrate its superior performance compared with other approaches. The CCR method outperforms word-embedding-based approaches across all of our tasks and exceeds prompting with GPT-4 in most tasks. Finally, we benchmark the pipeline against objective, external data to further verify its validity.
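The core scoring idea in the abstract, comparing a text's representation against representations of psychometric survey items, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical character-bigram stand-in for the fine-tuned transformer encoder, the mean-cosine aggregation in `ccr_score` is one plausible choice, and the scale items are invented English placeholders rather than items from any published questionnaire.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: character-bigram counts.

    A real pipeline would use a transformer model here instead; this toy
    keeps the sketch runnable without any downloads.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ccr_score(text, scale_items, embed=toy_embed):
    """Mean cosine similarity between a text and each item of a scale.

    Assumption: averaging over items is used as the construct score.
    """
    if not scale_items:
        raise ValueError("scale_items must contain at least one item")
    text_vec = embed(text)
    sims = [cosine(text_vec, embed(item)) for item in scale_items]
    return sum(sims) / len(sims)


# Invented placeholder items for a "traditionalism"-like construct.
items = ["respect the customs of the ancestors", "follow the old rites"]
related = "we respect the old customs and rites of the ancestors"
unrelated = "xyz qqq"

print(ccr_score(related, items) > ccr_score(unrelated, items))
```

Passing a different `embed` callable swaps in another encoder without changing the scoring logic, which is where a model fine-tuned on a corpus such as C-HI-PSY would plug in.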