
Surveying the Dead Minds: Historical-Psychological Text Analysis with Contextualized Construct Representation (CCR) for Classical Chinese (2403.00509v1)

Published 1 Mar 2024 in cs.CL, cs.AI, and cs.CY

Abstract: In this work, we develop a pipeline for historical-psychological text analysis in classical Chinese. Humans have produced texts in various languages for thousands of years; however, most of the computational literature is focused on contemporary languages and corpora. The emerging field of historical psychology relies on computational techniques to extract aspects of psychology from historical corpora using new methods developed in NLP. The present pipeline, called Contextualized Construct Representations (CCR), combines expert knowledge in psychometrics (i.e., psychological surveys) with text representations generated via transformer-based LLMs to measure psychological constructs such as traditionalism, norm strength, and collectivism in classical Chinese corpora. Considering the scarcity of available data, we propose an indirect supervised contrastive learning approach and build the first Chinese historical psychology corpus (C-HI-PSY) to fine-tune pre-trained models. We evaluate the pipeline to demonstrate its superior performance compared with other approaches. The CCR method outperforms word-embedding-based approaches across all of our tasks and exceeds prompting with GPT-4 in most tasks. Finally, we benchmark the pipeline against objective, external data to further verify its validity.


Summary

  • The paper presents the novel CCR pipeline that integrates psychometrics with Transformer-based models to analyze psychological constructs in classical Chinese texts.
  • It employs indirect supervised contrastive learning and introduces the C-HI-PSY corpus to address data scarcity and the linguistic challenges of classical Chinese.
  • Evaluation shows CCR outperforms traditional word-embedding methods and GPT-4 prompting in extracting historical psychological insights.

Advancing Historical Psychological Text Analysis with Contextualized Construct Representation in Classical Chinese

Introduction to the Study

The paper introduces a novel computational pipeline called Contextualized Construct Representation (CCR), tailored for historical-psychological text analysis in classical Chinese. The motivation stems from the need to explore rich historical corpora that encapsulate the psychological constructs of ancient populations, a task that remains underexplored due to the historical and linguistic complexities of such texts. The CCR pipeline combines psychometrics with text representations from transformer-based LLMs to examine psychological constructs such as traditionalism and collectivism in classical Chinese texts. By fine-tuning transformer-based models on a new Chinese historical psychology corpus, the approach addresses a significant gap in the literature and opens new avenues in the quantitative study of history through the lens of psychological constructs.
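The core CCR idea can be illustrated as follows: embed the items of a psychometric scale and a historical passage with a sentence encoder, then score the construct as the mean cosine similarity between the passage and the items. The sketch below is a minimal stand-in, not the paper's implementation: it uses a toy character-bigram encoder in place of the fine-tuned transformer, and the English scale items are hypothetical placeholders.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a transformer sentence encoder:
    hashes character bigrams into a fixed-size count vector."""
    vec = np.zeros(64)
    for a, b in zip(text, text[1:]):
        vec[hash(a + b) % 64] += 1.0
    return vec

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def construct_score(passage: str, scale_items: list[str]) -> float:
    """CCR-style score: mean cosine similarity between the passage
    embedding and the embeddings of the psychometric scale items."""
    p = embed(passage)
    return float(np.mean([cosine(p, embed(item)) for item in scale_items]))

# Hypothetical stand-ins for questionnaire items measuring traditionalism.
traditionalism_items = [
    "We should preserve the ways of our ancestors.",
    "Old customs and rites must be upheld.",
]
score = construct_score("Uphold the customs of our ancestors.", traditionalism_items)
```

In the actual pipeline the encoder would be a transformer fine-tuned on classical Chinese, and the items would come from the cross-lingually converted questionnaires.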

Methodological Insights

The CCR pipeline involves several key steps aimed at capturing psychological constructs from historical texts. A notable feature is its use of expert knowledge in psychometrics and of indirect supervised contrastive learning for model fine-tuning. The creation of the C-HI-PSY corpus, the first of its kind, along with a cross-lingual questionnaire-conversion pipeline, stands out as a methodological innovation designed to address the linguistic challenges of working with classical Chinese texts. Fine-tuning pre-trained transformer models on this corpus further underscores the work's methodological rigor and its potential to improve the representation of psychological constructs in historical texts.
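Contrastive fine-tuning of sentence encoders is typically driven by an InfoNCE-style objective; the sketch below, with random vectors standing in for sentence embeddings and parameter choices (dimension, temperature) that are illustrative rather than the paper's, shows how such a loss pulls an anchor toward its (possibly indirectly supervised) positive pair and pushes it away from in-batch negatives.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style contrastive loss: the anchor should be more similar
    to its positive pair than to any of the negatives."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    # Positive similarity sits at index 0 of the logit vector.
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
positive = anchor + 0.1 * rng.normal(size=8)     # near-duplicate pair
negatives = [rng.normal(size=8) for _ in range(4)]
loss = info_nce_loss(anchor, positive, negatives)
```

The loss is strictly decreasing in the anchor-positive similarity, so gradient updates on the encoder move indirectly supervised pairs closer together in embedding space.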

Evaluation and Results

The CCR method demonstrated superior performance over existing word-embedding-based approaches and, notably, outperformed GPT-4 prompting in most tasks. These findings validate the efficacy of the CCR pipeline and attest to the capability of contextualized models to represent psychological constructs in historical texts. Furthermore, validating CCR against historically verified attitudes toward reforms provided tangible evidence of the model's practical utility and its potential contribution to our understanding of historical psychology through text analysis.

Benchmarking and Implications for Historical Psychology

The benchmarking of CCR using the dataset on officials' attitudes toward reform in the 11th century provides a concrete example of how the pipeline can be applied to real-world historical texts to extract psychological insights. The significant correlations found between the constructs of traditionalism and authority and the officials' attitudes underscore the method's relevance and effectiveness. Such an application not only illustrates the potential of CCR in historical-psychological research but also offers a new lens through which historical events and figures can be analyzed psychologically.
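Benchmarking of this kind reduces to correlating CCR construct scores against externally coded attitudes. A minimal sketch, with invented numbers that are purely illustrative and not the paper's data:

```python
import numpy as np

# Hypothetical data: CCR traditionalism scores for six officials and an
# external coding of their opposition to reform (higher = more opposed).
ccr_scores = np.array([0.81, 0.74, 0.62, 0.55, 0.43, 0.30])
reform_opposition = np.array([0.9, 0.8, 0.7, 0.5, 0.4, 0.2])

# Pearson's r between the text-derived construct and the external measure.
r = np.corrcoef(ccr_scores, reform_opposition)[0, 1]
```

A strong positive correlation of this form is what licenses the claim that text-derived construct scores track independently documented behavior.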

Conclusion and Future Directions

The paper makes a compelling case for the utilization of advanced NLP techniques in the exploration of psychological constructs within historical corpora. The successful development and validation of the CCR pipeline mark a significant advancement in the interdisciplinary field of historical psychology and computational linguistics. Looking ahead, the paper paves the way for further research into other languages and periods, expanding our understanding of historical psychology across different cultures and timeframes. Moreover, addressing the limitations noted in the paper, specifically the noise introduced by the indirect supervised learning approach, could further refine the CCR pipeline and enhance its applicability to a broader range of historical texts.

In conclusion, this research opens new frontiers in the computational analysis of historical texts, offering valuable tools and methodologies for historians, psychologists, and computational linguists interested in exploring the psychological dimensions of historical narratives. The future development of this field could significantly enrich our understanding of the human past, bridging the gap between historical events and psychological analysis.
