Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning Matches Human Performance in Some Hermeneutic Tasks (2401.15170v2)

Published 26 Jan 2024 in cs.CL and cs.AI

Abstract: Qualitative coding, or content analysis, extracts meaning from text to discern quantitative patterns across a corpus of texts. Recently, advances in the interpretive abilities of LLMs offer potential for automating the coding process (applying category labels to texts), thereby enabling human researchers to concentrate on more creative research aspects, while delegating these interpretive tasks to AI. Our case study comprises a set of socio-historical codes on dense, paragraph-long passages representative of a humanistic study. We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not. Compared to our human-derived gold standard, GPT-4 delivers excellent intercoder reliability (Cohen's $\kappa \geq 0.79$) for 3 of 9 codes, and substantial reliability ($\kappa \geq 0.6$) for 8 of 9 codes. In contrast, GPT-3.5 greatly underperforms for all codes ($mean(\kappa) = 0.34$; $max(\kappa) = 0.55$). Importantly, we find that coding fidelity improves considerably when the LLM is prompted to give rationale justifying its coding decisions (chain-of-thought reasoning). We present these and other findings along with a set of best practices for adapting traditional codebooks for LLMs. Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis. Furthermore, they suggest the next generation of models will likely render AI coding a viable option for a majority of codebooks.
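
The reliability figures quoted above are Cohen's kappa scores comparing the LLM's code assignments against the human-derived gold standard. As a rough illustration of how such intercoder agreement is scored, here is a minimal, self-contained Python sketch (not taken from the paper; the passages and labels are hypothetical) that computes kappa for a single binary code applied to a set of passages:

```python
# Minimal sketch: Cohen's kappa between an LLM coder and a human gold standard
# for one binary code. The label lists below are hypothetical illustrations.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders' labels over the same passages."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of passages both coders labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: 1 = the code applies to the passage, 0 = it does not.
human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
llm   = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(f"kappa = {cohen_kappa(human, llm):.2f}")  # -> kappa = 0.80
```

On this toy data, 9 of 10 passages agree, giving kappa of 0.80, which would fall in the "excellent" band the paper reports for GPT-4 on 3 of its 9 codes.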
