
Towards Coding Social Science Datasets with Language Models (2306.02177v1)

Published 3 Jun 2023 in cs.AI

Abstract: Researchers often rely on humans to code (label, annotate, etc.) large sets of texts. This kind of human coding forms an important part of social science research, yet the coding process is both resource intensive and highly variable from application to application. In some cases, efforts to automate this process have achieved human-level accuracies, but to achieve this, these attempts frequently rely on thousands of hand-labeled training examples, which makes them inapplicable to small-scale research studies and costly for large ones. Recent advances in a specific kind of artificial intelligence tool - language models (LMs) - provide a solution to this problem. Work in computer science makes it clear that LMs are able to classify text, without the cost (in financial terms and human effort) of alternative methods. To demonstrate the possibilities of LMs in this area of political science, we use GPT-3, one of the most advanced LMs, as a synthetic coder and compare it to human coders. We find that GPT-3 can match the performance of typical human coders and offers benefits over other machine learning methods of coding text. We find this across a variety of domains using very different coding procedures. This provides exciting evidence that LMs can serve as a critical advance in the coding of open-ended texts in a variety of applications.
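The "synthetic coder" setup the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: `call_lm` stands in for any LM completion API, and the prompt template and label set are assumptions for the example.

```python
# Hypothetical sketch of an LM as a zero-shot "synthetic coder":
# build a coding prompt, query the model, and map its free-text
# answer back onto one of the coding scheme's labels.

def build_coding_prompt(text: str, labels: list[str]) -> str:
    """Construct a zero-shot prompt asking the LM to code one passage."""
    options = ", ".join(labels)
    return (
        "You are coding open-ended survey responses.\n"
        f"Assign exactly one label from: {options}.\n\n"
        f"Response: {text}\n"
        "Label:"
    )

def code_text(text: str, labels: list[str], call_lm) -> str:
    """Query the LM and normalize its reply to a known label.

    `call_lm` is a placeholder for a real completion call
    (e.g., an API client's text-completion method).
    """
    raw = call_lm(build_coding_prompt(text, labels)).strip().lower()
    for label in labels:
        if label.lower() in raw:
            return label
    return "unclassified"  # fall back when the reply matches no label
```

Comparing such a coder to humans, as the paper does, then amounts to running `code_text` over the same items the human coders labeled and computing agreement or accuracy against those labels.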

Authors (7)
  1. Christopher Michael Rytting (6 papers)
  2. Taylor Sorensen (14 papers)
  3. Lisa Argyle (2 papers)
  4. Ethan Busby (2 papers)
  5. Nancy Fulda (10 papers)
  6. Joshua Gubler (3 papers)
  7. David Wingate (24 papers)
Citations (7)