Evaluating the Potential of LLMs in Computational Social Science
The paper "Can Large Language Models Transform Computational Social Science?" by Ziems et al. examines whether LLMs can supplement and enhance computational social science (CSS) research methodologies. The authors ask whether LLMs can reliably classify and explain social phenomena, such as persuasiveness and political ideology, without extensive supervised training data. The paper contributes a comprehensive evaluation framework, addressing the adoption of LLMs in CSS and systematically examining their performance across a curated selection of representative tasks.
Methodological Framework
The authors first delineate the role of CSS, highlighting the potential for LLMs to ease the burden of resource-intensive data labeling. They then devise a thorough evaluation schema involving an extensive set of prompts and an array of both proprietary and open-source LLMs: thirteen models are assessed across 25 CSS benchmarks, spanning a diverse spectrum of tasks typical of computational social science.
Evaluation Pipeline
The evaluation pipeline pursues two main outcomes: benchmarking zero-shot LLM performance against fine-tuned models, and identifying instances where LLMs might serve as effective augmentative tools within human annotation workflows. Tasks are divided into utterance-level, conversation-level, and document-level analyses, mirroring the levels of analysis common in CSS methodologies.
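The zero-shot setup described above can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the prompt template, label sets, and helper names (`build_zero_shot_prompt`, `parse_label`) are assumptions, and the returned string would be sent to whichever LLM API is under evaluation.

```python
def build_zero_shot_prompt(text: str, labels: list) -> str:
    """Format an utterance-level classification query with no training examples."""
    label_list = ", ".join(labels)
    return (
        f"Which of the following labels best describes this text: {label_list}?\n"
        f"Text: {text}\n"
        "Answer with a single label."
    )

def parse_label(response: str, labels: list):
    """Map a free-text model response back onto the closed label set.

    Longer labels are checked first so that e.g. "not persuasive" is not
    swallowed by its substring "persuasive".
    """
    lowered = response.lower()
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in lowered:
            return label
    return None  # unmappable responses would be scored as incorrect

labels = ["persuasive", "not persuasive"]
prompt = build_zero_shot_prompt("We must act now!", labels)
```

Mapping free-text completions back onto a fixed label set is a recurring practical step in this kind of benchmark, since zero-shot models do not always answer with the label verbatim.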
- Utterance-Level Tasks: These include detecting dialect features, emotions, figurative language, hate speech, humor, and political ideologies, among others. The tasks are regarded as foundational, allowing for robust classification and coding necessary for later inferential analyses.
- Conversation-Level Tasks: Analysis here targets dialog acts and interpersonal dynamics, including empathy detection in mental-health support contexts and power dynamics in social interactions.
- Document-Level Tasks: The focus extends to coding longer narratives within media articles, emphasizing tasks such as event argument extraction and ideological stance classification.
Model Performance
The paper yields several insights into LLM performance:
- Classification Tasks: LLMs generally did not outperform fine-tuned classifiers, but they nonetheless showed fair agreement with human annotations. These findings point to a role for LLMs as supplementary annotators that can accelerate the annotation process.
- Generative Tasks: On tasks requiring text generation, such as explaining social biases or figurative language, some LLMs produced outputs that exceeded the quality of human references in certain contexts. This highlights their utility for generating explanatory content, which typically demands nuanced detail and domain understanding.
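Agreement between LLM labels and human annotations, as discussed above, is typically measured with a chance-corrected statistic such as Cohen's kappa. The following is a self-contained sketch (the paper does not specify this exact implementation; in practice one would likely use a library such as scikit-learn):

```python
from collections import Counter

def cohens_kappa(annotator_a: list, annotator_b: list) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(annotator_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(annotator_a, annotator_b)) / n
    # Expected agreement under independence of the two label distributions.
    freq_a, freq_b = Counter(annotator_a), Counter(annotator_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. LLM labels on four items.
human = ["hate", "neutral", "neutral", "hate"]
llm = ["hate", "neutral", "hate", "hate"]
kappa = cohens_kappa(human, llm)  # 0.5 on this toy example
```

Values near 1 indicate near-perfect agreement, values near 0 indicate agreement no better than chance; the paper's "fair agreement" framing corresponds to the intermediate range of such a statistic.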
Implications and Prospective Developments
The paper posits that while LLMs are not ready to replace human annotation outright, they can form the backbone of a hybrid human-LLM labeling workflow. This implies a collaborative dynamic in which LLM-generated annotations are curated and refined by human experts, potentially yielding substantial savings in annotation effort.
Future directions for LLM application in CSS could include integrating them into thematic-analysis workflows and refining prompts and task schemas through exposure to diverse datasets. The authors advocate careful experimentation with LLMs as an evolving part of the CSS toolkit, grounded in the understanding that rigorous content analysis is essential for yielding interpretable, socially valuable insights.
Moreover, this paper paves the way for further inquiry into the ethical considerations and constraints of relying on models trained on pre-existing data, and underlines the need for regular updates and ethically informed model iteration. In sum, the research by Ziems et al. charts a rich avenue for LLMs to extend the depth and reach of computational social science, pushing the boundaries of current analytical methods while addressing pertinent practical and theoretical concerns.