Prompt Stability Scoring for Text Annotation with Large Language Models (2407.02039v1)

Published 2 Jul 2024 in cs.CL

Abstract: Researchers are increasingly using language models (LMs) for text annotation. These approaches rely only on a prompt telling the model to return a given output according to a set of instructions. The reproducibility of LM outputs may nonetheless be vulnerable to small changes in the prompt design. This calls into question the replicability of classification routines. To tackle this problem, researchers have typically tested a variety of semantically similar prompts to determine what we call "prompt stability." These approaches remain ad-hoc and task specific. In this article, we propose a general framework for diagnosing prompt stability by adapting traditional approaches to intra- and inter-coder reliability scoring. We call the resulting metric the Prompt Stability Score (PSS) and provide a Python package PromptStability for its estimation. Using six different datasets and twelve outcomes, we classify >150k rows of data to: a) diagnose when prompt stability is low; and b) demonstrate the functionality of the package. We conclude by providing best practice recommendations for applied researchers.

Prompt Stability Scoring for Text Annotation with LLMs

In "Prompt Stability Scoring for Text Annotation with LLMs," Christopher Barrie, Elli Palaiologou, and Petter Törnberg present a comprehensive framework to address a crucial issue in using LLMs for text annotation: the reproducibility of model outputs under varying prompt designs. While LLMs have significantly advanced the field of automated text classification, the authors recognize that subtle changes in prompt phrasing can lead to substantial variability in model outputs, raising concerns about the replicability of zero- or few-shot classification tasks.

Introduction and Motivation

The paper begins by highlighting the increasing prevalence of LLMs for text annotation tasks. Unlike traditional machine learning paradigms that rely on extensive training and testing datasets, LLMs often allow researchers to perform complex classification tasks with minimal input—typically in the form of a prompt describing the desired output. However, the authors point out that the fragility of these models' outputs to slight variations in prompt wording can undermine the stability and reliability of text classification systems. Therefore, the main objective of this paper is to establish a general framework for diagnosing "prompt stability" by drawing on methodologies from intra- and inter-coder reliability scoring.

Methodology

The authors propose a two-pronged approach for evaluating prompt stability: intra-prompt stability and inter-prompt stability.

Intra-Prompt Stability

Intra-prompt stability measures the consistency of model classifications over multiple iterations of the same prompt. Using Krippendorff's Alpha (KA) as the reliability metric, the authors simulate a standard research design where the same data is classified repeatedly by the same prompt. They apply this technique to classify multiple datasets, including Twitter messages by US Senators, UK political party manifestos, and New York Times articles, among others. The results are visualized to depict how the stability scores evolve over successive iterations, providing a quantitative measure of intra-prompt reliability.
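
To make this concrete, the following minimal sketch shows how repeated runs of a single prompt could be scored with Krippendorff's Alpha using the krippendorff Python package. The `classify` argument is a hypothetical wrapper around whatever LLM annotation call a researcher uses (returning numeric category codes); this is an illustration of the idea, not the paper's own implementation or the PromptStability package API.

```python
# Minimal sketch of intra-prompt stability scoring (illustrative, not the paper's code).
# Requires: pip install numpy krippendorff
import numpy as np
import krippendorff


def intra_prompt_stability(classify, prompt, texts, iterations=10):
    """Annotate the same texts `iterations` times with one fixed prompt and
    score agreement across runs with Krippendorff's Alpha (nominal data).

    `classify(prompt, text)` is a hypothetical LLM wrapper returning a
    numeric category code for each text.
    """
    # Rows = "coders" (here, repeated runs); columns = units (texts).
    runs = np.array(
        [[classify(prompt, text) for text in texts] for _ in range(iterations)],
        dtype=float,
    )
    return krippendorff.alpha(reliability_data=runs,
                              level_of_measurement="nominal")
```

Tracking how this score behaves as the number of iterations grows mirrors the stability-over-iterations view described above.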

Inter-Prompt Stability

Inter-prompt stability extends the analysis by examining the consistency of model outputs across different but semantically similar prompts. To generate these prompts, the authors use the PEGASUS model with varying "temperature" settings to produce paraphrased versions of the original prompt. By testing classification stability across different temperatures, the framework gauges how susceptible a model's outputs are to changes in prompt wording. This approach simulates scenarios where different researchers might design slightly varying prompts for similar tasks, offering insights into the robustness of LLMs under real-world conditions.
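
As an illustration of the paraphrasing step, a minimal sketch using a publicly available PEGASUS paraphrase checkpoint might look as follows. The `tuner007/pegasus_paraphrase` model name is an assumption made here for illustration; the paper does not tie its framework to any specific checkpoint.

```python
# Sketch of temperature-controlled paraphrase generation with PEGASUS.
# Requires: pip install transformers sentencepiece torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"  # assumed checkpoint, not specified by the paper
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)


def paraphrase(prompt, temperature=1.0, n_variants=5):
    """Return `n_variants` rewrites of `prompt`; higher temperatures
    produce more divergent paraphrases."""
    inputs = tokenizer([prompt], truncation=True, padding="longest",
                       return_tensors="pt")
    outputs = model.generate(**inputs,
                             do_sample=True,
                             temperature=temperature,
                             num_return_sequences=n_variants,
                             max_length=60)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Each paraphrase generated at a given temperature is then used to re-annotate the same data, and agreement across the paraphrased prompts (rather than across repeated runs of one prompt) gives the inter-prompt score at that temperature.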

Results and Implications

The authors deploy their framework across six different datasets, evaluating twelve different outcomes and generating more than 150,000 classified data rows. The resulting analyses showcase several key findings:

  1. High Intra-Prompt Stability: For most datasets, the models achieve high average intra-prompt stability scores (typically above 0.8 KA), indicating that given a consistent prompt, LLMs provide reliable annotations.
  2. Variable Inter-Prompt Stability: Stability decreases with increasing semantic divergence of prompts. Scenarios involving ambiguous or complex constructs (e.g., populist language in tweets) exhibit lower stability, suggesting that prompt specificity and clarity are paramount.
  3. Practical Recommendations: The authors provide best practices to enhance prompt stability, such as iterating on prompt design by evaluating paraphrased variations before committing to a specific prompt format (a workflow sketched after this list).
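
A minimal sketch of that recommendation, combining the two helpers above, could look like the following. It again assumes a user-supplied `classify(prompt, text)` function and placeholder inputs; it is not the actual PromptStability package API.

```python
# Sketch: screen candidate paraphrase temperatures before committing to a prompt.
# Reuses the hypothetical `paraphrase` helper defined above.
import numpy as np
import krippendorff


def inter_prompt_stability(classify, prompt, texts, temperature, n_variants=5):
    """Annotate `texts` once per paraphrased prompt and score agreement
    across paraphrases with Krippendorff's Alpha."""
    variants = [prompt] + paraphrase(prompt, temperature=temperature,
                                     n_variants=n_variants)
    runs = np.array([[classify(v, text) for text in texts] for v in variants],
                    dtype=float)
    return krippendorff.alpha(reliability_data=runs,
                              level_of_measurement="nominal")


# Example screening loop (my_classify, my_prompt, my_texts are placeholders
# the researcher supplies):
# for temp in (0.5, 1.0, 2.0, 3.0):
#     print(temp, inter_prompt_stability(my_classify, my_prompt, my_texts, temp))
```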

Future Directions

This work has significant implications for both practical applications and theoretical understanding of LLM-based text annotation:

  1. Enhanced Model Training: By diagnosing prompt instability, researchers can refine LLMs for more robust performance in various domains.
  2. Field-Specific Adaptations: Future research might explore how prompt stability varies across different types of LLMs and domains, further expanding the utility of the authors’ framework.
  3. Validation Criteria: This approach offers a blueprint for establishing new validation protocols to preempt issues like those observed in the replication crises of other scientific fields.

Conclusion

The authors of this paper provide a comprehensive, generalizable framework for assessing prompt stability in LLM-driven text annotation. By adapting well-established reliability metrics to the context of prompt engineering, they offer a critical tool for improving the robustness and reproducibility of LLM-based classification systems. The PromptStability Python package, developed as part of this work, promises to be a valuable resource for researchers engaged in computational text analysis, ensuring that future applications of LLMs are both reliable and replicable. This systematic approach introduces a much-needed layer of diagnostic rigor into the burgeoning field of AI-driven text annotation.
