Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models (2407.10645v1)

Published 15 Jul 2024 in cs.CL, cs.AI, and cs.CY

Abstract: LLMs have recently been applied to text annotation tasks from social sciences, equalling or surpassing the performance of human workers at a fraction of the cost. However, no inquiry has yet been made on the impact of prompt selection on labelling accuracy. In this study, we show that performance greatly varies between prompts, and we apply the method of automatic prompt optimization to systematically craft high quality prompts. We also provide the community with a simple, browser-based implementation of the method at https://prompt-ultra.github.io/ .

Analyzing the Impact of Prompt Selection on Text Annotations Using LLMs

The paper "Prompt Selection Matters: Enhancing Text Annotations for Social Sciences with Large Language Models," authored by Louis Abraham, Charles Arnal, and Antoine Marie, presents a detailed study of the influence of prompt selection on the accuracy of text annotation tasks performed with LLMs. As LLMs such as OpenAI's GPT-3.5 Turbo have shown remarkable efficacy in automated text annotation, the paper highlights the necessity of optimizing prompts to achieve robust results in the social sciences.

Introduction to Text Annotation in Social Sciences

Text annotation has traditionally involved manual classification by human experts or crowd workers, making the process both time-consuming and costly. With the advent of LLMs, automated text annotation has become a feasible alternative, offering significant advantages in terms of speed and cost-efficiency. LLMs have demonstrated high levels of accuracy in various annotation tasks, such as detecting political bias or emotional tone in text.
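To make the workflow concrete, below is a minimal sketch of LLM-based annotation through the OpenAI chat API. The label set, prompt wording, and example message are illustrative and not taken from the paper.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(text: str) -> str:
    """Ask the model for a single-word label for one message."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep labelling as deterministic as possible
        messages=[{
            "role": "user",
            "content": (
                "Classify the political leaning of the following message as "
                "'liberal', 'conservative', or 'neutral'. Answer with one word.\n\n"
                f"Message: {text}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

print(annotate("We should raise the minimum wage nationwide."))
```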

Investigating Prompt Selection

Despite the promising performance of LLMs, the paper identifies a crucial yet underexplored factor: the variation in accuracy induced by different prompt formulations. To quantify the impact of prompt selection, the authors systematically examine both manually crafted and automatically optimized prompts across several standard text annotation tasks in social sciences.

Experimental Setup

The paper evaluates multiple prompt types on diverse datasets, including:

  • TweetEval (TE) covering hate speech detection, emotion recognition, sentiment analysis, and offensive language detection.
  • Tweet Sentiment Multilingual (TML-sent) involving sentiment classification in multiple languages.
  • Article Bias Prediction (AS-pol) with labels representing political inclinations.
  • Liberals vs Conservatives on Reddit (LibCon) for detecting political leanings in Reddit posts.

Handcrafted Prompts vs. Automatic Prompt Optimization

The paper explores five different handcrafted prompt formulations, illustrated by the template sketch after this list:

  1. Simple - Minimalist and direct prompts.
  2. Explanations - Prompts enriched with additional explanatory context.
  3. Examples - Prompts providing specific examples of correctly classified messages.
  4. Roleplay - Prompts asking the LLM to answer while roleplaying as a political analyst.
  5. Chain of Thoughts (CoT) - Prompts that include a step-by-step reasoning approach.
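The sketch below paraphrases what these five styles might look like as templates for a sentiment task; the wording is illustrative and does not reproduce the paper's actual prompts.

```python
# Illustrative templates for the five handcrafted prompt styles (paraphrased,
# not the paper's exact wording). {text} is the message to annotate.
PROMPT_STYLES = {
    "simple": (
        "Classify the sentiment of the following message as positive, "
        "negative, or neutral.\n\nMessage: {text}\nLabel:"
    ),
    "explanations": (
        "Sentiment classification assigns one of three labels: 'positive' means "
        "the author expresses approval or joy, 'negative' means criticism or "
        "anger, and 'neutral' means neither.\n\nMessage: {text}\nLabel:"
    ),
    "examples": (
        "Classify the sentiment of the message.\n"
        "Example: 'I love this!' -> positive\n"
        "Example: 'This is awful.' -> negative\n\n"
        "Message: {text}\nLabel:"
    ),
    "roleplay": (
        "You are an experienced political analyst annotating social media data. "
        "Classify the sentiment of the following message as positive, negative, "
        "or neutral.\n\nMessage: {text}\nLabel:"
    ),
    "cot": (
        "Classify the sentiment of the following message. First reason step by "
        "step about the author's tone, then give the final label on its own "
        "line.\n\nMessage: {text}"
    ),
}

prompt = PROMPT_STYLES["roleplay"].format(text="Taxes went up again...")
```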

Additionally, the paper assesses the effectiveness of Automatic Prompt Optimization (APO), where an LLM iteratively rephrases and evaluates prompts to identify the best-performing version.
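A simplified, greedy version of this idea is sketched below, assuming candidate prompts are scored by accuracy on a small labelled development set; the paper's exact procedure for proposing and selecting candidates may differ.

```python
from openai import OpenAI

client = OpenAI()

def call_llm(message: str, temperature: float = 0.0) -> str:
    """Single-turn call to the model; returns the text of the reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content.strip()

def accuracy(prompt_template: str, dev_set: list[tuple[str, str]]) -> float:
    """Fraction of labelled dev examples the prompt classifies correctly."""
    hits = sum(
        gold in call_llm(prompt_template.format(text=text)).lower()
        for text, gold in dev_set
    )
    return hits / len(dev_set)

def optimize_prompt(seed: str, dev_set, n_rounds: int = 5, n_candidates: int = 4) -> str:
    """Greedy search: rephrase the current best prompt, keep the best scorer."""
    best, best_score = seed, accuracy(seed, dev_set)
    for _ in range(n_rounds):
        for _ in range(n_candidates):
            candidate = call_llm(
                "Rewrite the following annotation instruction so that it is "
                "clearer and more precise. Keep the placeholder {text} intact.\n\n"
                + best,
                temperature=1.0,  # sample diverse rewrites
            )
            if "{text}" not in candidate:
                continue  # discard rewrites that dropped the placeholder
            score = accuracy(candidate, dev_set)
            if score > best_score:
                best, best_score = candidate, score
    return best
```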

Results and Discussion

The results underscore significant variability in accuracy depending on the prompt used. For most tasks, the discrepancy between the highest and lowest accuracies achieved by different handcrafted prompts was considerable, highlighting the necessity of careful prompt selection.

Automatic Prompt Optimization demonstrated consistently high performance across all tasks, often surpassing the best handcrafted prompts. This suggests that APO can effectively identify high-quality prompts without requiring extensive manual tuning.

Implications and Future Directions

The findings have important implications for researchers in social sciences and developers of LLM-based applications. Proper prompt optimization can greatly enhance the accuracy and reliability of automated text annotation, making it a viable replacement for traditional methods.

Future research could explore additional methods to further refine prompt optimization. This includes testing whether LLMs can provide robust justifications for their classifications or associating confidence scores with their labels to enable targeted human review. Addressing issues related to the training data of LLMs, such as potential biases and the impact of updates, is also crucial for maintaining the replicability and fairness of annotation tasks.
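As a hypothetical illustration of the confidence-score idea (not something implemented in the paper), one could ask the model to return a label together with a self-reported confidence and route low-confidence items to human annotators; note that such self-reported scores are not calibrated probabilities.

```python
import json
from openai import OpenAI

client = OpenAI()

CONFIDENCE_PROMPT = (
    "Classify the political leaning of the message as 'liberal', "
    "'conservative', or 'neutral', and report your confidence between 0 and 1. "
    'Respond only with JSON of the form {"label": "...", "confidence": 0.0}.\n\n'
    "Message: "
)

def annotate_with_confidence(text: str, threshold: float = 0.8) -> dict:
    """Return the model's label, flagging the item for human review when the
    self-reported confidence falls below the threshold."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": CONFIDENCE_PROMPT + text}],
    )
    # May raise if the model deviates from the requested JSON format.
    result = json.loads(response.choices[0].message.content)
    needs_review = result["confidence"] < threshold
    return {
        "label": None if needs_review else result["label"],
        "needs_human_review": needs_review,
        "confidence": result["confidence"],
    }
```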

Conclusion

This paper effectively illustrates the significance of prompt selection in the automatic annotation of text using LLMs. The proposed method of automatic prompt optimization not only simplifies the process but also routinely achieves high accuracy, thereby enhancing the efficacy of LLMs in social science research. The authors have provided a practical tool for the community, a browser-based implementation available at https://prompt-ultra.github.io/, facilitating the use of optimized prompts in various text annotation tasks.

Ultimately, this paper sets the stage for further advancements in automated text annotation, with potential applications extending beyond social sciences to any domain reliant on large-scale text classification.
