Generative LLMs offer a compelling alternative to traditional human annotation methods, promising speed, cost reduction, and reproducibility. However, the paper "Automated Annotation with Generative AI Requires Validation" (Pangakis et al., 2023) highlights that LLM performance on text annotation tasks is highly variable depending on factors like prompt quality, data characteristics, and task complexity. Critically, the paper argues that this variability necessitates rigorous validation against human-labeled data before deploying LLMs for automated annotation.
The paper proposes a five-step workflow to effectively and reliably integrate LLMs into the annotation process:
1. Create Task-Specific Instructions (Codebook): The researcher defines the concepts and rules for annotation in a clear codebook, which serves as the prompt for the LLM.
2. Human Annotation of a Subset: Subject matter experts (ideally high-quality annotators rather than crowdsourced workers, for this step) label a random subset of the text samples using the codebook. The size of this subset depends on the task and class imbalance (roughly 250-1250 samples are recommended, potentially more for rare classes).
3. LLM Annotation and Initial Evaluation: The LLM annotates the same human-labeled subset using the same codebook, and its output is evaluated against the human labels using metrics like accuracy, precision, recall, and F1 score (see the sketch after this list).
4. Codebook Refinement (if needed): If the LLM's performance on the subset is low, the researcher analyzes the misclassifications and refines the codebook instructions to address consistent errors, a human-in-the-loop process similar to prompt engineering. The LLM annotation and evaluation (step 3) can then be repeated with the updated codebook, though the authors note that one round often suffices and that substantial codebook changes may warrant re-labeling by the human annotators (step 2).
5. Final LLM Performance Test: Using the final refined codebook, the LLM annotates the remaining human-labeled samples (held-out data). Performance on this held-out set determines whether the LLM is suitable for the specific task and dataset.
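As a rough illustration of the evaluation in steps 3 and 5 (not the authors' released package), comparing LLM labels against human labels with standard scikit-learn metrics could look like the following; the function name, binary labels, and parallel-list format are assumptions made for the sketch.

```python
# Hypothetical evaluation of LLM annotations against human labels (steps 3 and 5).
# `human_labels` and `llm_labels` are parallel lists over the same subset of samples.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_llm_annotations(human_labels, llm_labels):
    """Return accuracy, precision, recall, and F1 for binary LLM annotations."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        human_labels, llm_labels, average="binary", zero_division=0)
    return {
        "accuracy": accuracy_score(human_labels, llm_labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy example: 5 samples, human labels vs. LLM labels
print(evaluate_llm_annotations([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```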
To validate this workflow, the authors replicated 27 distinct annotation tasks across 11 non-public datasets sourced from recent social science articles. They used GPT-4 via the OpenAI API. The results showed significant heterogeneity in LLM performance. While the median F1 score across all tasks was a promising 0.707, performance varied wildly, with F1 scores ranging from 0.059 to 0.969. Notably, nine out of 27 tasks had either precision or recall below 0.5, and some datasets exhibited large performance variations across different tasks. This confirms the core argument: validation is essential because good performance on one task or dataset does not guarantee good performance on others.
The paper also introduces the concept of a "consistency score" as a useful tool within this workflow. The LLM classifies each text sample multiple times (e.g., seven times) at a temperature setting greater than 0 (e.g., 0.6) to induce some randomness, and the consistency score is calculated as the proportion of those classifications that match the modal prediction for that sample.
The formula for the consistency score of a single sample with $K$ repeated classifications $\hat{y}_1, \dots, \hat{y}_K$ is:

$$\text{Consistency} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left(\hat{y}_k = \hat{y}_{\text{mode}}\right)$$

where $\mathbb{1}(\cdot)$ is the indicator function and $\hat{y}_{\text{mode}}$ is the most frequent classification in $\{\hat{y}_1, \dots, \hat{y}_K\}$.
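As a minimal illustration (again, not the authors' released software), the per-sample score can be computed from the repeated LLM labels in a few lines; the function name and plain-list input are assumptions for this sketch.

```python
from collections import Counter

def consistency_score(labels):
    """Proportion of repeated classifications that match the modal label.

    `labels` holds the K predictions from repeated LLM passes over one sample.
    """
    if not labels:
        raise ValueError("Need at least one classification.")
    _, modal_count = Counter(labels).most_common(1)[0]
    return modal_count / len(labels)

# Example: seven passes at temperature 0.6 over a single text sample
print(consistency_score([1, 1, 1, 0, 1, 1, 1]))  # 6/7 ≈ 0.857 (imperfect consistency)
```

Samples scoring below 1.0 are natural candidates for the human review discussed next.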
The authors found a strong positive correlation between consistency scores and classification accuracy (including True Positive Rate and True Negative Rate). Samples labeled with perfect consistency (score of 1.0) were significantly more likely to be correct (19.4 percentage points higher accuracy on average) than those with imperfect consistency. This makes the consistency score a practical tool for identifying potentially difficult "edge cases" that might warrant human review even when using LLM annotation at scale.
Regarding the codebook refinement step (prompt engineering), the analysis showed that updates generally led to modest improvements in performance, primarily driven by increased precision. While not a guaranteed fix for poor performance, this step can help ensure that suboptimal instructions are not the primary cause of LLM errors.
Based on the validation performance, the paper outlines four potential use cases for LLM-augmented annotation:
- Confirming Human Label Quality: If you already have human-labeled data, compare LLM labels to human labels. High agreement suggests good quality in both; low agreement indicates issues in one or both.
- Identifying Cases for Prioritized Human Review: Use consistency scores to flag samples for human review. If the LLM achieves high recall but lower precision, use it to find potential positive cases, which humans can then verify (reducing the overall manual workload).
- Producing Labeled Data for Supervised Classifiers: If LLM performance is satisfactory, the LLM can generate a large volume of labeled data to train or fine-tune a smaller, potentially more efficient, supervised classifier for the main corpus (see the sketch after this list).
- Classifying the Entire Corpus Directly: If the LLM performs exceptionally well on the held-out validation set, it can be used to label the entire remaining dataset directly.
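As a sketch of the third use case, assuming the validated LLM labels and their texts are available as Python lists, a lightweight proxy classifier could be fit along these lines; the TF-IDF plus logistic regression choice is illustrative, not the paper's prescription.

```python
# Illustrative only: train a small supervised classifier on LLM-generated labels,
# then use it to label the rest of the corpus cheaply.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_proxy_classifier(texts, llm_labels):
    """Fit a TF-IDF + logistic regression pipeline on LLM-annotated texts."""
    X_train, X_val, y_train, y_val = train_test_split(
        texts, llm_labels, test_size=0.2, random_state=42)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    # Sanity-check the proxy model against held-out LLM labels
    print(classification_report(y_val, clf.predict(X_val)))
    return clf

# Usage (hypothetical variable names):
#   clf = train_proxy_classifier(llm_annotated_texts, llm_labels)
#   corpus_labels = clf.predict(remaining_corpus)
```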
For practical implementation, the authors mention making easy-to-use Python software available (linked in the paper) that implements this workflow. The cost for their validation paper (annotating over 200,000 samples with GPT-4, involving multiple passes for consistency scores) was approximately \$420 USD, demonstrating the potential cost-effectiveness compared to extensive manual annotation. The time taken was also relatively low (2-3 hours for 1,000 samples across multiple passes).
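For a rough per-unit sense of that figure: \$420 spread over more than 200,000 annotated samples works out to roughly \$420 / 200,000 ≈ \$0.002 per sample, even with the repeated passes needed for consistency scores.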
In summary, the paper presents a practical workflow for leveraging LLMs for text annotation, emphasizing that task-specific validation against human labels is non-negotiable due to performance variability. It provides empirical evidence for this variability and introduces tools like consistency scores to improve the reliability of LLM outputs in real-world applications. The recommended use cases offer different strategies for integrating LLMs based on the validation results, allowing researchers and practitioners to make informed decisions about when and how to apply this technology effectively and responsibly.