Traditional text annotation is a labor-intensive, time-consuming process prone to human bias and inconsistency, posing significant challenges as datasets grow in size and complexity. This paper explores leveraging large language models (LLMs) to automate text annotation, proposing a novel Rationale-Driven Collaborative (RDC) few-shot prompting method to improve annotation quality and efficiency.
The core idea of the RDC method is to simulate a collaborative deliberation process among multiple LLMs. Instead of having LLMs generate outputs independently, RDC employs a sequential approach where LLMs build upon the outputs and reasoning of their predecessors. This contrasts with methods like Universal Self-Consistency (USC), which aggregates independent outputs (as depicted in Figure 1).
The methodology involves several key components:
- LLM Reasoning Process: Each LLM is prompted to generate not only the annotation answer ($a$) but also a supporting rationale ($r$). The first round captures the model's thinking process as $(a_1, r_1) = \mathrm{LLM}(p_1(q))$, where $q$ is the input query or text needing annotation and $p_1$ is the initial prompt.
- Collaborative Framework: In each subsequent round ($i > 1$), the LLM receives the original text statement ($q$) together with the annotation ($a_{i-1}$) and the rationale ($r_{i-1}$) from the previous round. This information is incorporated into the prompt $p_i$, allowing the current LLM to leverage prior insights.
- Output Restriction and Error Mitigation: To maintain efficiency and prevent the accumulation of errors from potentially flawed earlier rounds or an excessively long context, the method restricts the reference to only the most recent collaborative annotation output, so the output of the $i$-th round is $(a_i, r_i) = \mathrm{LLM}(p_i(q, a_{i-1}, r_{i-1}))$. This design aims to improve quality without manual intervention or complex prompting structures like Chain-of-Thought within each round.
- Similar Example Matching: To further enhance the process, previously annotated examples similar to the current text ($q$) can be included in the prompt $p_i$. The paper uses the Top-5 examples ranked by cosine similarity to provide context and guidance to the LLM (a retrieval sketch follows this list).
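As an illustration of the similar-example matching component, the sketch below retrieves the Top-5 most similar previously annotated examples using sentence embeddings and cosine similarity. The embedding model (`all-MiniLM-L6-v2`) and the data layout are assumptions made for the example, not the paper's exact setup.

```python
# Hedged sketch of Top-k example retrieval by cosine similarity; the embedding
# model and the pool format are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def top_k_similar(query_text, labeled_pool, k=5):
    """Return the k annotated examples most similar to query_text.

    labeled_pool: list of dicts such as {"text": ..., "label": ...}.
    """
    pool_emb = embedder.encode([ex["text"] for ex in labeled_pool],
                               normalize_embeddings=True)
    query_emb = embedder.encode([query_text], normalize_embeddings=True)[0]
    sims = pool_emb @ query_emb  # cosine similarity via normalized dot product
    top_idx = np.argsort(-sims)[:k]
    return [labeled_pool[i] for i in top_idx]
```

The retrieved examples are then formatted into the few-shot portion of the prompt used in each round.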
The RDC method was evaluated against various baseline prompting techniques (Zero-Shot, Few-Shot, Chain-of-Thought, Universal Self-Consistency, and their similarity-aware variants) using LLMs of several scales (Qwen1.5 7B, 14B, and 72B; LLaMA3 8B and 70B) across four benchmark datasets: SST-2 (binary sentiment), SST-5 (fine-grained sentiment), AG News (topic classification), and DBPedia (ontology classification). These datasets span varying levels of complexity and task types (Table 1).
Experimental results (Table 2) show that RDC consistently outperforms the baseline methods across most models and datasets, particularly for more complex tasks like SST-5, AG News, and DBPedia. For instance, on SST-2, the average accuracy for RDC was 86.8%, compared to the next best (FS-simi and USC-simi) at 84.6%. On AG News, RDC averaged 78.5%, slightly ahead of USC-simi (76.7%). This suggests that the rationale-driven collaborative approach effectively improves annotation quality by enabling LLMs to refine their outputs iteratively. The paper also notes that incorporating similarity-based examples doesn't always guarantee superior performance over randomly selected ones and that larger models generally perform better, but RDC provides performance boosts across different model sizes. Additional experiments explored the impact of collaboration rounds, finding that accuracy generally improved with more rounds, reaching strong performance after 3 rounds (Figure 3 in the appendix).
In practice, implementing RDC involves setting up a pipeline that performs multiple rounds of inference for each text sample. For a given text:
- Round 1: Send the text to the LLM with a basic prompt including the task instructions and, optionally, few-shot examples (random or similarity-matched). Request both the annotation and the rationale, and store this output $(a_1, r_1)$.
- Round 2: Send the original text, the instruction, and the Round 1 output to the LLM. Request a new annotation and rationale $(a_2, r_2)$, explicitly asking the LLM to consider, and refine if necessary, the previous output.
- Subsequent Rounds (up to round $n$): Repeat the process, sending the original text, the instruction, and the output of round $i-1$ to obtain $(a_i, r_i)$.
- Final Annotation: After $n$ rounds (e.g., 3 or more, based on the empirical results), use the final annotation $a_n$ as the label for the text. A minimal end-to-end sketch follows this list.
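Putting the rounds together, the loop can be sketched as below. The prompt wording, the output parsing, and the `call_llm` function are placeholders standing in for whatever inference stack is used, not the paper's exact implementation.

```python
# Hedged sketch of the RDC annotation loop; prompt wording, parsing, and
# call_llm are illustrative placeholders, not the paper's implementation.

def build_prompt(text, examples, prev=None):
    """Compose a round prompt: instructions, few-shot examples, and, for
    rounds > 1, only the previous round's annotation and rationale."""
    parts = ["Classify the text and explain your reasoning.",
             "Answer with 'Label: <label>' and 'Rationale: <rationale>'."]
    for ex in examples:
        parts.append(f"Text: {ex['text']}\nLabel: {ex['label']}")
    if prev is not None:  # reference only the most recent round to limit context
        parts.append(f"A previous annotator said:\nLabel: {prev['label']}\n"
                     f"Rationale: {prev['rationale']}\n"
                     "Consider this answer and refine it if needed.")
    parts.append(f"Text: {text}")
    return "\n\n".join(parts)

def parse_output(raw):
    """Rough parsing of 'Label: ... Rationale: ...' style output."""
    label = raw.split("Label:")[-1].split("Rationale:")[0].strip()
    rationale = raw.split("Rationale:")[-1].strip()
    return {"label": label, "rationale": rationale}

def rdc_annotate(text, examples, call_llm, n_rounds=3):
    prev = None
    for _ in range(n_rounds):
        raw = call_llm(build_prompt(text, examples, prev))
        prev = parse_output(raw)  # keep only the latest (annotation, rationale)
    return prev["label"]
```

Because only the latest annotation-rationale pair is carried forward, the prompt size stays roughly constant across rounds.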
This iterative process incurs higher computational costs than single-pass methods because it requires multiple inference calls per sample. However, the paper argues this trade-off is acceptable for achieving higher annotation quality, which is critical for downstream model performance. The use of vLLM and GPUs (such as the A800) for inference suggests that optimizing inference throughput is crucial for practical deployment. Because the method passes along only the previous round's output (annotation plus rationale), context windows stay manageable compared with approaches that carry the entire dialogue history.
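For throughput, the calls within a round can be batched across samples; a minimal sketch using vLLM's offline generation API (the checkpoint and sampling settings below are assumptions, not the paper's configuration) could look like this:

```python
# Hedged sketch: batching one collaboration round across many samples with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-7B-Chat")        # any supported checkpoint
params = SamplingParams(temperature=0.0, max_tokens=256)

def run_round(prompts):
    """Run one RDC round for a whole batch of prompts in a single generate call."""
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]
```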
The RDC method provides a robust framework for leveraging LLMs in text annotation workflows, offering a promising direction for automating data labeling while aiming for high quality and consistency, especially for complex classification tasks.