
Enhancing Text Annotation through Rationale-Driven Collaborative Few-Shot Prompting (2409.09615v1)

Published 15 Sep 2024 in cs.CL and cs.AI

Abstract: The traditional data annotation process is often labor-intensive, time-consuming, and susceptible to human bias, which complicates the management of increasingly complex datasets. This study explores the potential of LLMs as automated data annotators to improve efficiency and consistency in annotation tasks. By employing rationale-driven collaborative few-shot prompting techniques, we aim to improve the performance of LLMs in text annotation. We conduct a rigorous evaluation of six LLMs across four benchmark datasets, comparing seven distinct methodologies. Our results demonstrate that collaborative methods consistently outperform traditional few-shot techniques and other baseline approaches, particularly in complex annotation tasks. Our work provides valuable insights and a robust framework for leveraging collaborative learning methods to tackle challenging text annotation tasks.

Traditional text annotation is a labor-intensive, time-consuming process prone to human bias and inconsistency, posing significant challenges as datasets grow in size and complexity. This paper explores leveraging LLMs to automate text annotation, proposing a novel Rationale-Driven Collaborative (RDC) few-shot prompting method to improve annotation quality and efficiency.

The core idea of the RDC method is to simulate a collaborative deliberation process among multiple LLMs. Instead of having LLMs generate outputs independently, RDC employs a sequential approach where LLMs build upon the outputs and reasoning of their predecessors. This contrasts with methods like Universal Self-Consistency (USC), which aggregates independent outputs (as depicted in Figure 1).

The methodology involves several key components:

  1. LLM Reasoning Process: Each LLM is prompted to generate not only the annotation answer ($A$) but also a supporting rationale ($R$). This initial step captures the model's reasoning: $(A, R) = \text{LLM}(Q)$, where $Q$ is the input query or text needing annotation.
  2. Collaborative Framework: In subsequent rounds ($i > 1$), the LLM receives the original text statement ($S$), the annotation ($A_{i-1}$), and the rationale ($R_{i-1}$) from the previous round. This information is incorporated into the prompt $P_i = (S, A_{i-1}, R_{i-1})$, allowing the current LLM to leverage prior insights.
  3. Output Restriction and Error Mitigation: To maintain efficiency and prevent the accumulation of errors from potentially flawed past rounds or excessively long context, the method restricts the reference to only the most recent collaborative annotation output. The output for the $i$-th round is $(A_i, R_i) = \text{LLM}(P_i)$. This design aims to improve quality without manual intervention or complex prompting structures such as Chain-of-Thought within each round.
  4. Similar Example Matching: To further enhance the process, previously annotated examples similar to the current text ($E_{sim}$) can be included in the input prompt, $E_{input} = (E, E_{sim})$. The paper uses the Top-5 examples ranked by cosine similarity to provide context and guidance to the LLM (a minimal retrieval sketch follows below).
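
For the similar-example step, a minimal retrieval sketch in Python is shown below. The paper does not specify its embedding model, so the sentence-transformers encoder and the function name `top_k_similar` are illustrative assumptions; only the Top-5 cosine-similarity ranking comes from the paper.

```python
# Hedged sketch: Top-5 similar-example retrieval by cosine similarity.
# The encoder choice is an assumption; the paper only specifies Top-5
# cosine-similarity ranking over previously annotated examples.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def top_k_similar(query: str, pool: list[str], k: int = 5) -> list[str]:
    """Return the k texts in `pool` most cosine-similar to `query`."""
    embeddings = encoder.encode([query] + pool, normalize_embeddings=True)
    query_vec, pool_vecs = embeddings[0], embeddings[1:]
    # With L2-normalized embeddings, the dot product equals cosine similarity.
    scores = pool_vecs @ query_vec
    best = np.argsort(scores)[::-1][:k]
    return [pool[i] for i in best]
```

The retrieved examples would then be concatenated into the prompt alongside the target text, forming $E_{input} = (E, E_{sim})$.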

The RDC method was evaluated against various baseline prompting techniques (Zero-Shot, Few-Shot, Chain-of-Thought, Universal Self-Consistency, and their similarity-aware variants) using six different LLMs (Qwen1.5 7B, 14B, 72B and LLaMA3 8B, 70B) across four benchmark datasets: SST-2 (binary sentiment), SST-5 (fine-grained sentiment), AG News (topic classification), and DBPedia (ontology classification). These datasets represent varying levels of complexity and task types (Table 1).

Experimental results (Table 2) show that RDC consistently outperforms the baseline methods across most models and datasets, particularly for more complex tasks like SST-5, AG News, and DBPedia. For instance, on SST-2, the average accuracy for RDC was 86.8%, compared to the next best (FS-simi and USC-simi) at 84.6%. On AG News, RDC averaged 78.5%, slightly ahead of USC-simi (76.7%). This suggests that the rationale-driven collaborative approach effectively improves annotation quality by enabling LLMs to refine their outputs iteratively. The paper also notes that incorporating similarity-based examples doesn't always guarantee superior performance over randomly selected ones and that larger models generally perform better, but RDC provides performance boosts across different model sizes. Additional experiments explored the impact of collaboration rounds, finding that accuracy generally improved with more rounds, reaching strong performance after 3 rounds (Figure 3 in the appendix).

In practice, implementing RDC involves setting up a pipeline that performs multiple rounds of inference for each text sample. For a given text:

  1. Round 1: Send the text to the LLM with a basic prompt containing the task instructions and, optionally, few-shot examples (random or similarity-matched). Request both the annotation and the rationale, and store the output $(A_1, R_1)$.
  2. Round 2: Send the original text, the instruction, and the Round 1 output $(A_1, R_1)$ to the LLM. Request a new annotation and rationale $(A_2, R_2)$, explicitly asking the LLM to consider or refine the previous output.
  3. Subsequent Rounds (up to $N$): Repeat the process, sending the original text, the instruction, and the output from round $i-1$, $(A_{i-1}, R_{i-1})$, to obtain $(A_i, R_i)$.
  4. Final Annotation: After $N$ rounds (e.g., 3 or more, based on the empirical results), use the final annotation $A_N$ as the label for the text. A runnable sketch of this loop follows below.
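
The following Python sketch ties these steps together. It is a minimal illustration, not the paper's exact prompts: `call_llm` is a placeholder for any completion call, and the `Answer:`/`Rationale:` output format is an assumed convention.

```python
# Hedged sketch of the N-round RDC annotation loop described above.
N_ROUNDS = 3  # the paper reports strong performance after 3 rounds

def call_llm(prompt: str) -> str:
    """Placeholder for a real inference call (e.g., vLLM or a hosted API)."""
    raise NotImplementedError

def parse_output(raw: str) -> tuple[str, str]:
    """Split a raw completion into (annotation, rationale).
    Assumes the prompt requests 'Answer: ...' and 'Rationale: ...' lines."""
    answer = raw.split("Answer:")[-1].split("Rationale:")[0].strip()
    rationale = raw.split("Rationale:")[-1].strip()
    return answer, rationale

def rdc_annotate(text: str, instruction: str, examples: str = "") -> str:
    # Round 1: instruction (plus optional few-shot examples) and the text.
    prompt = (f"{instruction}\n{examples}\nText: {text}\n"
              "Respond with 'Answer:' and 'Rationale:' lines.")
    answer, rationale = parse_output(call_llm(prompt))
    # Rounds 2..N: pass only the previous round's annotation and rationale,
    # keeping the context window small and limiting error accumulation.
    for _ in range(2, N_ROUNDS + 1):
        prompt = (f"{instruction}\nText: {text}\n"
                  f"Previous answer: {answer}\nPrevious rationale: {rationale}\n"
                  "Review, then refine. Respond with 'Answer:' and 'Rationale:' lines.")
        answer, rationale = parse_output(call_llm(prompt))
    return answer  # A_N, the final annotation
```

Passing only $(A_{i-1}, R_{i-1})$ rather than the full history mirrors the output-restriction design described earlier.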

This iterative process incurs higher computational cost than single-pass methods because it makes multiple inference calls per sample. However, the paper argues this trade-off is acceptable for achieving higher annotation quality, which is critical for downstream model performance. The use of vLLM and GPUs (such as the A800) for inference suggests that optimizing inference throughput is crucial for practical deployment; one way to do so is to batch prompts per round, as sketched below. The method's reliance on passing only the previous round's output (annotation plus rationale) keeps context windows manageable compared to approaches that carry the entire dialogue history.
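
Because each round is independent across samples, every sample's round-$i$ prompt can be batched into a single vLLM call. A hedged sketch, with the model name and sampling settings as illustrative assumptions:

```python
# Hedged sketch: batching one collaboration round across many samples with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen1.5-7B-Chat")  # illustrative; one of the evaluated families
params = SamplingParams(temperature=0.0, max_tokens=256)

def run_round(prompts: list[str]) -> list[str]:
    """Run one RDC round for a whole batch of samples in a single generate() call."""
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

With $N$ rounds, the pipeline then costs $N$ batched `generate()` calls for the whole dataset rather than $N$ separate calls per sample.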

The RDC method provides a robust framework for leveraging LLMs in text annotation workflows, offering a promising direction for automating data labeling while aiming for high quality and consistency, especially for complex classification tasks.

Authors (3)
  1. Jianfei Wu
  2. Xubin Wang
  3. Weijia Jia