Large Language Models for Data Annotation and Synthesis: A Survey (2402.13446v3)

Published 21 Feb 2024 in cs.CL

Abstract: Data annotation and synthesis generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced LLMs, exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation and synthesis. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, this survey includes an in-depth taxonomy of data types that LLMs can annotate, a comprehensive review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis. Serving as a key guide, this survey aims to assist researchers and practitioners in exploring the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.

This paper, "LLMs for Data Annotation and Synthesis: A Survey" (Tan et al., 21 Feb 2024 ), provides a comprehensive overview of how LLMs can be utilized to automate and enhance the data annotation process. It highlights the increasing cost and labor associated with traditional human annotation methods and positions LLMs as a promising solution to this bottleneck, particularly given their advanced capabilities in understanding and generating human-quality text.

The survey is structured around three core areas: LLM-Based Data Annotation, Assessing LLM-generated Annotations, and Learning with LLM-generated Annotations.

LLM-Based Data Annotation:

This section details various methodologies for using LLMs to generate annotations. The primary approach involves leveraging the LLM's in-context learning or few-shot capabilities through carefully designed prompts.

  • Prompting: This is the most common method. By providing the LLM with clear instructions and a few examples of the desired annotation task, the model can often generate labels for unseen data. Different prompting strategies exist, including zero-shot (instructions only), few-shot (instructions plus examples), and chain-of-thought prompting (asking the LLM to explain its reasoning process). Implementing this typically means framing the annotation task as a text generation problem: the input is the data to be annotated (e.g., a text snippet) and the output is the annotation in the desired format (e.g., a label, a structured JSON object, or a corrected/rephrased text), as in the sketch below.
    # Example: few-shot prompt for sentiment annotation.
    # Illustrative sketch; any chat-completion API (OpenAI, Anthropic, etc.) can be substituted.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    prompt = """Task: Classify the sentiment of the following reviews as Positive, Negative, or Neutral.

    Review: This product is amazing! I love it.
    Sentiment: Positive

    Review: It was okay, nothing special.
    Sentiment: Neutral

    Review: I hated it. The quality was terrible.
    Sentiment: Negative

    Review: {text_to_annotate}
    Sentiment:"""

    def annotate_sentiment(review_text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name; use any model available to you
            messages=[{"role": "user", "content": prompt.format(text_to_annotate=review_text)}],
            temperature=0,  # deterministic, reproducible labels
        )
        return response.choices[0].message.content.strip()
  • Fine-tuning: While full fine-tuning is less common for generating annotations than prompting, fine-tuning smaller models on a limited amount of high-quality, human-annotated data for a specific domain or task can improve annotation performance. This approach requires more computational resources than prompting but can yield more specialized and potentially higher-quality outputs for specific domains or complex annotation schemes.
  • Specific Annotation Tasks: LLMs can be applied to a wide range of annotation tasks, including text classification (sentiment, topic), sequence labeling (Named Entity Recognition, Part-of-Speech tagging), relation extraction, summarization, translation, and data synthesis (generating new data points based on certain criteria). The prompt design needs to be tailored to the specific task requirements.
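
To make the output-format tailoring concrete, the sketch below prompts an LLM to return Named Entity Recognition annotations as JSON that can be parsed programmatically; the entity schema, prompt wording, and the `llm_complete` helper are illustrative assumptions rather than specifics from the paper.

    # Hypothetical structured-output prompt for Named Entity Recognition
    import json

    ner_prompt = """Task: Extract named entities from the sentence below.
    Return a JSON list of objects with keys "text" and "type",
    where "type" is one of PERSON, ORG, or LOC. Return [] if there are none.

    Sentence: {sentence}
    Entities:"""

    def annotate_entities(sentence: str, llm_complete) -> list:
        # `llm_complete` is any function mapping a prompt string to a completion string
        raw = llm_complete(ner_prompt.format(sentence=sentence))
        try:
            return json.loads(raw)  # e.g., [{"text": "Alice", "type": "PERSON"}]
        except json.JSONDecodeError:
            return []  # treat malformed output as "no entities" (or flag it for review)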

Assessing LLM-generated Annotations:

Evaluating the quality and reliability of LLM-generated annotations is crucial before using them to train downstream models.

  • Evaluation Metrics: Standard NLP evaluation metrics such as accuracy, precision, recall, F1-score, and inter-annotator agreement (e.g., Cohen's Kappa, Krippendorff's Alpha) can be computed by comparing LLM outputs against a small gold-standard set of human annotations (a minimal sketch follows this list).
  • Human Evaluation: Human review is essential, especially for subjective tasks or to identify nuanced errors that metrics might miss. A workflow might involve using LLMs for a first pass of annotation, followed by human annotators reviewing and correcting the LLM outputs.
  • Consistency Checks: Evaluating the consistency of LLM outputs across similar inputs or with variations in prompts helps gauge reliability. Techniques like asking the LLM to justify its decision or using different prompts for the same item and checking for agreement can be employed.
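
A minimal sketch of such an assessment, assuming the LLM labels and a small gold-standard set are available as parallel lists (scikit-learn is used for the metrics; the toy labels are illustrative):

    # Compare LLM-generated labels against a small gold-standard set of human annotations
    from sklearn.metrics import classification_report, cohen_kappa_score

    gold_labels = ["Positive", "Negative", "Neutral", "Positive"]   # human annotations
    llm_labels  = ["Positive", "Negative", "Positive", "Positive"]  # LLM annotations of the same items

    # Per-class precision, recall, F1, and overall accuracy
    print(classification_report(gold_labels, llm_labels))

    # Agreement between the LLM and the human annotator (Cohen's Kappa)
    print("Kappa:", cohen_kappa_score(gold_labels, llm_labels))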

Learning with LLM-generated Annotations:

LLM-generated annotations can serve as a source of training data for downstream models. Several strategies address the potential noise or biases in these annotations.

  • Training with Noisy Labels: LLM annotations, while abundant, can contain noise. Techniques designed for learning with noisy labels can be applied, such as robust loss functions, noise modeling, or filtering/weighting instances based on predicted label quality or confidence scores from the LLM (a minimal filtering sketch follows this list).
  • Knowledge Distillation: LLMs can act as 'teachers' to train smaller, more efficient 'student' models. The LLM generates soft labels or rationales, which are then used to train the student model. This allows deploying smaller models while leveraging the knowledge of the larger LLM.
    # Pseudocode for knowledge distillation with LLM-generated annotations.
    # Assumptions: `large_LLM` is the teacher, `small_model` is the student,
    # `unlabeled_data` is the dataset to annotate, and `human_data` is an optional
    # small set of human-labeled examples; helper functions are placeholders.

    # Step 1: Generate annotations using the large LLM (teacher)
    llm_annotations = []
    for item in unlabeled_data:
        prompt = create_LLM_prompt(item)  # tailor the prompt to the task
        response = large_LLM(prompt)
        llm_annotations.append((item, parse_LLM_response(response)))  # extract label/rationale

    # Step 2: Combine LLM annotations with human data (optional but recommended);
    # filter or down-weight LLM annotations based on confidence/consistency if possible
    training_data = combine_datasets(human_data, llm_annotations)

    # Step 3: Train the small model (student) on the combined data,
    # using standard supervised learning or noise-robust techniques
    train(small_model, training_data, epochs, learning_rate)

    # Alternatively, if the LLM exposes probabilities/scores, distill on soft targets
    # soft_targets = large_LLM_predict_probs(unlabeled_data)
    # train(small_model, unlabeled_data, soft_targets, distillation_loss)
  • Data Augmentation/Synthesis: LLMs can generate synthetic data instances along with their annotations, expanding the training set, especially for rare classes or specific scenarios. This can improve model generalization.
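
As a concrete instance of the filtering/weighting idea above, the sketch below keeps only annotations whose confidence clears a threshold and exposes the remaining confidences as per-example weights; the annotation record layout and the 0.8 threshold are illustrative assumptions, and the confidence itself could come from LLM self-reports or token probabilities.

    # Filter LLM annotations by confidence and derive per-example loss weights
    CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tune per task

    def filter_and_weight(annotations):
        # `annotations` is a list of dicts: {"item": ..., "label": ..., "confidence": float in [0, 1]}
        kept = [a for a in annotations if a["confidence"] >= CONFIDENCE_THRESHOLD]
        weights = [a["confidence"] for a in kept]  # usable as per-example loss weights downstream
        return kept, weights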

Challenges and Limitations:

The paper also discusses significant challenges.

  • Cost: While cheaper per label than human annotation, using powerful LLMs (especially proprietary ones via APIs) at scale can still be expensive.
  • Quality & Consistency: LLMs can hallucinate, produce factually incorrect or nonsensical outputs, and may exhibit inconsistent labeling behavior. Their performance is highly sensitive to prompt wording and input format.
  • Bias: LLMs inherit biases from their training data, which can be amplified in the generated annotations, leading to biased downstream models.
  • Task Complexity: LLMs may struggle with highly nuanced, domain-specific, or multi-faceted annotation tasks that require deep expertise or complex reasoning.
  • Data Privacy and Security: Sending sensitive or proprietary data to external LLM APIs raises privacy and security concerns.
  • Explainability: It can be difficult to understand why an LLM produced a specific annotation, hindering debugging and trust.

Ethical Considerations:

The use of LLMs for annotation raises ethical questions, including potential job displacement for human annotators and the risk of amplifying societal biases present in training data. Responsible deployment requires mitigating these risks.

Practical Implications:

Implementing LLM-based annotation involves selecting the right LLM (considering cost, capabilities, and privacy), crafting effective prompts through iterative testing and potentially using prompt engineering techniques, building a workflow that might combine LLM annotation with human review for quality control, and choosing appropriate training strategies for downstream models that can handle potential noise in the LLM-generated data. Tools and platforms integrating LLMs into annotation workflows are emerging to facilitate this process.
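
One way to realize such a quality-control workflow is sketched below; `llm_annotate`, the review queue, and the `min_confidence` threshold are hypothetical placeholders rather than components described in the paper.

    # Hypothetical human-in-the-loop workflow: the LLM annotates first,
    # and low-confidence items are routed to human reviewers.
    def annotation_workflow(items, llm_annotate, human_review_queue, min_confidence=0.9):
        # `llm_annotate(item)` returns a (label, confidence) pair
        final_labels = {}
        for item in items:
            label, confidence = llm_annotate(item)
            if confidence >= min_confidence:
                final_labels[item] = label                 # accept the LLM label directly
            else:
                human_review_queue.append((item, label))   # a human verifies or corrects it
        return final_labels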

Overall, the survey positions LLMs as a transformative tool for data annotation, offering scalability and efficiency, while also emphasizing the critical need for careful assessment, robust learning strategies, and addressing inherent challenges and ethical considerations for successful real-world application.

Authors (10)
  1. Zhen Tan (68 papers)
  2. Alimohammad Beigi (6 papers)
  3. Song Wang (313 papers)
  4. Amrita Bhattacharjee (24 papers)
  5. Bohan Jiang (16 papers)
  6. Mansooreh Karami (14 papers)
  7. Jundong Li (126 papers)
  8. Lu Cheng (73 papers)
  9. Huan Liu (283 papers)
  10. Dawei Li (75 papers)