
Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation (2506.03857v1)

Published 4 Jun 2025 in cs.LG and cs.CL

Abstract: Recently, LLMs have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein LLMs are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework CanDist that distills candidate annotations with a Small LLM (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.

Authors (7)
  1. Mingxuan Xia (4 papers)
  2. Haobo Wang (45 papers)
  3. Yixuan Li (183 papers)
  4. Zewei Yu (1 paper)
  5. Jindong Wang (150 papers)
  6. Junbo Zhao (86 papers)
  7. Runze Wu (28 papers)

Summary

An Analytical Overview of "Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation"

The paper, "Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation," presents a significant advancement in automating data annotation with LLMs. It identifies a core weakness of existing LLM-driven annotation strategies: the LLM is forced to commit to a single label for each data point despite inherent uncertainty, which frequently yields erroneous annotations on difficult samples and degrades the quality of the data supplied to downstream applications.

Candidate Annotations: A Strategic Approach

The authors introduce a novel paradigm of candidate annotation, inspired by ambiguity aversion in human behavior. Rather than compelling the LLM to assert a single label in uncertain contexts, the model is directed to output the set of all plausible labels. This strategy markedly increases the likelihood that the true label is contained in the generated set, yielding more reliable annotations. The paradigm is evaluated across various NLP tasks and shows consistently higher inclusion rates of the correct label than single-label prompting.
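To make the idea concrete, here is a minimal sketch of candidate-label prompting. All names here (`CANDIDATE_PROMPT`, `query_llm`, `annotate_with_candidates`) are illustrative assumptions, not the paper's actual prompts or API; see the linked repository for the authors' implementation.

```python
# Hypothetical candidate-annotation prompt: instead of demanding one
# gold label, it invites the LLM to list every label it considers
# plausible when it is uncertain.
CANDIDATE_PROMPT = (
    "Classify the text into the label set: {labels}.\n"
    "If you are uncertain, list ALL labels that could apply, "
    "separated by commas.\n"
    "Text: {text}\n"
    "Labels:"
)

def annotate_with_candidates(text, label_set, query_llm):
    """Return the set of candidate labels the LLM proposes for `text`.

    `query_llm` is any callable mapping a prompt string to the model's
    text completion (an assumed interface, not a real library call).
    """
    response = query_llm(
        CANDIDATE_PROMPT.format(labels=", ".join(label_set), text=text)
    )
    # Keep only tokens that are valid labels; if nothing parses,
    # fall back to the full label set (maximally uncertain).
    candidates = {tok.strip().lower() for tok in response.split(",")}
    candidates &= {lab.lower() for lab in label_set}
    return candidates or {lab.lower() for lab in label_set}
```

A stubbed LLM that answers "positive, neutral" would make the function return `{"positive", "neutral"}`, i.e. a candidate set rather than a single forced guess.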

CanDist Framework: Teacher-Student Collaboration

To operationalize candidate annotations for downstream tasks that require unique labels, the authors propose the CanDist framework. It follows a teacher-student design in which a Small LLM (SLM), the student, is trained by distilling the candidate annotations produced by the teacher LLM. A dynamic distillation mechanism progressively refines the student's predicted label distributions and filters the candidate sets down to specific labels. The design is underpinned by a theoretical justification showing that distilling from candidate label sets is more robust than learning directly from single-label outputs.
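The distillation step can be sketched with a partial-label-style objective: the student is rewarded for concentrating probability mass inside the teacher's candidate set, and the set itself disambiguates as training proceeds. This is a simplified sketch under assumed interfaces, not the paper's exact loss.

```python
import numpy as np

def candidate_distillation_loss(logits, candidate_mask):
    """Partial-label-style loss: maximize the probability mass the
    student assigns to the teacher's candidate set.

    logits:         (batch, num_classes) student scores.
    candidate_mask: (batch, num_classes) 1.0 where the class is in the
                    teacher's candidate set, else 0.0.
    """
    # Numerically stable softmax over classes.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Probability assigned to the candidate set as a whole.
    p_cand = (p * candidate_mask).sum(axis=1)
    # Negative log-likelihood of landing inside the candidate set.
    return float(-np.log(p_cand + 1e-12).mean())
```

Note the behavior at the extremes: if the candidate set covers every class the loss is zero (no constraint), while a singleton candidate set recovers ordinary cross-entropy against that single label; real candidate sets sit in between, which is what gives the method its noise tolerance.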

Numerical Validation and Theoretical Implications

Empirical validation of the CanDist framework spans six text classification tasks, where it consistently outperforms existing baselines. Notably, on tasks such as TREC and DBpedia it approaches the accuracy of fully supervised models fine-tuned on ground-truth labels. These results support the paper's claim that generating candidate annotations and distilling them systematically can substantially improve annotation accuracy and reliability.

Theoretical analysis further substantiates CanDist's methodological claims by demonstrating improved noise tolerance when distilling from candidate sets. This contributes to a broader understanding of label noise challenges in machine learning and asserts the efficacy of collaborative teacher-student models in overcoming such challenges.

Future Directions and Implications

The paper's findings offer substantial implications for both practical applications and theoretical advancements in AI. Practically, CanDist opens avenues for more efficient, less resource-intensive data annotation strategies that can reduce reliance on human annotators, which is particularly vital in large-scale NLP projects. Theoretically, the framework suggests pathways for enhancing learning algorithms through collaborative models, which could be explored in more complex AI systems.

In conclusion, "Prompt Candidates, then Distill" presents a well-founded and empirically validated method to address prevailing challenges in LLM-driven data annotation. Its contribution lies not only in a novel approach to candidate label generation but also in establishing a framework for effective knowledge distillation, setting a foundation for future research in automatic data annotation and beyond.
