Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP (2309.05619v2)

Published 11 Sep 2023 in cs.CL

Abstract: LLMs have demonstrated significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to assess the performance of the LLM on unlabeled production data from time to time to validate it in a real-world setting. Human labeling to assess model error requires considerable expense and time delay. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for LLMs in zero-shot, few-shot, and fine-tuned settings, per our evaluation on the keyphrase extraction (KPE) task. We measure the fidelity of the results by comparing to the true error measured from human-labeled ground truth. We contrast with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show disagreement scores provide a better estimation of model performance, with mean average error (MAE) as low as 0.4% and on average 13.8% better than using silver labels.
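The core idea summarized in the abstract is that the average disagreement among an ensemble of models on unlabeled production data can stand in for an error estimate that would otherwise require human labels. As a rough illustration only, not the authors' exact formulation, the sketch below scores per-document disagreement for keyphrase extraction using mean pairwise Jaccard overlap; the metric choice and the example `preds` data are assumptions made here for clarity.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two keyphrase sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def ensemble_disagreement(predictions):
    """Disagreement for one document: 1 minus the mean pairwise
    Jaccard agreement across the ensemble members' keyphrase sets."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0
    mean_agreement = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_agreement

# Hypothetical example: three ensemble members extract keyphrases
# from the same document; higher score means more disagreement.
preds = [
    ["battery life", "screen quality"],
    ["battery life", "screen"],
    ["battery life", "screen quality", "price"],
]
print(f"disagreement = {ensemble_disagreement(preds):.3f}")
```

Averaging such per-document scores over unlabeled production data yields an estimated error, which the paper evaluates by comparing (via MAE) against the true error computed from human-labeled ground truth.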

Authors (7)
  1. Wei Du (124 papers)
  2. Laksh Advani (2 papers)
  3. Yashmeet Gambhir (2 papers)
  4. Daniel J Perry (4 papers)
  5. Prashant Shiralkar (12 papers)
  6. Zhengzheng Xing (3 papers)
  7. Aaron Colak (4 papers)
Citations (1)