Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 86 tok/s

Gemini 2.5 Pro 49 tok/s Pro

GPT-5 Medium 17 tok/s Pro

GPT-5 High 14 tok/s Pro

GPT-4o 88 tok/s Pro

GPT OSS 120B 471 tok/s Pro

Kimi K2 207 tok/s Pro

2000 character limit reached

"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations (2411.15634v1)

Published 23 Nov 2024 in cs.CL, cs.AI, and stat.AP

Abstract: "Gold" and "ground truth" human-mediated labels have error. The effects of this error can escape commonly reported metrics of label quality or obscure questions of accuracy, bias, fairness, and usefulness during model evaluation. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans. We analyze human labels, GPT model ratings, and transformer encoder model annotations describing the quality of classroom teaching, an important, expensive, and currently only human task. We answer the question of whether such a task can be automated using two LLM architecture families--encoders and GPT decoders, using novel approaches to evaluating label quality across six dimensions: Concordance, Confidence, Validity, Bias, Fairness, and Helpfulness. First, we demonstrate that using standard metrics in the presence of poor labels can mask both label and model quality: the encoder family of models achieve state-of-the-art, even "super-human", results across all classroom annotation tasks. But not all these positive results remain after using more rigorous evaluation measures which reveal spurious correlations and nonrandom racial biases across models and humans. This study then expands these methods to estimate how model use would change to human label quality if models were used in a human-in-the-loop context, finding that the variance captured in GPT model labels would worsen reliabilities for humans influenced by these models. We identify areas where some LLMs, within the generalizability of the current data, could improve the quality of expensive human ratings of classroom instruction.

Collections

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (1)

Michael Hardy

"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations (2411.15634v1)

Collections

Summary

Paper Prompts

Follow-up Questions

Authors (1)

Tweets

Don't miss out on important new AI/ML research

"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations (2411.15634v1)

Collections

Summary

Paper Prompts

Follow-up Questions

Related Papers

Authors (1)

Tweets

Don't miss out on important new AI/ML research