
Transparent Human Evaluation for Image Captioning (2111.08940v2)

Published 17 Nov 2021 in cs.CL and cs.CV

Abstract: We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.
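The abstract's finding about CLIPScore is easier to interpret with the metric's definition in hand. Below is a minimal, hedged sketch of a CLIPScore-style computation (reference-free image-text similarity, following Hessel et al., 2021), not code from this paper. The CLIP checkpoint name, the 2.5 rescaling constant, and the example file name are assumptions for illustration.

```python
# Hedged sketch of CLIPScore-style scoring: scaled cosine similarity between
# CLIP image and caption embeddings. Constants follow the original CLIPScore
# paper (Hessel et al., 2021), not the THumB paper summarized above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clipscore(image_path: str, caption: str) -> float:
    """Return 2.5 * max(cos(image_emb, text_emb), 0) for one image-caption pair."""
    image = Image.open(image_path)
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)  # rescale and clip negative similarities at zero

# Example (hypothetical MSCOCO image file):
# print(clipscore("coco_000000123456.jpg", "A dog catches a frisbee in a park."))
```

Because the score depends only on the image and the candidate caption (no reference captions), it can reward coverage of salient image content, which is consistent with the abstract's observation that CLIPScore is more sensitive to recall than text-only metrics.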

Authors (7)
  1. Jungo Kasai (38 papers)
  2. Keisuke Sakaguchi (44 papers)
  3. Lavinia Dunagan (5 papers)
  4. Jacob Morrison (15 papers)
  5. Ronan Le Bras (56 papers)
  6. Yejin Choi (287 papers)
  7. Noah A. Smith (224 papers)
Citations (39)
