Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation (1712.01765v1)

Published 5 Dec 2017 in cs.CL

Abstract: Rating scales are a widely used method for data annotation; however, they present several challenges, such as difficulty in maintaining inter- and intra-annotator consistency. Best-worst scaling (BWS) is an alternative method of annotation that is claimed to produce high-quality annotations while keeping the required number of annotations similar to that of rating scales. However, the veracity of this claim has never been systematically established. Here for the first time, we set up an experiment that directly compares the rating scale method with BWS. We show that with the same total number of annotations, BWS produces significantly more reliable results than the rating scale.

Citations (169)

Summary

  • The paper demonstrates that Best–Worst Scaling achieves higher consistency with a split‐half reliability of 0.98 compared to 0.95 for rating scales.
  • It reveals that BWS requires only about 30% of the annotations needed by rating scales to reach similar reliability levels in sentiment intensity scoring.
  • The study highlights that BWS maintains robust performance for complex phrases, whereas rating scales exhibit significant reliability drops for negations and modal verbs.

The paper "Best–Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation" presents an experimental comparison between two annotation methods: Best–Worst Scaling (BWS) and Rating Scales (RS). The authors, Svetlana Kiritchenko and Saif M. Mohammad, conduct a meticulous paper to evaluate the reliability of BWS in contrast to the more traditionally employed RS method for sentiment intensity annotation.

Motivation and Context

Rating scales have long been a prevalent method for annotating data across domains, including the social sciences and computational linguistics. However, RS methods face challenges such as inconsistencies among different annotators (inter-annotator) and within the same annotator over time (intra-annotator), scale region bias (annotators disproportionately favoring certain parts of the scale), and fixed granularity. To address these challenges, the authors explore an alternative method: Best–Worst Scaling (BWS).

BWS, derived from comparative annotation methods and rooted in mathematical psychology, asks annotators to assess a set of items (commonly a 4-tuple) and identify the items at the two extremes (best and worst) with respect to a property of interest, such as sentiment intensity. The method promises high-quality annotations while keeping the total number of annotations similar to that required by rating scales, but its efficacy relative to traditional RS had not been systematically validated prior to this paper.
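
To make the method concrete: BWS judgments are typically converted into real-valued scores with a simple counting procedure, where each item's score is the fraction of times it was chosen as best minus the fraction of times it was chosen as worst. The sketch below assumes annotations are stored as (4-tuple, best, worst) triples; the function name and data layout are illustrative, not taken from the paper.

```python
from collections import Counter

def bws_scores(annotations):
    """Convert best-worst annotations into real-valued scores.

    `annotations` is an iterable of (tuple_items, best, worst), where
    tuple_items is the 4-tuple shown to the annotator and best/worst are
    the items the annotator selected. Each item's score is
    (#times best - #times worst) / #appearances, which lies in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Example: two annotations over the same 4-tuple of terms
anns = [
    (("great", "okay", "bad", "awful"), "great", "awful"),
    (("great", "okay", "bad", "awful"), "great", "bad"),
]
print(bws_scores(anns))  # "great" -> 1.0, "bad" -> -0.5, "awful" -> -0.5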

Experimental Design

The authors designed an experiment involving the annotation of 3,207 English terms (including single words and phrases) to compare the effectiveness of RS and BWS. Both methods were applied to generate sentiment intensity scores:

  • Rating Scale: Annotators rated terms on a 9-point scale from -4 (extremely negative) to +4 (extremely positive).
  • Best–Worst Scaling: Annotators evaluated sets of four terms, identifying the most and least positive terms in each set.

For a fair comparison, the same total number of annotations was obtained in both methods.
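
The paper's conclusions do not hinge on any particular tuple-generation scheme, but for concreteness, here is one simple way to pack terms into 4-tuples under the two constraints usually imposed in BWS studies: each term appears in a fixed number of tuples, and no term repeats within a tuple. The function name and parameters are an illustrative sketch, not the authors' exact procedure.

```python
import random

def make_tuples(items, tuples_per_item=2, k=4, seed=0):
    """Randomly pack items into k-tuples so that each item appears in
    exactly `tuples_per_item` tuples and no tuple contains duplicates.

    Shuffle-and-chunk: concatenate `tuples_per_item` shuffled copies of
    the item list, cut it into chunks of k, and retry if any chunk
    happens to contain a repeated item.
    """
    assert (tuples_per_item * len(items)) % k == 0, "pool must divide evenly into k-tuples"
    rng = random.Random(seed)
    while True:
        pool = []
        for _ in range(tuples_per_item):
            order = items[:]
            rng.shuffle(order)
            pool.extend(order)
        chunks = [tuple(pool[i:i + k]) for i in range(0, len(pool), k)]
        if all(len(set(c)) == k for c in chunks):
            return chunks

terms = [f"term{i}" for i in range(8)]
for t in make_tuples(terms):
    print(t)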

Key Findings

  1. Consistency and Reliability: BWS demonstrated significantly higher consistency and reliability than RS. In particular, split-half reliability (SHR), a measure of reproducibility, was notably higher for BWS: with the full annotation set, BWS achieved an SHR of 0.98 versus 0.95 for RS, and BWS needed only about 30% of the annotations to match the reliability that RS reached with the full set (a sketch of the SHR computation follows this list).
  2. Impact of Annotation Quantity: The reliability advantages for BWS were most pronounced when fewer than five annotations per term were available, which is common in NLP projects.
  3. Handling of Complex Phrases: The paper found that BWS maintained its reliability for more linguistically complex items, such as those containing negations and modal verbs, whereas RS exhibited significant reliability drops for such items.
  4. Correlation of Results: Despite differences in the exact scores, the rankings produced by BWS and RS correlated strongly, with correlations consistently above 0.9 for simpler items. For complex phrases, however, the correlations were lower and RS showed greater inconsistency.
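
Split-half reliability, the paper's central reliability measure, is straightforward to compute: randomly split the annotations into two halves, score every term from each half independently, and correlate the two resulting score lists, averaging over repeated random splits. The sketch below uses Spearman rank correlation and 100 trials; those specific choices are assumptions of this sketch rather than details confirmed by the paper.

```python
import random
from scipy.stats import spearmanr

def split_half_reliability(annotations, score_fn, trials=100, seed=0):
    """Estimate split-half reliability: repeatedly split the annotations
    into two random halves, score every item from each half
    independently, and average the Spearman correlation between the two
    score lists. `score_fn` maps a list of annotations to an
    {item: score} dict (e.g. the bws_scores sketch above, or a mean of
    rating-scale responses per term).
    """
    rng = random.Random(seed)
    corrs = []
    for _ in range(trials):
        anns = annotations[:]
        rng.shuffle(anns)
        half = len(anns) // 2
        s1, s2 = score_fn(anns[:half]), score_fn(anns[half:])
        shared = sorted(set(s1) & set(s2))  # items scored in both halves
        rho, _ = spearmanr([s1[i] for i in shared], [s2[i] for i in shared])
        corrs.append(rho)
    return sum(corrs) / len(corrs)
```

Higher values mean the two independent halves of the data reproduce the same term ranking, which is the sense in which the paper reports BWS (0.98) as more reliable than RS (0.95).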

Implications

The findings bolster the proposition that BWS is a more robust method than traditional rating scales for sentiment annotation, especially when resources are constrained or items are linguistically complex. This has significant implications for NLP and other fields that rely on scalable, reliable sentiment analysis. The paper supports broader adoption of BWS in practical applications to improve the quality of annotated datasets.

Conclusion

The authors make a compelling case for adopting BWS in sentiment annotation tasks by demonstrating its superior reliability and consistency compared to RS. They suggest that their findings will encourage broader use of BWS for obtaining high-quality NLP annotations and may prompt a re-evaluation of existing sentiment lexicons generated via rating scales. The authors also provide resources, including scripts and annotated data, to facilitate further exploration and adoption of BWS.