Best–Worst Scaling (BWS)

Updated 10 September 2025
  • Best–Worst Scaling (BWS) is a comparative judgment methodology that forces selections of extreme values in small item groups to produce robust, relative scores.
  • It demonstrates high empirical reliability—with correlations up to 0.88—and efficiency by requiring fewer annotations while reducing annotator cognitive load.
  • BWS is applied in diverse fields such as NLP, perceptual assessment, and multi-criteria decision-making, with recent advances incorporating fuzzy models and automated LLM-based annotations.

Best–Worst Scaling (BWS) is a comparative judgment methodology widely applied for reliable fine-grained measurement of attributes—such as sentiment intensity, emotion strength, and preference ordering. BWS presents respondents with small groups of items and requires selection of the “best” (maximum) and “worst” (minimum) instance relative to a target attribute. Its design, robustness, and diversity of application across annotation tasks and decision-making systems have been deeply characterized in recent research spanning natural language processing, human perceptual assessment, and multi-criteria optimization.

1. Foundations and Core Principles

Best–Worst Scaling (BWS), originally developed for comparative preference measurement, is based on extreme-point sampling within small item groups (“tuples,” most commonly 4-tuples). In each tuple, annotators are tasked with identifying:

  • The item expressing the maximal value of the target attribute (“best”)
  • The item expressing the minimal value (“worst”)

This forced-choice paradigm is distinguished by its ability to elicit sharper relative judgments, sidestepping issues present in direct rating scale protocols:

  • Rating scales depend on absolute value assignment, leading to scale-region bias and inconsistency over time.
  • BWS, by contrast, produces dense comparative relations that naturally support robust ranking and numerical scoring, particularly when absolute intensities are ambiguous.

Scores are computed using the formula:

$$\text{score}(i) = \frac{\#\text{ times } i \text{ chosen as best} \;-\; \#\text{ times } i \text{ chosen as worst}}{\#\text{ times } i \text{ appears}}$$

The output falls within $[-1, 1]$ and is linearly transformed to $[0, 1]$ for unipolar targets (e.g., emotion intensity).
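
As a concrete illustration, the counting score above can be computed directly from a list of annotated trials. The following is a minimal Python sketch; the function and data names are illustrative, not from any cited implementation:

```python
from collections import Counter

def bws_scores(trials, unipolar=False):
    """Compute BWS counting scores from (tuple, best, worst) trials.

    Each trial is (items, best_item, worst_item). An item's score is
    (#times best - #times worst) / #appearances, which lies in [-1, 1];
    it is optionally mapped linearly to [0, 1] for unipolar attributes.
    """
    best, worst, appear = Counter(), Counter(), Counter()
    for items, b, w in trials:
        for it in items:
            appear[it] += 1
        best[b] += 1
        worst[w] += 1
    scores = {it: (best[it] - worst[it]) / appear[it] for it in appear}
    if unipolar:
        # linear shift from [-1, 1] onto [0, 1]
        scores = {it: (s + 1) / 2 for it, s in scores.items()}
    return scores
```

With the `unipolar` flag set, the same counts are mapped onto $[0, 1]$, matching the transform used for targets such as emotion intensity.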

Tuple generation typically follows the Random Maximum-Diversity Selection (RMDS) principle, ensuring each item appears in several tuples and each pair co-occurs at most once. This design increases pairwise diversity and boosts reliability.
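
A greedy rejection-sampling sketch of this tuple-generation constraint (each pair co-occurring at most once, each item appearing in several tuples) might look as follows; the function name and parameters are illustrative, and published RMDS implementations may differ:

```python
import random
from itertools import combinations

def generate_tuples(items, tuples_per_item=2, k=4, seed=0, max_tries=10000):
    """Greedy sketch of diversity-maximizing tuple generation.

    Samples random k-tuples, rejecting any whose item pairs have already
    co-occurred, until each item has appeared in ~tuples_per_item tuples
    (or the retry budget is exhausted).
    """
    rng = random.Random(seed)
    target = len(items) * tuples_per_item // k
    seen_pairs, tuples, tries = set(), [], 0
    while len(tuples) < target and tries < max_tries:
        tries += 1
        cand = tuple(rng.sample(items, k))
        pairs = {frozenset(p) for p in combinations(cand, 2)}
        if pairs & seen_pairs:
            continue  # some pair already co-occurred; reject this tuple
        seen_pairs |= pairs
        tuples.append(cand)
    return tuples
```

By construction, no pair of items appears together in more than one tuple, which is the pairwise-diversity property the RMDS principle targets.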

2. Reliability and Comparative Efficiency

BWS demonstrates strong empirical reliability and efficiency advantages over standard annotation protocols. In sentiment and emotion annotation tasks:

  • Split-half reliability (Pearson, Spearman correlations) is consistently observed between $0.78$ and $0.88$ for BWS-annotated scores, contrasting with lower and less stable reliability from direct rating scales (Mohammad et al., 2017; Kiritchenko et al., 2017).
  • Reproducibility experiments indicate that the BWS score ordering remains remarkably robust regardless of annotator group. Lexicon creation studies confirm Spearman $\rho \geq 0.98$ when splitting annotators between repeated rounds.
  • Efficiency: BWS achieves reliability equal to rating scales with far fewer annotations; $3N$ BWS annotations match the reliability of $10N$ rating-scale responses (Kiritchenko et al., 2017).
  • Annotator cognitive load is distinctly lower—each BWS judgment requires two extreme-point choices rather than fine-grained absolute decisions.

This reliability holds even for linguistically complex items involving negation, modal verbs, or sentiment composition, where rating scales diverge sharply (correlation as low as $\rho = -0.05$ for rating scales, while BWS remains high).
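
The split-half procedure behind these numbers is simple to sketch: split the annotators into two disjoint halves, score the items from each half independently, and correlate the two score lists. A dependency-free Python sketch (helper names are illustrative):

```python
def _ranks(xs):
    """Ranks (1-based), with ties assigned the average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    # Spearman = Pearson on ranks
    return pearson(_ranks(x), _ranks(y))

def split_half_reliability(scores_half_a, scores_half_b):
    """Spearman correlation between item scores computed from two
    disjoint annotator halves over the same item set."""
    items = sorted(scores_half_a)
    return spearman([scores_half_a[i] for i in items],
                    [scores_half_b[i] for i in items])
```

High values of this correlation (here, $0.78$ to $0.88$ in the cited studies) indicate that two independent annotator groups would produce nearly the same item ordering.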

3. Applications in Language, Perception, and Decision-Making

BWS is deployed across several domains, including but not limited to:

a. Natural Language Annotation

  • Emotion and Sentiment Lexicons: Annotation of words and phrases in general English, English Twitter, and Arabic Twitter—constituting the SCL-NMA and related resources—relied on BWS to construct real-valued sentiment association scores. Minimum perceptible sentiment differences were empirically estimated at $0.069$–$0.087$ (Kiritchenko et al., 2017).
  • Emotion Intensity in Tweets: Major datasets of anger, fear, joy, and sadness annotated via BWS serve as gold standards for regression (Mohammad et al., 2017). These scores underpin machine learning systems using features such as word n-grams, character n-grams, word embeddings, and affect lexicons.
  • Automated labeling with LLMs: LLMs (GPT-4, DeepSeek-V3) now reliably perform BWS annotations, offering cost-effective, scalable labeling for continuous, fine-grained targets in emotion, biodiversity quantification, and other regression tasks. Agreement between LLMs and human judgment (Cohen’s $\kappa$) is high (Bagdon et al., 26 Mar 2024, Haider et al., 6 Feb 2025).

b. Multi-Criteria Decision-Making (MCDM)

  • Best-Worst Method (BWM): BWS forms the backbone of BWM, where experts identify “best” and “worst” criteria, supplying comparison vectors for weight estimation. Analytical models—multiplicative, fuzzy sets with $\alpha$-cut intervals, taxicab distance, and logarithmic least squares—ground the mathematical foundation for consistency and interval-weight determination (Csató, 2023; Ratandhara et al., 2023; Ratandhara et al., 26 Aug 2024; Ratandhara et al., 8 Aug 2025).
  • Preference Disaggregation: Best–Worst Disaggregation (BWD) integrates extreme-point comparisons in value function inference, minimizing cognitive load and bias. Extensions to interval-valued inputs further enhance flexibility under uncertainty (Brunelli et al., 16 Oct 2024).
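
For the logarithmic least squares variant of BWM mentioned above, the core computation can be sketched briefly: the best-to-others and others-to-worst comparison vectors define linear equations in the log-weights, which are solved by least squares and exponentiated back to normalized weights. This is a simplified illustration (argument names are assumptions; the cited papers derive closed-form weights and consistency indices that this sketch omits):

```python
import numpy as np

def bwm_lls_weights(best_idx, worst_idx, a_best, a_worst):
    """Logarithmic least squares sketch for the Best-Worst Method.

    a_best[j]  : how strongly criterion `best_idx` is preferred to j
    a_worst[j] : how strongly criterion j is preferred to `worst_idx`
    Each comparison gives a linear equation in log-weights x_j = ln w_j:
        x_best - x_j = ln a_best[j],   x_j - x_worst = ln a_worst[j].
    """
    n = len(a_best)
    rows, rhs = [], []
    for j in range(n):
        if j != best_idx:
            r = np.zeros(n); r[best_idx], r[j] = 1.0, -1.0
            rows.append(r); rhs.append(np.log(a_best[j]))
        if j != worst_idx:
            r = np.zeros(n); r[j], r[worst_idx] = 1.0, -1.0
            rows.append(r); rhs.append(np.log(a_worst[j]))
    # Rank-deficient by one (a constant shift of all x_j); lstsq returns
    # the minimum-norm solution, and normalization fixes the gauge.
    x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    w = np.exp(x)
    return w / w.sum()
```

On perfectly consistent input (where every ratio comes from a single underlying weight vector), the least squares residual is zero and the true weights are recovered exactly.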

c. Perceptual Assessment

  • Audio and Perception (BWSNet): BWS judgments on sound samples are leveraged to construct latent embedding spaces that reflect human perceptual ordering. In the BWSNet framework, trial-wise best/worst relations are encoded as distance inequalities, driving neural network metric learning losses that yield faithful latent modelling of attributes such as social attitude or timbral quality (Veillon et al., 2023).
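
One way a trial's best/worst relations can be encoded as distance inequalities is a hinge loss that pushes the best-worst pair to be the most distant pair in the tuple by some margin. The following is an illustrative formulation, not the exact BWSNet objective:

```python
import numpy as np

def bws_trial_loss(emb, tuple_idx, best, worst, margin=0.1):
    """Hinge-loss sketch of turning one BWS trial into metric-learning
    constraints: every other pair in the tuple should be closer than the
    best-worst pair, by at least `margin`. Violations contribute linearly."""
    def d(i, j):
        return np.linalg.norm(emb[i] - emb[j])
    span = d(best, worst)  # distance between the two extremes
    loss = 0.0
    for i in tuple_idx:
        for j in tuple_idx:
            if i < j and {i, j} != {best, worst}:
                loss += max(0.0, d(i, j) + margin - span)
    return loss
```

In a full model this loss would be summed over trials and minimized with respect to the embedding network's parameters; embeddings that place the judged extremes farthest apart incur zero loss.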

d. Experimental Design in Dialog and Sentiment

  • Iterative Best-Worst Scaling (IBWS): For robust crowdsourced ranking, IBWS applies BWS recursively akin to Quicksort, producing fine-grained orderings even in small samples. Although computationally intensive, IBWS outputs consistently serve as a high-quality target for evaluating alternative annotation strategies (Han et al., 19 Aug 2024).
  • Dialogue System Evaluation: Studies comparing BWS to Likert scales and magnitude estimation in readability/coherence reveal comparable inter-rater consistency, with continuous scales slightly outperforming BWS in reliability (Santhanam et al., 2019). Task design and rater experience are critical confounds.
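
The quicksort-like recursion behind IBWS can be caricatured in a few lines: judge random tuples, route each tuple's best and worst into top and bottom buckets, and recurse on the buckets. Here `judge` stands in for aggregated crowd responses, so this is a toy sketch of the idea rather than the published protocol:

```python
import random

def ibws_rank(items, judge, k=4, seed=0):
    """Toy sketch of Iterative Best-Worst Scaling.

    `judge(tup) -> (best, worst)` abstracts the annotators. Each round
    partitions a pool into best-selected, middle, and worst-selected
    buckets, then recurses quicksort-style to refine the ordering.
    """
    rng = random.Random(seed)

    def rank(pool):
        if len(pool) <= 1:
            return list(pool)
        if len(pool) == 2:
            b, w = judge(tuple(pool))
            return [b, w]
        top, bottom, middle = [], [], []
        shuffled = list(pool)
        rng.shuffle(shuffled)
        for s in range(0, len(shuffled), k):
            tup = shuffled[s:s + k]
            if len(tup) == 1:
                middle.extend(tup)  # leftover item, no judgment possible
                continue
            b, w = judge(tuple(tup))
            top.append(b)
            bottom.append(w)
            middle.extend(it for it in tup if it not in (b, w))
        return rank(top) + rank(middle) + rank(bottom)

    return rank(list(items))
```

With a noise-free judge the global extremes reliably surface at the ends of the ranking; with real crowd responses the ordering is approximate, which is why IBWS repeats and aggregates judgments.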

4. Mathematical Extensions and Theoretical Characterization

BWS research includes rigorous formalization of score aggregation, consistency measures, and random utility model representations:

  • Score Aggregation: For $n$ items and $k$ tuples, each item’s score is the difference between the fractions of times it is chosen as "best" and as "worst". For bipolar attributes such as sentiment, the score lies in $[-1, 1]$; for unipolar targets in emotion detection, it is shifted to $[0, 1]$.
  • Random Utility Models: Best–worst choice probabilities can be expressed as sums over probabilistic rankings, with Block–Marschak polynomials providing an irredundant decomposition. For sets of four or more, additional conditions beyond classical Block–Marschak inequalities are required (Colonius, 15 Aug 2024).
  • Consistency and Ordinal Protection: Analytical work in BWM provides sufficient conditions on pairwise comparison bounds to avoid ordinal violations (i.e., ensuring “best” gets maximal weight and “worst” gets minimal weight) in logarithmic least squares and multiplicative frameworks. Explicit closed-form formulas for weights, consistency indices (CI), and ratios (CR) are derived (Csató, 2023, Ratandhara et al., 2023, Ratandhara et al., 8 Aug 2025).
  • Fuzzy Generalizations: $\alpha$-cut interval models yield more precise weight calculations and allow quantification of the degree of approximation (DoA) for fuzzy input data (Ratandhara et al., 2023).

5. Comparative Evaluation and Practical Challenges

BWS is empirically contrasted with direct rating and ranking approaches:

  • Calibration and Bias: BWS avoids central tendency and region biases unaddressed by conventional scales. Specifically, forced-choice comparisons prevent subjective interpretation of scale points—a persistent problem in Likert or analog slider protocols (Kiritchenko et al., 2017, Han et al., 19 Aug 2024).
  • Efficiency and Cost: For small-scale or robust annotation needs, BWS is favored. For large-scale settings, cost-efficient direct methods (e.g., slider protocols) may suffice, approximating the rankings produced by IBWS with considerable reductions in annotation cost and time (Han et al., 19 Aug 2024).
  • Cognitive Load: Forced-choice paradigms impose a lower cognitive burden than direct numerical rating, especially when distinctions between items are subtle.

6. Implementation Protocols, Extensions, and Limitations

Best–Worst Scaling underpins a spectrum of feature-rich annotation and modeling systems:

  • Tuple Construction: RMDS maximizes diversity, ensuring no pair co-occurs more than once and all items receive adequate representation.
  • Feature Engineering: Downstream regression models use affect lexicons, word and character n-grams, and distributed embeddings to predict BWS-based annotation scores (Mohammad et al., 2017).
  • Automation via LLMs: Prompts specifying roles and explicit task formatting reliably elicit valid best/worst selections from transformer models, supporting scalable annotation (Bagdon et al., 26 Mar 2024, Haider et al., 6 Feb 2025).
  • Metric Learning Extensions: Models such as BWSNet utilize BWS-derived trial relations as constraints on latent space embeddings, optimizing loss functions that preserve ordering and distinction (Veillon et al., 2023).
  • Multiple Solutions and Interval Weights: Analytical frameworks reveal that distance-based BWM models may admit multiple optimal weight sets—selection may involve secondary minimax objectives or solution interval centers (Ratandhara et al., 2023; Ratandhara et al., 26 Aug 2024; Ratandhara et al., 8 Aug 2025).
  • Handling Uncertainty: Interval-valued extensions allow articulation of imprecision in expert judgments, yielding more robust value function determinations (Brunelli et al., 16 Oct 2024).
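
For the LLM automation noted above, a role-and-format prompt for a single tuple might be constructed as in this illustrative template (the wording is hypothetical, not taken from the cited papers):

```python
def build_bws_prompt(attribute, items):
    """Illustrative prompt template asking an LLM for one best/worst
    selection over a tuple, with an explicit role and answer format."""
    lines = [
        f"You are an annotator rating items for: {attribute}.",
        "From the numbered items below, answer with exactly two lines:",
        f"BEST: <number of the item with the MOST {attribute}>",
        f"WORST: <number of the item with the LEAST {attribute}>",
        "",
    ]
    lines += [f"{i + 1}. {text}" for i, text in enumerate(items)]
    return "\n".join(lines)
```

The model's two-line reply can then be parsed back into the (best, worst) counts that the standard BWS scoring formula consumes.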

Methodological challenges persist in scaling tuple-based annotation to large datasets, balancing reliability against cost, and interpreting model outputs when multiple optimal configurations exist. In dialogue system evaluation, continuous scales may provide slightly superior reliability, but BWS remains preferred for relative judgment tasks or where ordinal, fine-grained distinctions are critical.

7. Impact and Future Directions

BWS has catalyzed advances in reliable annotation, affective computing, multi-criteria optimization, and perceptual modeling. It directly motivated the creation of benchmark emotion intensity datasets, fine-grained sentiment lexicons, and robust crowdsourced ranking protocols, and influenced automated annotation via LLMs. The generalization to fuzzy and nonlinear models, incorporation of random utility theoretical constructs, and analytical frameworks for interval-weight calculation suggest ongoing expansion into uncertainty modeling, multi-agent systems, and large-scale automated preference estimation.

Future research areas include further integration in learning-to-rank algorithms, efficient tuple sampling for extremely large datasets, consistency verification protocols, and deeper connections with probabilistic choice modeling. The broad flexibility of BWS—spanning human annotation, automated systems, and mathematical modeling—demonstrates its enduring utility within and beyond the scope of subjective evaluation tasks.