
Comparative Judgment Test Data

Updated 18 January 2026
  • Comparative judgment test data is a collection of pairwise assessments where evaluators select the better option instead of assigning absolute scores.
  • Analysis of CJ data relies on models such as Bradley–Terry, Thurstone, and Elo to estimate latent item quality, reducing bias and improving reliability in global ranking inference.
  • Active learning and optimized experimental designs, such as entropy-driven approaches, boost data efficiency and fairness in diverse applications.

Comparative judgment test data refers to collections of outcomes from human or machine assessments in which the evaluator chooses the better item from a presented pair, rather than assigning absolute scores. This paradigm, originally motivated by the Law of Comparative Judgment (LCJ), underpins a variety of practical, statistical, and theoretical advances in education, psychometrics, evaluation science, fairness measurement, and machine learning. Comparative judgment (CJ) test data are directly used to infer global rankings, measure reliability, improve scaling and fairness, and reduce cognitive effort relative to traditional scoring. The development and analysis of CJ data demand rigorous statistical models and purpose-built experimental designs, as evidenced by recent work across educational assessment, software fairness, active learning, and human annotation research.

1. Comparative Judgment: Principles and Data Structure

CJ test data consists of records for each compared pair $(i, j)$ in the item set, with a binary indicator of which item was preferred by the assessor. This protocol eschews pointwise ratings in favor of iterative pairwise comparisons. For an item set $I = \{1, \ldots, N\}$ and comparison budget $B$, the data can be formalized as a sequence of pairs $G = [(i_1, j_1), \ldots, (i_B, j_B)]$ and corresponding outcomes $W = [w_1, \ldots, w_B]$, with $w_k \in \{i_k, j_k\}$ (Gray et al., 2023). Extensions include triadic "odd-one-out" judgments for similarity analysis (Victor et al., 2023) and comparative test records for fairness metrics (Xi et al., 11 Jan 2026).

Key properties:

  • Pairwise choices rather than absolute scores
  • Binary or ordinal outcomes (“A is better than B”)
  • Sparse adjacency structures in the resulting win/loss graph
  • Amenable to aggregation via statistical ranking models
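
To make this record structure concrete, the following sketch encodes a handful of CJ records (item and judge identifiers are illustrative) and aggregates them into the sparse win/loss adjacency structure described above.

```python
from collections import defaultdict

# A toy CJ dataset: each record stores the presented pair and the preferred item.
# Item identifiers and judge IDs are illustrative placeholders.
comparisons = [
    {"pair": ("essay_17", "essay_23"), "winner": "essay_17", "judge": "j_9"},
    {"pair": ("essay_23", "essay_31"), "winner": "essay_31", "judge": "j_4"},
    {"pair": ("essay_17", "essay_31"), "winner": "essay_17", "judge": "j_9"},
]

# Aggregate into a sparse win-count adjacency structure: wins[a][b] = times a beat b.
wins = defaultdict(lambda: defaultdict(int))
for rec in comparisons:
    a, b = rec["pair"]
    loser = b if rec["winner"] == a else a
    wins[rec["winner"]][loser] += 1

for winner, losers in wins.items():
    print(winner, dict(losers))
```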

2. Canonical Models for CJ Data

The analysis of CJ data typically employs the Bradley–Terry (BT), Bradley–Terry–Luce (BTL), or Thurstone models, each modeling the probability an item $i$ is preferred to item $j$:

  • BT Model: $P(i \succ j) = \frac{\gamma_i}{\gamma_i + \gamma_j}$, where $\gamma_i$ is item $i$'s latent quality (Gray et al., 2022, Gray et al., 2023, Kim et al., 2024). Parameter estimation proceeds via minorisation–maximisation (MM) updates.
  • Thurstone Model: Assumes underlying normal latent variables $X_i \sim N(\mu_i, \sigma_i^2)$ and models the preference probability via $P(X_i > X_j)$ (Shah et al., 2014).
  • Elo Rating System: Incremental update method adapted from chess; after each comparison, the rating for item $A$ is updated as $R'_A = R_A + K(S_A - E_A)$, with $E_A$ the expected win probability (Gray et al., 2022).
  • Bayesian CJ (BCJ): Places priors (typically Gaussian or Beta) on item abilities, yielding posterior samples of ranks and uncertainty estimates (Gray et al., 17 Mar 2025, Gray et al., 2023).

Comparative judgment modeling allows direct inference of global item rankings, estimation of uncertainty in rank assignments, and probabilistic interpretations of comparative outcomes (Gray et al., 17 Mar 2025).
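
As an illustrative sketch (not the exact procedure of the cited works), the following estimates BT qualities $\gamma_i$ from a list of (winner, loser) records using the standard MM update $\gamma_i \leftarrow W_i \big/ \sum_{j \neq i} n_{ij}/(\gamma_i + \gamma_j)$, where $W_i$ counts item $i$'s wins and $n_{ij}$ counts comparisons between $i$ and $j$.

```python
import numpy as np

def bt_mm(n_items, comparisons, iters=100):
    """Estimate Bradley-Terry qualities gamma via minorisation-maximisation (MM).

    comparisons: list of (winner, loser) index pairs.
    """
    wins = np.zeros(n_items)              # W_i: number of wins per item
    n = np.zeros((n_items, n_items))      # n_ij: number of comparisons per pair
    for winner, loser in comparisons:
        wins[winner] += 1
        n[winner, loser] += 1
        n[loser, winner] += 1

    gamma = np.ones(n_items)
    for _ in range(iters):
        # Denominator terms n_ij / (gamma_i + gamma_j), summed over opponents j.
        denom = n / (gamma[:, None] + gamma[None, :])
        np.fill_diagonal(denom, 0.0)
        gamma = wins / denom.sum(axis=1)
        gamma /= gamma.sum()              # fix the scale (gammas identifiable up to a constant)
    return gamma

# Toy data: item 0 beats 1 twice, 1 beats 2, 2 beats 0.
print(bt_mm(3, [(0, 1), (0, 1), (1, 2), (2, 0)]))
```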

3. Experimental Design and Data Collection Efficiency

Efficiency and statistical power of comparative judgment depend on experimental design—the selection and scheduling of pairs for comparison. Key approaches:

  • Random Pairing: Each pair is sampled uniformly at random, which minimizes selection bias but can be inefficient for large $N$ (Gray et al., 2023).
  • No–Repeat or Least-Seen Pairing: Ensures even coverage; prone to overfitting artifacts if oversampled (Gray et al., 2023).
  • Entropy-Driven Active Learning: Prioritizes pairs with maximal uncertainty in win probability (highest entropy of Beta posterior), optimizing information gain per comparison (Gray et al., 2023).
  • Static Design via Spectral Decomposition: Constructs optimal sampling distribution over pairs by analyzing the covariance of pairwise differences, traditionally via principal component analysis of the design matrix; recent advances use reduced basis decomposition (RBD) for scalability to $N \gg 150$ (Jiang et al., 22 Dec 2025).

RBD achieves two-to-three orders of magnitude speedup and machine-precision accuracy, enabling real-time experiment updates and classroom-scale deployments (452 items in <7 minutes) (Jiang et al., 22 Dec 2025).
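
A minimal sketch of the entropy-driven strategy listed above, assuming each pair's win probability is tracked with a Beta posterior under a uniform Beta(1, 1) prior (an assumption for illustration, not the exact scheme of the cited work): the next pair queried is the one whose posterior entropy is largest.

```python
from itertools import combinations
from scipy.stats import beta

def next_pair(items, wins):
    """Pick the pair whose Beta posterior over 'i beats j' has maximal entropy.

    wins[(i, j)] counts how often i beat j; a Beta(1, 1) prior is assumed.
    """
    best, best_entropy = None, -float("inf")
    for i, j in combinations(items, 2):
        a = 1 + wins.get((i, j), 0)   # prior + wins of i over j
        b = 1 + wins.get((j, i), 0)   # prior + wins of j over i
        h = beta(a, b).entropy()      # differential entropy of the posterior
        if h > best_entropy:
            best, best_entropy = (i, j), h
    return best

# Example: the as-yet-unseen pair ("A", "C") is maximally uncertain and gets selected next.
print(next_pair(["A", "B", "C"], {("A", "B"): 2, ("B", "A"): 1, ("B", "C"): 3}))
```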

4. Reliability, Fairness, and Statistical Power

Comparative judgment data often yield higher inter-rater reliability (IRR) than pointwise scoring, as shown by Krippendorff's $\alpha$:

Task | Categorical $\alpha$ | Comparative $\alpha$ | $\Delta\alpha$
Short answer (10) | 0.66 [0.64, 0.67] | 0.80 [0.78, 0.82] | +0.14
Oral fluency (44) | 0.70 [0.68, 0.73] | 0.78 [0.77, 0.79] | +0.08

(Henkel et al., 2023)

Comparative judgments reduce rater noise, yield faster completion times, and are less susceptible to scale-use bias (Shah et al., 2014). For statistical hypothesis testing (e.g., fairness via "separation" or "equalized odds"), comparative separation criteria over pairwise data are equivalent to pointwise separation in binary tasks, with requisite sample size $n_p \approx 2n$ (since only half of pairs contribute actionable signal) (Xi et al., 11 Jan 2026).

5. Applications Across Domains

CJ test data underpins a variety of high-impact uses, including educational assessment and automated essay scoring, side-by-side evaluation of machine translation, fairness auditing of software systems, psychometric scaling, similarity analysis via odd-one-out judgments, and human annotation for machine learning.

6. Data Formats, Processing, and Integration

Comparative judgment datasets generally consist of atomic records such as:

Item A | Item B | Winner | Judge ID | Task Type
essay_17 | essay_23 | essay_17 | j_9 | CJ (AES)
pass_A | pass_B | pass_B | ann14 | SxS-MQM (MT)

Common characteristics of these datasets include:

  • Storage as JSON, CSV, or adjacency matrices representing outcomes
  • Joinability to item metadata and rater pools
  • Amenability to aggregation via MM updates, Bayesian sampling, or embedding in ML ranking losses (e.g., LambdaRank) (Henkel et al., 2023)
  • Support for reporting full rank distributions, item-level uncertainty, and audit trails (Gray et al., 17 Mar 2025, Gray et al., 2023)

For multi-criteria assessment (e.g., BCJ), one records comparisons per criterion, enabling granular or holistic aggregation via weighted utility functions (Gray et al., 17 Mar 2025).
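
A minimal sketch of loading and aggregating such records, assuming a hypothetical CSV file cj_records.csv with columns item_a, item_b, winner, judge_id (all names illustrative); BT or Bayesian models would refine the simple win-rate summary computed here.

```python
import pandas as pd

# Hypothetical CSV of comparative judgments, one row per comparison.
df = pd.read_csv("cj_records.csv")  # columns: item_a, item_b, winner, judge_id

# Derive the loser for each record and build a win-count matrix (rows = winners, cols = losers).
df["loser"] = df.apply(lambda r: r.item_b if r.winner == r.item_a else r.item_a, axis=1)
win_matrix = pd.crosstab(df["winner"], df["loser"])

# Simple win-rate ranking as a first-pass summary.
wins = win_matrix.sum(axis=1)
losses = win_matrix.sum(axis=0)
totals = wins.add(losses, fill_value=0)
win_rate = (wins.reindex(totals.index, fill_value=0) / totals).sort_values(ascending=False)
print(win_rate)
```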

7. Limitations, Challenges, and Future Directions

CJ test data, while robust to some biases, still requires careful management of comparison budgets, pair-selection design, assessor workload, and the scalability of ranking inference as item sets grow.

Recent work continues to extend CJ’s reach into active learning, large-scale annotation, and optimization of pairwise experimental design, ensuring its relevance to educational, scientific, and machine learning practitioners.

