MED: Maximized Effectiveness Difference

Updated 27 October 2025
  • MED is a formal framework that quantifies the worst-case divergence between two ranked lists by optimally assigning unknown relevance labels.
  • It leverages established metrics such as nDCG, MAP, and ERR to preserve key mathematical properties and directly reflect user model assumptions.
  • MED supports evaluations with partial or complete relevance judgments, enabling robust comparisons in search systems and decision-making domains.

Maximized Effectiveness Difference (MED) is a formal framework for measuring the worst-case divergence in effectiveness between two ranked outputs, grounded in established evaluation metrics such as nDCG, MAP, and ERR. By quantifying the maximum possible difference in effectiveness scores via optimal (adversarial) assignment of unknown relevance values, MED provides a principled metric for rank similarity that directly reflects user model assumptions, supports partial or complete relevance information, and preserves key mathematical properties desirable for retrieval system comparison.

1. Definition and Motivation

Maximized Effectiveness Difference (MED) is defined as the maximum absolute difference in effectiveness, $|S(A) - S(B)|$, between two ranked lists $A$ and $B$, under a chosen effectiveness measure $S(\cdot)$. For any ranked lists $A = \langle a_1, \ldots, a_K \rangle$ and $B = \langle b_1, \ldots, b_K \rangle$, where $a_i$ and $b_i$ represent relevance values (grades), MED captures the greatest possible disparity in effectiveness that could result from any feasible assignment of the unknown (unjudged, "free") relevance labels.

This approach enables quantifying changes between retrieval runs without requiring complete relevance judgments. MED is "derived" from conventional effectiveness measures, ensuring consistency in how user behavior (for example, rank discounting, early stopping probability) and relevance are interpreted in the context of similarity/difference measurement.

2. Optimization Problem and Solution Strategies

To compute MED, one solves an optimization problem over the free relevance variables, subject to constraints imposed by predetermined judgments and bound variables (cases where the same document occurs in both lists). The categories of variables are:

  • Free variables: No prior relevance judgment; values assigned maximally adversarially ($r_G$ for $A$, $r_0 = 0$ for $B$)
  • Predetermined variables: Fixed from existing judgments
  • Bound variables: Shared documents require matched relevance assignments ($a_n \equiv b_m$ when the same document appears at rank $n$ in $A$ and rank $m$ in $B$)

For linear measures (e.g., nDCG, RBP), the maximization is analytically straightforward: free variables in $A$ are set to the maximal grade, those in $B$ to the minimal grade, and bound variables are optimized respecting rank order and discount functions. For nonlinear measures (e.g., MAP, ERR), the maximization becomes combinatorial or quadratic. MAP's objective, for example, takes the form $Z^\top Q Z + L^\top Z + F$, representing a quadratic 0-1 optimization (weighted max cut, NP-complete), solved heuristically (e.g., by tabu search). ERR's cascade model, being highly nonlinear and dominated by early ranks, is addressed via brute-force search over the top $p$ positions ($p = 5$ typically suffices).
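
As a concrete illustration of the linear case, the following is a minimal sketch that computes MED for an unnormalised DCG-style gain/discount measure at cutoff $k$. The document ids, the integer grade scale $0..\text{max\_grade}$, and the log2 discount are illustrative assumptions; the sketch treats the normaliser as fixed rather than modelling full nDCG, and it is not the paper's reference implementation.

```python
import math

def med_linear(run_a, run_b, judged, max_grade=3, k=20):
    """Worst-case |S(A) - S(B)| for a linear gain/discount measure
    (here an unnormalised DCG@k with log2 rank discount).
    `judged` maps doc id -> known relevance grade; unjudged documents
    (free or bound variables) are assigned adversarially."""
    def discount(rank):                        # rank is 1-based
        return 1.0 / math.log2(rank + 1)

    # Net per-document weight: discount earned in A minus discount earned in B.
    weights = {}
    for i, doc in enumerate(run_a[:k]):
        weights[doc] = weights.get(doc, 0.0) + discount(i + 1)
    for i, doc in enumerate(run_b[:k]):
        weights[doc] = weights.get(doc, 0.0) - discount(i + 1)

    def best_signed_diff(sign):
        """Largest achievable sign * (S(A) - S(B)) over the free assignments."""
        total = 0.0
        for doc, w in weights.items():
            if doc in judged:                  # predetermined variable
                total += sign * w * judged[doc]
            else:                              # free/bound variable: extreme grade
                total += max(0.0, sign * w * max_grade)
        return total

    # MED is the worst case in either direction.
    return max(best_signed_diff(+1), best_signed_diff(-1))
```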

3. Instantiation with Canonical Effectiveness Measures

MED can be specialized for the following canonical measures:

| Effectiveness Measure | MED Implementation Formulation | Notes |
|---|---|---|
| nDCG | $S(C) = \frac{1}{\mathrm{IdealDCG}} \sum_{i=1}^{k} \frac{c_i}{\log(i+1)}$ | Direct, linear, log rank discount; free variables: $r_G$ / $0$ |
| MAP | $S(C) = \frac{1}{R} \sum_{i=1}^{k} \frac{c_i}{i} \sum_{j=1}^{i} c_j$ | Quadratic 0-1; solved heuristically |
| ERR | $S(C) = \sum_{i=1}^{\infty} \frac{c_i}{i} \prod_{j=1}^{i-1} (1 - c_j)$ | Nonlinear, top-weighted; brute force over top $p$ |

For these measures, MED translates their internal user behavior assumptions (e.g., discounting, persistence) directly into rank difference quantification.
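
For reference, the per-list scoring functions in the table can be restated compactly in Python. This is a sketch of the standard definitions, not the MED optimization itself; the ERR variant takes per-rank stopping probabilities $c_i$ directly, as in the table's cascade form, and mapping graded judgments to probabilities (commonly $(2^g - 1)/2^{g_{\max}}$) is left to the caller as an assumption.

```python
import math

def dcg(grades, k=None):
    """DCG with the table's log(i+1) rank discount (base-2 log)."""
    k = k if k is not None else len(grades)
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, all_judged_grades, k=None):
    """nDCG: DCG normalised by the ideal DCG over the judged pool."""
    ideal = dcg(sorted(all_judged_grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

def average_precision(binary_rels, num_relevant):
    """AP (the per-query component of MAP) for 0/1 relevance values."""
    hits, score = 0, 0.0
    for i, rel in enumerate(binary_rels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / num_relevant if num_relevant > 0 else 0.0

def err(stop_probs):
    """ERR in the table's cascade form: c_i are stopping probabilities."""
    p_continue, score = 1.0, 0.0
    for i, c in enumerate(stop_probs, start=1):
        score += p_continue * c / i
        p_continue *= 1.0 - c
    return score
```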

4. Mathematical Properties

MED satisfies the strict axioms of a metric:

  • Non-negativity: $\mathrm{MED}(A,B) \geq 0$
  • Identity of indiscernibles: $\mathrm{MED}(A,A) = 0$
  • Symmetry: $\mathrm{MED}(A,B) = \mathrm{MED}(B,A)$
  • Triangle inequality: $\mathrm{MED}(A,B) \leq \mathrm{MED}(A,C) + \mathrm{MED}(C,B)$

Additionally, MED is top-weighted and defined for indefinite-length rankings, inheriting these properties from the underlying effectiveness measure via rank discounting (e.g., logarithmic, persistence-based). When partial information is available (some relevance judgments predetermined), MED cannot increase relative to the fully adversarial setting and typically decreases. If all relevance judgments are known, MED reduces to the absolute effectiveness difference.

5. Practical Approaches and Experimental Validation

The framework was empirically validated using TREC 2005 Robust Track data. MED distances (e.g., MED-nDCG@20) showed that runs with similar retrieval models or query formulation approaches yielded smaller MED distances than those with substantial algorithmic differences (such as query expansion methods). Further, as partial judgments were incrementally supplied, MED values converged toward the true effectiveness differences, demonstrating robustness to judgment incompleteness.

Comparisons across measures (RBP, nDCG, MAP, ERR) indicated that MED-RBP aligns with rank-biased overlap (RBO), whereas MED-nDCG and MED-ERR reveal the influence of their underlying user models. MED thus adapts its "difference" metric to the behavioral semantics of the chosen effectiveness measure.

6. Translation of User Models and Behavioral Assumptions

A distinguishing feature of MED is its direct inheritance of user behavioral assumptions from the effectiveness measure. For RBP and ERR, for example, the probability of continuing to the next document is encoded in their discount or cascade models. When maximizing MED, free variables are assigned in a manner that yields maximal/minimal user satisfaction—rooted in the same probabilistic model employed by effectiveness scoring.

Thus, MED measures the greatest possible distance between two rankings given the explicit user model, reflecting not just differences in relevance but their downstream impact on real or theoretical user behavior.

7. Handling Relevance Judgments and Partial Information

MED is designed for utility under any degree of relevance information:

  • No judgments: All variables are free; MED is maximized by contrasting $r_G$ vs. $r_0$
  • Partial judgments: Predetermined variables lower the possible MED, reflecting decreased uncertainty about effectiveness difference
  • Complete judgments: MED equals the actual difference in effectiveness scores; "adversarial" assignment unnecessary

This adaptability is crucial for large-scale evaluations where judgments are sparse or expensive and supports incremental updating of MED values as evaluations progress.
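
Using the med_linear sketch from Section 2 (with hypothetical document ids and grades), the three regimes above might look as follows:

```python
run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d2", "d5", "d1", "d6"]

# No judgments: every variable is free, so MED is at its largest.
print(med_linear(run_a, run_b, judged={}))

# Partial judgments: fixing some grades can only shrink MED.
print(med_linear(run_a, run_b, judged={"d5": 0, "d2": 2}))

# Complete judgments: MED collapses to the actual |S(A) - S(B)|.
print(med_linear(run_a, run_b,
                 judged={"d1": 3, "d2": 2, "d3": 0, "d4": 1, "d5": 0, "d6": 0}))
```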

MED establishes a unifying methodology for rank similarity measurement, parameterized by user models and effectiveness scores. It offers interpretable, robust, and judgment-independent quantification of search result divergence with guaranteed metric properties. In biomedical treatment effect estimation and decision-making domains (Aoki et al., 2021, Frauen et al., 19 May 2025), analogous formulations—such as minimizing mean absolute error across estimated effects or maximizing downstream policy impact—reflect the underlying principle of maximizing (or minimizing) effectiveness differences for robust evaluation and optimal decision-making.

A plausible implication is that the MED framework's formalization, metric properties, and user-model inheritance provide a generalizable paradigm for worst-case evaluation and comparison wherever effectiveness is derived from ranking, selection, or treatment assignment.
