
Method Evaluation Model (MEM)

Updated 10 December 2025
  • MEM is a comprehensive framework that evaluates AI/ML models by integrating multiple user-defined criteria for holistic assessment.
  • It employs ordinal ranking and voting methods, including Condorcet and Borda counts, to aggregate heterogeneous performance scores.
  • Its flexible design adapts to various domains, enabling systematic evaluation of attributes such as accuracy, explainability, and fairness.

The Method Evaluation Model (MEM) is a general evaluation framework for comparing predictive and AI/ML models along multiple, user-defined criteria. Originating from prediction competitions in psychology and decision science, MEM enables holistic model assessment by integrating heterogeneous performance indicators—ranging from predictive accuracy to scientific and practical desiderata—into an aggregated ranking. It leverages principles from computational social choice, employing voting rules for principled aggregation of ordinal rankings across diverse criteria. MEM’s flexibility allows for tailored criterion sets, quantification methods, and weighting schemes, making it broadly applicable to both scientific and applied contexts (Harman et al., 18 Mar 2024).

1. Definition and Core Objectives

MEM evaluates a set of candidate models $M = \{M_1, \ldots, M_k\}$ against a suite of criteria $C = \{C_1, \ldots, C_n\}$. Each criterion $C_j$ reflects a dimension relevant to scientific, theoretical, or operational quality. Raw performance scores $s_{ij}$, which may be continuous, discrete, or categorical, are computed for each model-criterion pairing and then mapped to ordinal ranks $r_{ij}$. MEM’s objectives include:

  • Rewarding models not only for predictive accuracy but also for attributes such as parsimony, explainability, fairness, or theoretical alignment.
  • Unifying disparate measures (e.g., continuous, binary, categorical) into a single, holistic ranking.
  • Employing robust aggregation methods from social choice theory (notably Condorcet and Borda rules) to synthesize ordinal rankings.
  • Providing a flexible, domain-adaptable architecture for model assessment (Harman et al., 18 Mar 2024).

2. Taxonomy and Quantification of Evaluation Criteria

MEM organizes evaluation criteria into families, enabling broad comparison across several scientific and applied domains. A representative taxonomy for cognitive and decision models includes:

| Family | Example Criteria | Quantification |
|---|---|---|
| Theoretical | Intuitive understanding; broad scope | Ordinal, subjective |
| Psychological | Realistic knowledge; process assumptions | Categorical, binary |
| Scientific | Parsimony; predictive power; testability | Continuous, ordinal |
| Explainability | Traceability; trust | Binned categories |
| Ethical | Fairness; adverse impact | Continuous, binary |

Quantification requirements are minimal: any criterion $C_j$ must permit ordered ranking of models. Scores $s_{ij}$ may be real-valued (e.g., mean squared error), integer counts (e.g., number of paradigms covered), or binned ratings (e.g., pass/fail). The key condition is the establishment of a strict ordinal structure for each criterion (Harman et al., 18 Mar 2024).
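
As a concrete illustration of these minimal requirements, a criteria specification only needs to record each criterion’s score type and ranking direction. The sketch below is hypothetical; the criterion names, types, and scores are illustrative rather than prescribed by MEM:

# Hypothetical criteria specification: each criterion only needs to induce
# an ordering of models, so heterogeneous score types can coexist.
criteria = {
    "mse":           {"type": "continuous", "higher_is_better": False},
    "parsimony":     {"type": "ordinal",    "higher_is_better": True},   # e.g., fewer free parameters scored higher
    "process_match": {"type": "binary",     "higher_is_better": True},   # pass/fail on process assumptions
    "fairness_gap":  {"type": "continuous", "higher_is_better": False},
}

# Raw scores s_ij for three hypothetical models on the criteria above.
raw_scores = {
    "M1": {"mse": 0.12, "parsimony": 3, "process_match": 1, "fairness_gap": 0.05},
    "M2": {"mse": 0.09, "parsimony": 1, "process_match": 0, "fairness_gap": 0.02},
    "M3": {"mse": 0.15, "parsimony": 2, "process_match": 1, "fairness_gap": 0.08},
}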

3. Formal Model Comparison Structure

The comparison workflow proceeds as follows:

  • For each model $M_i$ and criterion $C_j$, compute the raw score $s_{ij}$.
  • Convert $s_{ij}$ to a rank $r_{ij} \in \{1, \ldots, k\}$ (where $r_{ij} = 1$ denotes best performance for $C_j$).
  • For each criterion, determine whether lower or higher is better and rank accordingly. Ties are resolved via averaging or explicit tie-breaking.

The resulting rank matrix is subject to aggregation via voting-based rules, yielding a robust composite ordering (Harman et al., 18 Mar 2024).
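
A minimal sketch of this ranking step, assuming a NumPy/SciPy environment and tie-averaging as described above (the score matrix and direction flags are illustrative):

import numpy as np
from scipy.stats import rankdata

def rank_matrix(scores, lower_is_better):
    # scores: k x n array of raw scores s_ij; lower_is_better: length-n booleans.
    # Returns ranks r_ij with 1 = best; ties are averaged, as in the text.
    scores = np.asarray(scores, dtype=float)
    k, n = scores.shape
    ranks = np.empty_like(scores)
    for j in range(n):
        col = scores[:, j] if lower_is_better[j] else -scores[:, j]
        ranks[:, j] = rankdata(col, method="average")  # smallest value gets rank 1
    return ranks

# Illustrative 3-model x 2-criterion example (hypothetical numbers).
s = [[0.12, 3],
     [0.09, 1],
     [0.15, 2]]
print(rank_matrix(s, lower_is_better=[True, False]))  # ranks: [[2, 1], [1, 3], [3, 2]]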

4. Aggregation via Voting Rules: Condorcet and Borda Methods

MEM exploits two classical aggregation rules:

  • Condorcet Method: For each model pair $(M_i, M_\ell)$, compute the count $v_{i\ell} = |\{j : r_{ij} < r_{\ell j}\}|$. $M_i$ “beats” $M_\ell$ if $v_{i\ell} > v_{\ell i}$. A Condorcet winner (a model that beats all others in pairwise comparison) is declared the winner outright if it exists.
  • Borda Count: If no Condorcet winner exists, for each model $M_i$, calculate $B(M_i) = \sum_{j=1}^n (k - r_{ij} + 1)$. Ranks can also be weighted via $W(M_i) = \sum_{j=1}^n w_j r_{ij}$, with criterion weights $w_j$. The model with the highest Borda score (or, for the weighted rank sum, the lowest) is selected.

This structure ensures transparent, interpretable outcome aggregation across diverse metrics (Harman et al., 18 Mar 2024).
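
The sketch below applies both rules to a given rank matrix in plain Python; the rank values and function names are illustrative assumptions, not a reference implementation from the paper:

def condorcet_winner(ranks):
    # Return the index of a Condorcet winner, or None if no model beats all others.
    k = len(ranks)
    for i in range(k):
        beats_all = True
        for l in range(k):
            if i == l:
                continue
            v_il = sum(1 for a, b in zip(ranks[i], ranks[l]) if a < b)
            v_li = sum(1 for a, b in zip(ranks[i], ranks[l]) if b < a)
            if not v_il > v_li:
                beats_all = False
                break
        if beats_all:
            return i
    return None

def borda_scores(ranks):
    # B(M_i) = sum_j (k - r_ij + 1); higher is better.
    k = len(ranks)
    return [sum(k - r + 1 for r in row) for row in ranks]

# Example: 3 models x 4 criteria (hypothetical ranks, 1 = best).
r = [[1, 2, 3, 1],
     [2, 1, 1, 2],
     [3, 3, 2, 3]]
print(condorcet_winner(r))  # None: no model wins every pairwise comparison
print(borda_scores(r))      # [9, 10, 5]: the Borda stage selects model index 1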

5. Implementation Workflow and Pseudocode

The standard MEM workflow can be summarized as follows:

Input: Models M = {M₁,…,Mₖ}, Criteria C = {C₁,…,Cₙ}, raw-score functions score(i,j), optional weights w_j

1. For each criterion j=1…n:
   - For each model i=1…k:
        Compute s[i,j] = score(i,j)
   - Derive ranks r[1…k,j] from s[·,j] (1 = best, …, k = worst)

2. Condorcet stage:
   - For each pair (i, ℓ), i ≠ ℓ:
         v[i,ℓ] = |{ j : r[i,j] < r[ℓ,j] }|
   - Find i* such that for all ℓ ≠ i*: v[i*,ℓ] > v[ℓ,i*]
   - If such i* exists:
        Return Winner = M_{i*}

3. Borda stage:
   - For each model i:
        B[i] = sum_{j=1}^n (k − r[i,j] + 1)
   - Select i** = argmax_i B[i]
   - Return Winner = M_{i**}

The model selection pipeline is thus fully deterministic and transparent, permitting customizations in ranking direction, tie-handling, and criterion weights (Harman et al., 18 Mar 2024).
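
For concreteness, a compact Python rendering of this pseudocode is sketched below. It assumes NumPy/SciPy for tie-averaged ranking; the score_fn interface, criterion names, and scores are illustrative assumptions rather than part of the published specification.

import numpy as np
from scipy.stats import rankdata

def mem_select(models, criteria, score_fn, lower_is_better):
    # Sketch of the MEM pipeline: raw scores -> ranks -> Condorcet stage -> Borda stage.
    # score_fn(m, c) returns the raw score of model m on criterion c (user-supplied);
    # lower_is_better maps each criterion to True if smaller raw scores are better.
    k = len(models)

    # 1. Raw scores and per-criterion ranks (1 = best, ties averaged).
    s = np.array([[score_fn(m, c) for c in criteria] for m in models], dtype=float)
    r = np.empty_like(s)
    for j, c in enumerate(criteria):
        col = s[:, j] if lower_is_better[c] else -s[:, j]
        r[:, j] = rankdata(col, method="average")

    # 2. Condorcet stage: a model that wins every pairwise comparison is returned outright.
    for i in range(k):
        if all(np.sum(r[i] < r[l]) > np.sum(r[l] < r[i]) for l in range(k) if l != i):
            return models[i]

    # 3. Borda stage: B(M_i) = sum_j (k - r_ij + 1); the highest total wins.
    borda = (k - r + 1).sum(axis=1)
    return models[int(np.argmax(borda))]

# Illustrative usage with made-up scores (all criteria: lower is better).
scores = {("A", "mse"): 0.10, ("A", "params"): 12, ("A", "fairness_gap"): 0.05,
          ("B", "mse"): 0.12, ("B", "params"): 4,  ("B", "fairness_gap"): 0.03,
          ("C", "mse"): 0.09, ("C", "params"): 30, ("C", "fairness_gap"): 0.08}
winner = mem_select(
    models=["A", "B", "C"],
    criteria=["mse", "params", "fairness_gap"],
    score_fn=lambda m, c: scores[(m, c)],
    lower_is_better={"mse": True, "params": True, "fairness_gap": True},
)
print("Selected model:", winner)  # "B" wins every pairwise comparison, so the Condorcet stage decides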

6. Empirical Applications and Performance

MEM has been demonstrated in large-scale prediction competitions:

  • In the Choice Prediction Competition (CPC, psychology), 25 models were evaluated across 5+ criteria (e.g., predictive mean squared deviation, parsimony, process identifiability) using binary, continuous, and ordinal metrics. Absence of a Condorcet winner led to Borda aggregation, which revealed that simple baseline models could outperform complex ML methods under multi-criteria scrutiny (Harman et al., 18 Mar 2024).
  • In the SIOP Personnel Selection Competition, MEM-inspired ranking on retention/performance and adverse-impact metrics led to a competitive finish (9th out of over 60 algorithms). This underscored the operational competitiveness of explainable, MEM-grounded rules (Harman et al., 18 Mar 2024).

These results substantiate MEM's capacity to reveal trade-offs and nuanced model advantages not apparent under single-metric evaluation.

7. Advantages, Limitations, and Prospects

MEM provides a holistic evaluation architecture, avoiding overemphasis on any single criterion. Advantages include:

  • Cross-criteria integration (scientific, theoretical, practical) in a principled ranking.
  • Incentivization of model design for breadth (e.g., parsimony, interpretability, fairness).
  • Transparent aggregation grounded in well-studied social choice theory.
  • High flexibility in criterion definition, weighting, and quantification.

Potential extensions include adopting additional social-choice voting rules (e.g., Schulze, Ranked Pairs), multi-objective optimization targeting MEM criteria, Pareto-front exploration for non-dominated solutions, and dynamic criterion weighting informed by stakeholder negotiation or domain shifts. A plausible implication is that MEM’s applicability may further expand with integrations into continuous optimization and robust social-choice schemes (Harman et al., 18 Mar 2024).

To implement defensible, multifaceted model choices, practitioners define domain-specific criteria, quantify each model’s raw performance, compute ordinal rankings, and apply Condorcet/Borda aggregation as outlined.


For details, refer to "Multi-Criteria Comparison as a Method of Advancing Knowledge-Guided Machine Learning" (Harman et al., 18 Mar 2024).

References

1. Harman et al. (18 Mar 2024). “Multi-Criteria Comparison as a Method of Advancing Knowledge-Guided Machine Learning.”
