
Method Evaluation Model (MEM)

Updated 10 December 2025
  • MEM is a comprehensive framework that evaluates AI/ML models by integrating multiple user-defined criteria for holistic assessment.
  • It employs ordinal ranking and voting methods, including Condorcet and Borda counts, to aggregate heterogeneous performance scores.
  • Its flexible design adapts to various domains, enabling systematic evaluation of attributes such as accuracy, explainability, and fairness.

The Method Evaluation Model (MEM) is a general evaluation framework for comparing predictive and AI/ML models along multiple, user-defined criteria. Originating from prediction competitions in psychology and decision science, MEM enables holistic model assessment by integrating heterogeneous performance indicators—ranging from predictive accuracy to scientific and practical desiderata—into an aggregated ranking. It leverages principles from computational social choice, employing voting rules for principled aggregation of ordinal rankings across diverse criteria. MEM’s flexibility allows for tailored criterion sets, quantification methods, and weighting schemes, making it broadly applicable to both scientific and applied contexts (Harman et al., 18 Mar 2024).

1. Definition and Core Objectives

MEM evaluates a set of candidate models $M = \{M_1, \ldots, M_k\}$ against a suite of criteria $C = \{C_1, \ldots, C_n\}$. Each criterion $C_j$ reflects a dimension relevant to scientific, theoretical, or operational quality. Raw performance scores $s_{ij}$, which may be continuous, discrete, or categorical, are computed for each model-criterion pairing and then mapped to ordinal ranks $r_{ij}$. MEM’s objectives include:

  • Rewarding models not only for predictive accuracy but also for attributes such as parsimony, explainability, fairness, or theoretical alignment.
  • Unifying disparate measures (e.g., continuous, binary, categorical) into a single, holistic ranking.
  • Employing robust aggregation methods from social choice theory (notably Condorcet and Borda rules) to synthesize ordinal rankings.
  • Providing a flexible, domain-adaptable architecture for model assessment (Harman et al., 18 Mar 2024).

2. Taxonomy and Quantification of Evaluation Criteria

MEM organizes evaluation criteria into families, enabling broad comparison across several scientific and applied domains. A representative taxonomy for cognitive and decision models includes:

| Family | Example Criteria | Quantification |
|---|---|---|
| Theoretical | Intuitive understanding; broad scope | Ordinal, subjective |
| Psychological | Realistic knowledge; process assumptions | Categorical, binary |
| Scientific | Parsimony; predictive power; testability | Continuous, ordinal |
| Explainability | Traceability; trust | Binned categories |
| Ethical | Fairness; adverse impact | Continuous, binary |

Quantification requirements are minimal: any criterion $C_j$ must permit ordered ranking of models. Scores $s_{ij}$ may be real-valued (e.g., mean squared error), integer counts (e.g., number of paradigms covered), or binned ratings (e.g., pass/fail). The key condition is the establishment of a strict ordinal structure for each criterion (Harman et al., 18 Mar 2024).
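
As a concrete illustration of these minimal requirements, a criteria specification only needs to record each criterion’s score type and ranking direction. The sketch below is hypothetical; the criterion names, types, and scores are illustrative rather than prescribed by MEM:

# Hypothetical criteria specification: each criterion only needs to induce
# an ordering of models, so heterogeneous score types can coexist.
criteria = {
    "mse":           {"type": "continuous", "higher_is_better": False},
    "parsimony":     {"type": "ordinal",    "higher_is_better": True},   # e.g., fewer free parameters scored higher
    "process_match": {"type": "binary",     "higher_is_better": True},   # pass/fail on process assumptions
    "fairness_gap":  {"type": "continuous", "higher_is_better": False},
}

# Raw scores s_ij for three hypothetical models on the criteria above.
raw_scores = {
    "M1": {"mse": 0.12, "parsimony": 3, "process_match": 1, "fairness_gap": 0.05},
    "M2": {"mse": 0.09, "parsimony": 1, "process_match": 0, "fairness_gap": 0.02},
    "M3": {"mse": 0.15, "parsimony": 2, "process_match": 1, "fairness_gap": 0.08},
}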

3. Formal Model Comparison Structure

The comparison workflow proceeds as follows:

  • For each model $M_i$ and criterion $C_j$, compute the raw score $s_{ij}$.
  • Convert $s_{ij}$ to a rank $r_{ij} \in \{1, \ldots, k\}$ (where $r_{ij} = 1$ denotes best performance for $C_j$).
  • For each criterion, determine whether lower or higher is better and rank accordingly. Ties are resolved via averaging or explicit tie-breaking.

The resulting rank matrix is subject to aggregation via voting-based rules, yielding a robust composite ordering (Harman et al., 18 Mar 2024).
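
A minimal sketch of this ranking step, assuming a NumPy/SciPy environment and tie-averaging as described above (the score matrix and direction flags are illustrative):

import numpy as np
from scipy.stats import rankdata

def rank_matrix(scores, lower_is_better):
    # scores: k x n array of raw scores s_ij; lower_is_better: length-n booleans.
    # Returns ranks r_ij with 1 = best; ties are averaged, as in the text.
    scores = np.asarray(scores, dtype=float)
    k, n = scores.shape
    ranks = np.empty_like(scores)
    for j in range(n):
        col = scores[:, j] if lower_is_better[j] else -scores[:, j]
        ranks[:, j] = rankdata(col, method="average")  # smallest value gets rank 1
    return ranks

# Illustrative 3-model x 2-criterion example (hypothetical numbers).
s = [[0.12, 3],
     [0.09, 1],
     [0.15, 2]]
print(rank_matrix(s, lower_is_better=[True, False]))  # ranks: [[2, 1], [1, 3], [3, 2]]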

4. Aggregation via Voting Rules: Condorcet and Borda Methods

MEM exploits two classical aggregation rules:

  • Condorcet Method: For each model pair $(M_i, M_\ell)$, compute the count $v_{i\ell} = |\{j : r_{ij} < r_{\ell j}\}|$. $M_i$ “beats” $M_\ell$ if $v_{i\ell} > v_{\ell i}$. A Condorcet winner (a model that beats all others in pairwise comparison) is declared the winner outright if it exists.
  • Borda Count: If no Condorcet winner exists, for each model $M_i$, calculate $B(M_i) = \sum_{j=1}^n (k - r_{ij} + 1)$. Ranks can also be weighted via $W(M_i) = \sum_{j=1}^n w_j r_{ij}$, with criterion weights $w_j$. The model with the highest Borda score (or, for the weighted rank sum, the lowest) is selected.

This structure ensures transparent, interpretable outcome aggregation across diverse metrics (Harman et al., 18 Mar 2024).
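
The sketch below applies both rules to a given rank matrix in plain Python; the rank values and function names are illustrative assumptions, not a reference implementation from the paper:

def condorcet_winner(ranks):
    # Return the index of a Condorcet winner, or None if no model beats all others.
    k = len(ranks)
    for i in range(k):
        beats_all = True
        for l in range(k):
            if i == l:
                continue
            v_il = sum(1 for a, b in zip(ranks[i], ranks[l]) if a < b)
            v_li = sum(1 for a, b in zip(ranks[i], ranks[l]) if b < a)
            if not v_il > v_li:
                beats_all = False
                break
        if beats_all:
            return i
    return None

def borda_scores(ranks):
    # B(M_i) = sum_j (k - r_ij + 1); higher is better.
    k = len(ranks)
    return [sum(k - r + 1 for r in row) for row in ranks]

# Example: 3 models x 4 criteria (hypothetical ranks, 1 = best).
r = [[1, 2, 3, 1],
     [2, 1, 1, 2],
     [3, 3, 2, 3]]
print(condorcet_winner(r))  # None: no model wins every pairwise comparison
print(borda_scores(r))      # [9, 10, 5]: the Borda stage selects model index 1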

5. Implementation Workflow and Pseudocode

The standard MEM workflow can be summarized as follows:

Input: Models M = {M₁,…,Mₖ}, Criteria C = {C₁,…,Cₙ}, raw-score functions score(i,j), optional weights w_j

1. For each criterion j=1…n:
   - For each model i=1…k:
        Compute s[i,j] = score(i,j)
   - Derive ranks r[1…k,j] from s[·,j] (1 = best, …, k = worst)

2. Condorcet stage:
   - For each pair (i, ℓ), i ≠ ℓ:
         v[i,ℓ] = |{ j : r[i,j] < r[ℓ,j] }|
   - Find i* such that for all ℓ ≠ i*: v[i*,ℓ] > v[ℓ,i*]
   - If such i* exists:
        Return Winner = M_{i*}

3. Borda stage:
   - For each model i:
        B[i] = sum_{j=1}^n (k − r[i,j] + 1)
   - Select i** = argmax_i B[i]
   - Return Winner = M_{i**}

The model selection pipeline is thus fully deterministic and transparent, permitting customizations in ranking direction, tie-handling, and criterion weights (Harman et al., 18 Mar 2024).
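
For concreteness, a compact Python rendering of this pseudocode is sketched below. It assumes NumPy/SciPy for tie-averaged ranking; the score_fn interface, criterion names, and scores are illustrative assumptions rather than part of the published specification.

import numpy as np
from scipy.stats import rankdata

def mem_select(models, criteria, score_fn, lower_is_better):
    # Sketch of the MEM pipeline: raw scores -> ranks -> Condorcet stage -> Borda stage.
    # score_fn(m, c) returns the raw score of model m on criterion c (user-supplied);
    # lower_is_better maps each criterion to True if smaller raw scores are better.
    k = len(models)

    # 1. Raw scores and per-criterion ranks (1 = best, ties averaged).
    s = np.array([[score_fn(m, c) for c in criteria] for m in models], dtype=float)
    r = np.empty_like(s)
    for j, c in enumerate(criteria):
        col = s[:, j] if lower_is_better[c] else -s[:, j]
        r[:, j] = rankdata(col, method="average")

    # 2. Condorcet stage: a model that wins every pairwise comparison is returned outright.
    for i in range(k):
        if all(np.sum(r[i] < r[l]) > np.sum(r[l] < r[i]) for l in range(k) if l != i):
            return models[i]

    # 3. Borda stage: B(M_i) = sum_j (k - r_ij + 1); the highest total wins.
    borda = (k - r + 1).sum(axis=1)
    return models[int(np.argmax(borda))]

# Illustrative usage with made-up scores (all criteria: lower is better).
scores = {("A", "mse"): 0.10, ("A", "params"): 12, ("A", "fairness_gap"): 0.05,
          ("B", "mse"): 0.12, ("B", "params"): 4,  ("B", "fairness_gap"): 0.03,
          ("C", "mse"): 0.09, ("C", "params"): 30, ("C", "fairness_gap"): 0.08}
winner = mem_select(
    models=["A", "B", "C"],
    criteria=["mse", "params", "fairness_gap"],
    score_fn=lambda m, c: scores[(m, c)],
    lower_is_better={"mse": True, "params": True, "fairness_gap": True},
)
print("Selected model:", winner)  # "B" wins every pairwise comparison, so the Condorcet stage decides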

6. Empirical Applications and Performance

MEM has been demonstrated in large-scale prediction competitions:

  • In the Choice Prediction Competition (CPC, psychology), 25 models were evaluated across 5+ criteria (e.g., predictive mean squared deviation, parsimony, process identifiability) using binary, continuous, and ordinal metrics. Absence of a Condorcet winner led to Borda aggregation, which revealed that simple baseline models could outperform complex ML methods under multi-criteria scrutiny (Harman et al., 18 Mar 2024).
  • In the SIOP Personnel Selection Competition, MEM-inspired ranking on retention/performance and adverse-impact metrics led to a competitive finish (9th out of over 60 algorithms). This underscored the operational competitiveness of explainable, MEM-grounded rules (Harman et al., 18 Mar 2024).

These results substantiate MEM's capacity to reveal trade-offs and nuanced model advantages not apparent under single-metric evaluation.

7. Advantages, Limitations, and Prospects

MEM provides a holistic evaluation architecture, avoiding overemphasis on any single criterion. Advantages include:

  • Cross-criteria integration (scientific, theoretical, practical) in a principled ranking.
  • Incentivization of model design for breadth (e.g., parsimony, interpretability, fairness).
  • Transparent aggregation grounded in well-studied social choice theory.
  • High flexibility in criterion definition, weighting, and quantification.

Potential extensions include adopting additional social-choice voting rules (e.g., Schulze, Ranked Pairs), multi-objective optimization targeting MEM criteria, Pareto-front exploration for non-dominated solutions, and dynamic criterion weighting informed by stakeholder negotiation or domain shifts. A plausible implication is that MEM’s applicability may further expand with integrations into continuous optimization and robust social-choice schemes (Harman et al., 18 Mar 2024).

To implement defensible, multifaceted model choices, practitioners define domain-specific criteria, quantify each model’s raw performance, compute ordinal rankings, and apply Condorcet/Borda aggregation as outlined.


For details, refer to "Multi-Criteria Comparison as a Method of Advancing Knowledge-Guided Machine Learning" (Harman et al., 18 Mar 2024).

References

1. Harman et al. (18 Mar 2024). “Multi-Criteria Comparison as a Method of Advancing Knowledge-Guided Machine Learning.”
