- The paper presents a novel conceptual engineering framework that systematically justifies AI impact assessment metrics.
- It demonstrates how linking normative functions and context-sensitive conceptions leads to the selection of ethically grounded metrics.
- The framework advocates for transparency and multi-metric dashboards to ensure accountable and effective AI impact evaluations.
Justifying Metrics in AI Impact Assessments: A Conceptual Engineering Framework
Introduction
Responsible AI deployment crucially depends on rigorous impact assessments undergirded by well-chosen metrics. However, the selection of appropriate metrics is non-trivial, especially for ethical and societal concepts such as fairness, sustainability, and safety. The paper "Measuring the right thing: justifying metrics in AI impact assessments" (2504.05007) provides a comprehensive framework, grounded in conceptual engineering, that systematically connects overarching values to justified metrics. The authors contend that metric selection cannot be resolved through purely technical means or exhaustive enumeration alone; instead, it demands explicit justification that aligns with the contextual and normative functions of the concepts at stake.
The Necessity of Justified Metrics
Metric selection in AI impact assessments is consequential, as demonstrated by historical failures to detect harms such as disparate treatment in medical or criminal justice algorithms and negative externalities from optimization (e.g., increased traffic congestion via navigation algorithms). Current practices—such as exhaustive reporting in Model Cards—inevitably fail to adjudicate which among the many possible metrics is most normatively salient. Mathematical impossibility theorems establish that simultaneously optimizing for all competing fairness metrics is infeasible; the decision as to "what matters" is therefore unavoidable and must be explicitly motivated.
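The impossibility result mentioned above can be made concrete with a standard arithmetic illustration (in the style of Chouldechova's result, not an example taken from the paper itself): if two groups have different base rates of the outcome, a classifier with identical error rates across groups cannot also be equally calibrated. The sketch below derives each group's positive predictive value from Bayes' rule; the numbers are hypothetical.

```python
def ppv(base_rate, fpr, fnr):
    """Positive predictive value implied by a group's base rate, FPR, and FNR.

    Follows from Bayes' rule: PPV = p*TPR / (p*TPR + (1-p)*FPR), with TPR = 1 - FNR.
    """
    tpr = 1.0 - fnr
    return base_rate * tpr / (base_rate * tpr + (1.0 - base_rate) * fpr)

# Two groups share identical error rates (FPR = 0.2, FNR = 0.1)
# but differ in base rate (0.30 vs 0.10).
ppv_a = ppv(0.30, 0.2, 0.1)  # ~0.659
ppv_b = ppv(0.10, 0.2, 0.1)  # ~0.333
# Equal error rates force unequal calibration whenever base rates differ,
# so satisfying every fairness metric at once is mathematically unattainable.
```

Because no assignment of error rates can close this gap short of a perfect classifier, some fairness criterion must be deprioritized, which is exactly why the choice has to be explicitly motivated.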
Algorithms can produce seemingly objective metrics, but such measures often fail to capture the full ethical, social, or legal significance of the system's impact. For example, optimizing for demographic parity in loan decisions may harm marginalized groups by increasing unsustainable lending, whereas optimizing for parity in false positive rates more closely tracks welfare outcomes. Responsible metric selection thus requires more than technical rigor; it needs conceptual grounding that is sensitive to both context and value-laden trade-offs.
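To make the contrast between the two metrics concrete, the following sketch computes demographic parity and false positive rate parity on a toy set of loan decisions (the data and function names are illustrative, not drawn from the paper). It shows that a classifier can equalize error burdens across groups while still approving the groups at different rates, so the two metrics genuinely come apart.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-decision rates between the two groups."""
    r0 = y_pred[group == 0].mean()
    r1 = y_pred[group == 1].mean()
    return abs(r0 - r1)

def fpr_gap(y_true, y_pred, group):
    """Absolute difference in false positive rates between the two groups."""
    fprs = []
    for g in (0, 1):
        negatives = (group == g) & (y_true == 0)  # true negatives in group g
        fprs.append(y_pred[negatives].mean())
    return abs(fprs[0] - fprs[1])

# Toy loan decisions: FPR gap is 0 (equal error burden), yet the approval
# rates differ (demographic parity gap is 0.25).
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
```

Which of the two gaps matters is precisely the normative question the framework addresses; the code only makes the divergence visible.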
Conceptual Engineering as the Foundation
The authors deploy conceptual engineering—a philosophical methodology for scrutinizing and revising the concepts we use—as the core analytic tool for metric justification. Central to this approach is the disambiguation of "concepts" (abstract, general notions such as fairness or responsibility) and "conceptions" (concrete, contextual interpretations such as distributive vs. procedural justice). Conceptual engineering operates through:
- Determination of normative function: Explaining the purpose and ethical role of the target concept in societal context.
- Justification of conception: Selecting the context-sensitive instantiation of the concept that fulfills this purpose.
- Decomposition and operationalization: Translating the selected conception into measurable, justifiable metrics via identification of relevant attributes and their weighting.
The framework is explicitly two-phased: from concept to justified conception, and from conception to justified metric. The process may legitimately yield a plurality of metrics or motivate the subdivision of broad concepts into more actionable sub-concepts (e.g., splitting responsibility into causal and moral responsibility for risk assessment in autonomous vehicles).
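The two-phase structure can be sketched as a simple data model that forces each step of the justification to be written down. This is a minimal illustration of the idea, not an implementation from the paper; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Conception:
    """A context-sensitive interpretation of an abstract concept."""
    name: str                # e.g. "Rawlsian distributive justice"
    normative_function: str  # the purpose the concept serves in this context

@dataclass
class MetricJustification:
    """Phase 1: concept -> justified conception; phase 2: conception -> metrics."""
    concept: str                                   # e.g. "fairness"
    context: str                                   # e.g. "consumer loan approvals"
    conception: Conception
    metrics: list[str] = field(default_factory=list)
    rationale: str = ""                            # why these metrics operationalize it

# Hypothetical worked instance for the loan-decision case discussed below.
fairness_for_loans = MetricJustification(
    concept="fairness",
    context="consumer loan approvals",
    conception=Conception(
        name="Rawlsian distributive justice",
        normative_function="protect the well-being of the least advantaged",
    ),
    metrics=["false_positive_parity"],
    rationale="False positives impose unsustainable debt on vulnerable applicants.",
)
```

Recording the conception and rationale alongside the metric, rather than the metric alone, is what makes the selection contestable later.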
Illustrative Case: Fairness Metrics
The framework is concretized using AI fairness as an exemplary domain. The authors analyze two salient conceptions:
- Distributive Justice (Rawlsian): Metrics should tie to the well-being of the least advantaged, emphasizing the impact on marginalized groups. In the context of loan decision algorithms, the framework supports the selection of false positive parity—minimizing costly errors for vulnerable populations—over alternatives like demographic parity, as substantiated by simulation studies (e.g., optimizing demographic parity can backfire, causing net harm) (2504.05007). The Rawlsian conception thus motivates fairness metrics that track substantive outcomes rather than superficial parity.
- Solidarity in Insurance (Baumann and Loi): The proper conception of fairness depends on the institutional function (risk pooling vs. actuarial fairness). In health insurance, risk solidarity motivates fixed premiums, with fairness gauged by affordability and access. In car insurance, chance solidarity demands sufficiency; premiums must correspond to group-level expected damages, not parity across groups. This precludes demographic parity as a relevant fairness metric when risk profiles diverge.
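The two solidarity conceptions above imply different premium schedules for the same population. The sketch below is a schematic illustration with invented numbers, not the authors' model: under risk solidarity everyone pays the pooled average, while under chance solidarity each group's premium tracks its own expected damages.

```python
def risk_solidarity_premiums(groups):
    """Health-insurance style: one pooled (fixed) premium for every group."""
    total = sum(p * cost * n for p, cost, n in groups.values())
    members = sum(n for _, _, n in groups.values())
    flat = total / members
    return {g: flat for g in groups}

def chance_solidarity_premiums(groups):
    """Car-insurance style: each group's premium equals its expected damages."""
    return {g: p * cost for g, (p, cost, _) in groups.items()}

# (claim probability, average claim size, group size) -- illustrative numbers
groups = {
    "low_risk":  (0.05, 2000.0, 100),
    "high_risk": (0.10, 2000.0, 100),
}
```

On these numbers, risk solidarity yields a flat premium of 150 for both groups, while chance solidarity yields 100 and 200 respectively. Equal premiums would count as unfair under the sufficiency reading, which is why demographic parity drops out as the relevant metric when risk profiles diverge.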
In both cases, the appropriateness of the metric is not technical or procedural, but follows from well-justified conceptual analysis tied to ethical and functional reasoning. Moreover, the approach highlights that a single metric is rarely sufficient: multidimensional assessments are often normatively necessary, and conflicts between metrics must be addressed transparently.
Implications, Limitations, and Paths Forward
The framework has significant theoretical and practical consequences:
- Transparency and contestability: By demanding explicit articulation of conceptions and associated metrics, impact assessments become more accountable, contestable, and less prone to ad hoc or opportunistic metric selection (e.g., greenwashing).
- Context sensitivity: The approach avoids false universalism; different contexts necessitate distinct metrics, even for the same underlying concepts.
- Multi-metric necessity: The analysis rejects the sufficiency of monolithic evaluation schemes, motivating richer, multi-metric dashboards and the development of composite indices grounded in justified value pluralism.
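A multi-metric dashboard of the kind the last point motivates can be sketched in a few lines: report several group-level metrics side by side rather than collapsing them into one score. This is a minimal illustration under assumed binary labels and two groups; the function names are hypothetical, not an API from the paper.

```python
import numpy as np

def _rate(values, mask):
    """Mean of `values` where `mask` holds; NaN if the slice is empty."""
    return float(values[mask].mean()) if mask.any() else float("nan")

def fairness_dashboard(y_true, y_pred, group):
    """Per-group report of several fairness-relevant rates; no single
    number is privileged as 'the' fairness score."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    report = {}
    for g in np.unique(group):
        m = group == g
        report[int(g)] = {
            "selection_rate": _rate(y_pred, m),           # demographic-parity view
            "fpr": _rate(y_pred, m & (y_true == 0)),      # error-burden view
            "fnr": _rate(1 - y_pred, m & (y_true == 1)),  # missed-benefit view
        }
    return report
```

Keeping the conceptions' metrics separate in the report, instead of averaging them into a composite, is what keeps conflicts between them visible and open to the transparent adjudication the framework calls for.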
However, the framework openly acknowledges difficulties—reaching consensus on conceptions is often contentious, and the translation from abstract values to metrics may be underdetermined or require trade-offs that remain subject to reasonable disagreement. Pragmatically, transparency regarding the conception-metric justification and ongoing engagement with stakeholders are necessary to avoid superficial alignment.
Conclusion
Measuring the impact of AI systems is a fundamentally normative and conceptual exercise, not merely an empirical or technical one. "Measuring the right thing: justifying metrics in AI impact assessments" (2504.05007) demonstrates that a defensible approach to metric selection in AI audits requires a systematic, context-aware rationale anchored in conceptual engineering. By sequencing from abstract values to context-specific conceptions and then to metrics, the framework enables both more rigorous and more ethically legitimate AI impact assessments. Wider adoption of such explicit justification methods could substantially elevate both the trustworthiness and actual societal benefit of AI systems, but will require embedding philosophical reasoning within AI governance and regulatory practice.