- The paper presents a novel conceptual engineering framework that systematically justifies AI impact assessment metrics.
- It demonstrates how linking normative functions and context-sensitive conceptions leads to the selection of ethically grounded metrics.
- The framework advocates for transparency and multi-metric dashboards to ensure accountable and effective AI impact evaluations.
Justifying Metrics in AI Impact Assessments: A Conceptual Engineering Framework
Introduction
Responsible AI deployment crucially depends on rigorous impact assessments undergirded by well-chosen metrics. However, the selection of appropriate metrics is non-trivial, especially for ethical and societal concepts such as fairness, sustainability, and safety. The paper "Measuring the right thing: justifying metrics in AI impact assessments" (2504.05007) provides a comprehensive framework, grounded in conceptual engineering, that systematically connects overarching values to justified metrics. The authors contend that metric selection cannot be resolved through purely technical means or exhaustive enumeration alone; instead, it demands explicit justification that aligns with the contextual and normative functions of the concepts at stake.
The Necessity of Justified Metrics
Metric selection in AI impact assessments is consequential, as demonstrated by historical failures to detect harms such as disparate treatment in medical or criminal justice algorithms and negative externalities from optimization (e.g., increased traffic congestion via navigation algorithms). Current practices—such as exhaustive reporting in Model Cards—inevitably fail to adjudicate which among the many possible metrics is most normatively salient. Mathematical impossibility theorems establish that simultaneously optimizing for all competing fairness metrics is infeasible; the decision as to "what matters" is therefore unavoidable and must be explicitly motivated.
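The impossibility result mentioned above can be made concrete with a standard arithmetic illustration (in the style of Chouldechova's result, not an example taken from the paper itself): if two groups have different base rates of the outcome, a classifier with identical error rates across groups cannot also be equally calibrated. The sketch below derives each group's positive predictive value from Bayes' rule; the numbers are hypothetical.

```python
def ppv(base_rate, fpr, fnr):
    """Positive predictive value implied by a group's base rate, FPR, and FNR.

    Follows from Bayes' rule: PPV = p*TPR / (p*TPR + (1-p)*FPR), with TPR = 1 - FNR.
    """
    tpr = 1.0 - fnr
    return base_rate * tpr / (base_rate * tpr + (1.0 - base_rate) * fpr)

# Two groups share identical error rates (FPR = 0.2, FNR = 0.1)
# but differ in base rate (0.30 vs 0.10).
ppv_a = ppv(0.30, 0.2, 0.1)  # ~0.659
ppv_b = ppv(0.10, 0.2, 0.1)  # ~0.333
# Equal error rates force unequal calibration whenever base rates differ,
# so satisfying every fairness metric at once is mathematically unattainable.
```

Because no assignment of error rates can close this gap short of a perfect classifier, some fairness criterion must be deprioritized, which is exactly why the choice has to be explicitly motivated.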
Algorithms can produce seemingly objective metrics, but such measures often fail to capture the full ethical, social, or legal significance of the system's impact. For example, optimizing for demographic parity in loan decisions may harm marginalized groups by increasing unsustainable lending, whereas optimizing for parity in false positive rates more closely tracks welfare outcomes. Responsible metric selection thus requires more than technical rigor; it needs conceptual grounding that is sensitive to both context and value-laden trade-offs.
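To make the contrast between the two metrics concrete, the following sketch computes demographic parity and false positive rate parity on a toy set of loan decisions (the data and function names are illustrative, not drawn from the paper). It shows that a classifier can equalize error burdens across groups while still approving the groups at different rates, so the two metrics genuinely come apart.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-decision rates between the two groups."""
    r0 = y_pred[group == 0].mean()
    r1 = y_pred[group == 1].mean()
    return abs(r0 - r1)

def fpr_gap(y_true, y_pred, group):
    """Absolute difference in false positive rates between the two groups."""
    fprs = []
    for g in (0, 1):
        negatives = (group == g) & (y_true == 0)  # true negatives in group g
        fprs.append(y_pred[negatives].mean())
    return abs(fprs[0] - fprs[1])

# Toy loan decisions: FPR gap is 0 (equal error burden), yet the approval
# rates differ (demographic parity gap is 0.25).
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
```

Which of the two gaps matters is precisely the normative question the framework addresses; the code only makes the divergence visible.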
Conceptual Engineering as the Foundation
The authors deploy conceptual engineering—a philosophical methodology for scrutinizing and revising the concepts we use—as the core analytic tool for metric justification. Central to this approach is the disambiguation of "concepts" (abstract, general notions such as fairness or responsibility) and "conceptions" (concrete, contextual interpretations such as distributive vs. procedural justice). Conceptual engineering operates through:
- Determination of normative function: Explaining the purpose and ethical role of the target concept in societal context.
- Justification of conception: Selecting the context-sensitive instantiation of the concept that fulfills this purpose.
- Decomposition and operationalization: Translating the selected conception into measurable, justifiable metrics via identification of relevant attributes and their weighting.
The framework is explicitly two-phased: from concept to justified conception, and from conception to justified metric. The process may legitimately yield a plurality of metrics or motivate the subdivision of broad concepts into more actionable sub-concepts (e.g., splitting responsibility into causal and moral responsibility for risk assessment in autonomous vehicles).
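The two-phase structure can be sketched as a simple data model that forces each step of the justification to be written down. This is a minimal illustration of the idea, not an implementation from the paper; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Conception:
    """A context-sensitive interpretation of an abstract concept."""
    name: str                # e.g. "Rawlsian distributive justice"
    normative_function: str  # the purpose the concept serves in this context

@dataclass
class MetricJustification:
    """Phase 1: concept -> justified conception; phase 2: conception -> metrics."""
    concept: str                                   # e.g. "fairness"
    context: str                                   # e.g. "consumer loan approvals"
    conception: Conception
    metrics: list[str] = field(default_factory=list)
    rationale: str = ""                            # why these metrics operationalize it

# Hypothetical worked instance for the loan-decision case discussed below.
fairness_for_loans = MetricJustification(
    concept="fairness",
    context="consumer loan approvals",
    conception=Conception(
        name="Rawlsian distributive justice",
        normative_function="protect the well-being of the least advantaged",
    ),
    metrics=["false_positive_parity"],
    rationale="False positives impose unsustainable debt on vulnerable applicants.",
)
```

Recording the conception and rationale alongside the metric, rather than the metric alone, is what makes the selection contestable later.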
Illustrative Case: Fairness Metrics
The framework is concretized using AI fairness as an exemplary domain. The authors analyze two salient conceptions:
- Distributive Justice (Rawlsian): Metrics should tie to the well-being of the least advantaged, emphasizing the impact on marginalized groups. In the context of loan decision algorithms, the framework supports the selection of false positive parity—minimizing costly errors for vulnerable populations—over alternatives like demographic parity, as substantiated by simulation studies (e.g., optimizing demographic parity can backfire, causing net harm) (2504.05007). The Rawlsian conception thus motivates fairness metrics that track substantive outcomes rather than superficial parity.
- Solidarity in Insurance (Baumann and Loi): The proper conception of fairness depends on the institutional function (risk pooling vs. actuarial fairness). In health insurance, risk solidarity motivates fixed premiums, with fairness gauged by affordability and access. In car insurance, chance solidarity demands sufficiency; premiums must correspond to group-level expected damages, not parity across groups. This precludes demographic parity as a relevant fairness metric when risk profiles diverge.
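The two solidarity conceptions above imply different premium schedules for the same population. The sketch below is a schematic illustration with invented numbers, not the authors' model: under risk solidarity everyone pays the pooled average, while under chance solidarity each group's premium tracks its own expected damages.

```python
def risk_solidarity_premiums(groups):
    """Health-insurance style: one pooled (fixed) premium for every group."""
    total = sum(p * cost * n for p, cost, n in groups.values())
    members = sum(n for _, _, n in groups.values())
    flat = total / members
    return {g: flat for g in groups}

def chance_solidarity_premiums(groups):
    """Car-insurance style: each group's premium equals its expected damages."""
    return {g: p * cost for g, (p, cost, _) in groups.items()}

# (claim probability, average claim size, group size) -- illustrative numbers
groups = {
    "low_risk":  (0.05, 2000.0, 100),
    "high_risk": (0.10, 2000.0, 100),
}
```

On these numbers, risk solidarity yields a flat premium of 150 for both groups, while chance solidarity yields 100 and 200 respectively. Equal premiums would count as unfair under the sufficiency reading, which is why demographic parity drops out as the relevant metric when risk profiles diverge.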
In both cases, the appropriateness of the metric is not technical or procedural, but follows from well-justified conceptual analysis tied to ethical and functional reasoning. Moreover, the approach highlights that a single metric is rarely sufficient: multidimensional assessments are often normatively necessary, and conflicts between metrics must be addressed transparently.
Implications, Limitations, and Paths Forward
The framework has significant theoretical and practical consequences:
- Transparency and contestability: By demanding explicit articulation of conceptions and associated metrics, impact assessments become more accountable, contestable, and less prone to ad hoc or opportunistic metric selection (e.g., greenwashing).
- Context sensitivity: The approach avoids false universalism; different contexts necessitate distinct metrics, even for the same underlying concepts.
- Multi-metric necessity: The analysis rejects the sufficiency of monolithic evaluation schemes, motivating richer, multi-metric dashboards and the development of composite indices grounded in justified value pluralism.
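A multi-metric dashboard of the kind the last point motivates can be sketched in a few lines: report several group-level metrics side by side rather than collapsing them into one score. This is a minimal illustration under assumed binary labels and two groups; the function names are hypothetical, not an API from the paper.

```python
import numpy as np

def _rate(values, mask):
    """Mean of `values` where `mask` holds; NaN if the slice is empty."""
    return float(values[mask].mean()) if mask.any() else float("nan")

def fairness_dashboard(y_true, y_pred, group):
    """Per-group report of several fairness-relevant rates; no single
    number is privileged as 'the' fairness score."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    report = {}
    for g in np.unique(group):
        m = group == g
        report[int(g)] = {
            "selection_rate": _rate(y_pred, m),           # demographic-parity view
            "fpr": _rate(y_pred, m & (y_true == 0)),      # error-burden view
            "fnr": _rate(1 - y_pred, m & (y_true == 1)),  # missed-benefit view
        }
    return report
```

Keeping the conceptions' metrics separate in the report, instead of averaging them into a composite, is what keeps conflicts between them visible and open to the transparent adjudication the framework calls for.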
However, the framework openly acknowledges difficulties—reaching consensus on conceptions is often contentious, and the translation from abstract values to metrics may be underdetermined or require trade-offs that remain subject to reasonable disagreement. Pragmatically, transparency regarding the conception-metric justification and ongoing engagement with stakeholders are necessary to avoid superficial alignment.
Conclusion
Measuring the impact of AI systems is a fundamentally normative and conceptual exercise, not merely an empirical or technical one. "Measuring the right thing: justifying metrics in AI impact assessments" (2504.05007) demonstrates that a defensible approach to metric selection in AI audits requires a systematic, context-aware rationale anchored in conceptual engineering. By sequencing from abstract values to context-specific conceptions and then to metrics, the framework enables both more rigorous and more ethically legitimate AI impact assessments. Wider adoption of such explicit justification methods could substantially elevate both the trustworthiness and actual societal benefit of AI systems, but will require embedding philosophical reasoning within AI governance and regulatory practice.