Externally Valid Selection of Experimental Sites via the k-Median Problem (2408.09187v1)

Published 17 Aug 2024 in econ.EM

Abstract: We present a decision-theoretic justification for viewing the question of how to best choose where to experiment in order to optimize external validity as a k-median (clustering) problem, a popular problem in computer science and operations research. We present conditions under which minimizing the worst-case, welfare-based regret among all nonrandom schemes that select k sites to experiment is approximately equal - and sometimes exactly equal - to finding the k most central vectors of baseline site-level covariates. The k-median problem can be formulated as a linear integer program. Two empirical applications illustrate the theoretical and computational benefits of the suggested procedure.

Summary

The paper proposes selecting experimental sites via the k-median problem to optimize external validity, linking statistical decision theory and experimental design.
It establishes conditions under which minimax-regret optimal site selection corresponds to solving the k-median problem, and it details computational strategies such as integer programming.
Case studies on migration and surveys demonstrate the framework's practical use in policy evaluations and its potential for domain adaptation and econometrics.

Essay on "Externally Valid Selection of Experimental Sites via the k-Median Problem"

This paper presents a novel framework for selecting experimental sites that optimize external validity, formulated as a k-median problem. The authors provide a decision-theoretic justification for viewing site selection as a k-median clustering problem, which connects theoretical insights from statistical decision theory with practical considerations in experimental design.

Core Proposition

The central argument of the paper is that minimizing worst-case, welfare-based regret over nonrandom site selection schemes (purposive sampling) is approximately, and under some conditions exactly, equivalent to solving a k-median problem, a well-known problem in computer science and operations research. The k-median problem involves choosing k facilities (here, experimental sites) that minimize the total connection cost to a set of clients (here, policy-relevant sites). This connection cost is defined by the Euclidean distance between site-level covariates.
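To make the objective concrete, the following is a minimal sketch, not the authors' implementation, that solves a tiny k-median instance by enumerating every subset of k sites. The covariate values are hypothetical, the function names are illustrative, and the sketch assumes that every policy site is also a candidate experimental site and that all sites carry equal weight.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def k_median_cost(experimental, policy_sites):
    """Sum, over policy sites, of the distance to the nearest experimental site."""
    return sum(min(dist(p, e) for e in experimental) for p in policy_sites)


def brute_force_k_median(candidates, policy_sites, k):
    """Enumerate every size-k subset of candidates; return (indices, cost) of the best."""
    best_subset, best_cost = None, float("inf")
    for subset in combinations(range(len(candidates)), k):
        cost = k_median_cost([candidates[j] for j in subset], policy_sites)
        if cost < best_cost:
            best_subset, best_cost = subset, cost
    return best_subset, best_cost


# Hypothetical two-dimensional covariates forming two well-separated groups.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
         (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]

# Assumption: candidate experimental sites and policy sites coincide.
chosen, cost = brute_force_k_median(sites, sites, k=2)
print(chosen, cost)
```

With k = 2, the enumeration picks the middle site of each group (indices 1 and 4), the most central covariate vectors. Enumeration is only practical for very small instances, since the number of subsets grows combinatorially in the number of candidate sites.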

Key Results and Methodological Contributions

The paper establishes conditions under which the minimax-regret objective coincides with the k-median objective. These conditions mainly concern the site-level covariate vectors and their distribution. Notably, the authors demonstrate that when the experimental and policy sites are disjoint and treatment effect heterogeneity across sites is substantial, the k-median solution is exactly minimax-regret optimal.

Theoretical Underpinnings: Through Theorem 1, the paper links external validity concerns in treatment choice frameworks to the k-median problem. Although the k-median problem is NP-hard, the paper provides practical computational strategies based on integer programming.
Algorithmic Implementation: The k-median problem can be written as a linear integer program, which allows the use of powerful off-the-shelf solvers to find optimal site selections. The authors also discuss approximation algorithms that are useful for large datasets.
Empirical Applications: Two case studies underscore the practical implications of the framework. The first focuses on migration corridors in Bangladesh, illustrating how different selection criteria yield distinct site selections. The second examines multi-country survey experiments in Europe, showcasing the method's potential to accommodate varying degrees of site-level covariate heterogeneity.
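For reference, the textbook integer programming formulation of the k-median problem can be written as follows. The notation here is an assumption chosen for illustration and may differ from the paper's: P is the set of policy-relevant sites, C the set of candidate experimental sites, d_{ij} the distance between the covariates of policy site i and candidate j, y_j indicates that candidate j is selected, and x_{ij} indicates that policy site i is matched to candidate j.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{i \in P} \sum_{j \in C} d_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{j \in C} x_{ij} = 1 && \forall i \in P, \\
& x_{ij} \le y_j && \forall i \in P,\ j \in C, \\
& \sum_{j \in C} y_j = k, \\
& x_{ij},\, y_j \in \{0, 1\}.
\end{aligned}
```

The first constraint matches every policy site to exactly one experimental site, the second allows matches only to selected sites, and the third fixes the number of selected sites at k. Because the objective and all constraints are linear, a standard mixed-integer solver can handle the problem directly.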

Implications and Potential Developments

This work has substantial implications for the design of experiments, notably in policy evaluation and econometrics. By operationalizing site selection as a well-defined computational problem, researchers can better ensure the external validity of their experimental results. The approach also holds promise for applications beyond randomized controlled trials, extending into areas such as domain adaptation in machine learning and econometric estimation under biased sampling.

Future research could explore the efficacy of randomized site selection schemes in this framework, which the paper touches on but does not explore deeply. Moreover, further theoretical development could establish stronger connections between this approach and synthetic control methods, potentially leading to an integrated strategy combining clustering and donor weighting to enhance external validity.

Although the applicability of integer programming is demonstrated convincingly, the computational burden may become a constraint as the number of potential experimental sites grows. Integrating more flexible, perhaps heuristic-based, algorithms might further enhance the approach proposed here.
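As one illustration of what a heuristic could look like, the sketch below uses single-swap local search, a standard heuristic for k-median that is swapped in here for illustration and is not taken from the paper. It assumes, hypothetically, that candidate and policy sites coincide, that distances are Euclidean, and that the search starts from the first k sites. The function names and data are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+


def total_cost(chosen, sites):
    """Sum, over all sites, of the distance to the nearest chosen site."""
    return sum(min(dist(s, sites[j]) for j in chosen) for s in sites)


def swap_local_search(sites, k):
    """Apply the best improving single swap repeatedly until none exists."""
    chosen = list(range(k))  # assumption: arbitrary starting selection
    current = total_cost(chosen, sites)
    improved = True
    while improved:
        improved = False
        best_trial, best_cost = None, current
        for pos in range(k):
            for j in range(len(sites)):
                if j in chosen:
                    continue
                # Replace the site at position pos with the unselected site j.
                trial = chosen[:pos] + [j] + chosen[pos + 1:]
                cost = total_cost(trial, sites)
                if cost < best_cost - 1e-12:
                    best_trial, best_cost = trial, cost
        if best_trial is not None:
            chosen, current, improved = best_trial, best_cost, True
    return sorted(chosen), current


# Hypothetical covariates forming two well-separated groups.
sites = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
         (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
selected, cost = swap_local_search(sites, k=2)
print(selected, cost)
```

On this toy instance the search stops at sites 1 and 4 with total cost 4, which can be confirmed as optimal by checking all 15 pairs. In general, local search returns a local optimum only, so its output is best treated as a fast approximation or as a warm start for an exact solver.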

In conclusion, this paper provides a robust framework for site selection aimed at optimizing external validity, leveraging the computational and theoretical tools of the k-median problem. It represents a significant methodological contribution to experimental design, with numerous pathways for extension and refinement in both theoretical and applied settings.