- The paper demonstrates that data simulated for DAG benchmarks typically exhibits high varsortability, biasing the evaluation of causal discovery methods.
- The authors reveal that algorithms leveraging marginal variance patterns perform well on synthetic data yet may fail on real-world datasets.
- A simple baseline combining variable sorting and sparse regression underscores vulnerabilities in current causal discovery benchmarking practices.
Expert Analysis: The Challenges of Causal Discovery Benchmarks with Simulated DAGs
The paper "Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy To Game" by Reisach et al. examines the limitations of using simulated Directed Acyclic Graphs (DAGs) as benchmarks for causal discovery algorithms. The authors argue that simulation procedures often unintentionally introduce patterns that can be exploited by structure learning algorithms, thereby misleading the assessment of their capabilities.
The authors introduce the concept of "varsortability," which measures the extent to which marginal variances increase along the causal order in a simulated dataset (formally, the fraction of directed paths whose endpoints are ordered by increasing marginal variance). They demonstrate that commonly sampled graphs and model parameters yield a high degree of varsortability. This property can explain the superior performance of certain continuous structure learning algorithms in synthetic settings. However, this performance does not necessarily transfer to real-world data, where the same varsortability cannot be assumed unless the measurement scales are known or appropriately adjusted.
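To make the definition concrete, the following minimal sketch (our own NumPy illustration, not the authors' reference implementation) computes varsortability for a data matrix X and a ground-truth binary adjacency matrix A, counting the fraction of directed paths along which the marginal variance increases, with ties counted as one half:

```python
import numpy as np

def varsortability(X, A, tol=1e-9):
    """Fraction of directed paths whose endpoints have increasing marginal variance.

    X : (n_samples, d) data matrix.
    A : (d, d) binary adjacency matrix of the ground-truth DAG, A[i, j] = 1 for edge i -> j.
    Returns a value in [0, 1]; 1 means variances increase along every causal path.
    """
    d = A.shape[0]
    var = np.var(X, axis=0)
    Ak = np.eye(d)
    n_increasing, n_paths = 0.0, 0.0
    for _ in range(d - 1):
        Ak = Ak @ A                              # Ak[i, j] > 0 iff a path of the current length exists
        for i, j in zip(*np.nonzero(Ak)):
            n_paths += 1
            if var[j] > var[i] + tol:            # variance increases along the path
                n_increasing += 1
            elif abs(var[j] - var[i]) <= tol:    # ties count as one half
                n_increasing += 0.5
    return n_increasing / max(n_paths, 1)
```

For data simulated from linear models with i.i.d. edge weights and noise scales, this value is typically close to 1, whereas 0.5 corresponds to a completely uninformative variance ordering.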
Importantly, the paper shows that when the data is standardized, erasing the marginal variance pattern, these algorithms often fail to recover the underlying causal structure or even its Markov equivalence class. The paper underscores that many benchmarks may inadvertently be gamed because of regularities in synthetic data, which arise in particular when edge weights and noise parameters are drawn independently and identically.
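As a small illustration of why standardization removes this cue (a self-contained sketch assuming a linear chain with unit edge weights and unit-variance noise, not an experiment reproduced from the paper), variance accumulates downstream in the raw data but is identical across variables after z-scoring:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5

# Simulate a causal chain X1 -> X2 -> ... -> X5 with unit edge weights and unit-variance noise.
X = np.zeros((n, d))
X[:, 0] = rng.normal(size=n)
for j in range(1, d):
    X[:, j] = X[:, j - 1] + rng.normal(size=n)

print(np.var(X, axis=0).round(2))        # roughly [1, 2, 3, 4, 5]: variance grows with causal depth
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.var(X_std, axis=0).round(2))    # [1, 1, 1, 1, 1]: the marginal-variance cue is gone
```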
The authors put forth a simple baseline that sorts variables by their marginal variances and then applies sparse regression for edge selection, termed "sortnregress." This minimalist approach exemplifies the extent to which current benchmarks can be biased in favor of methods that exploit marginal variance properties: whenever it performs competitively, the benchmark evidently does not demand the more complex causal inference it is meant to test.
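A minimal sketch of the sortnregress idea (a reimplementation using scikit-learn's LassoLarsIC with a BIC criterion; the authors' exact implementation may differ in details) orders nodes by increasing marginal variance and then regresses each node on all nodes preceding it in that order, keeping the non-zero coefficients as edges:

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

def sortnregress(X):
    """Estimate a weighted adjacency matrix via variance sorting plus sparse regression.

    X : (n_samples, d) data matrix. Returns W with W[i, j] the estimated effect of node i on node j.
    """
    n, d = X.shape
    order = np.argsort(np.var(X, axis=0))    # candidate causal order: increasing marginal variance
    W = np.zeros((d, d))
    for pos, j in enumerate(order):
        parents = order[:pos]                # only nodes earlier in the order may be parents
        if len(parents) == 0:
            continue
        reg = LassoLarsIC(criterion="bic")   # sparse regression selects the active parents
        reg.fit(X[:, parents], X[:, j])
        W[parents, j] = reg.coef_
    return W
```

The BIC-penalized Lasso is one natural choice for the sparse regression step; any sparse selector over the candidate parent set would serve the same diagnostic role.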
A notable strength of the paper lies in its quantitative exploration of how the performance of causal discovery algorithms varies with data standardization and measurement scale. In particular, continuous structure learning algorithms perform remarkably well on high-varsortability datasets, suggesting that their optimization dynamics are dominated by marginal variance patterns.
While the paper primarily addresses linear additive noise models, similar concerns about varsortability may extend to non-linear settings, motivating the authors' call to measure and report varsortability for synthetic DAG simulations.
The implications of this research are twofold. Practically, it advises caution against relying on simulated benchmarks to evaluate causal discovery algorithms without accounting for the effects of varsortability. Theoretically, it challenges assumptions about the generic transferability of results on simulated causal models to real-world data unless accompanied by appropriate checks against simulation biases.
In conclusion, while the paper provides a critical lens on how structure learning algorithms are evaluated, it also paves the way for more robust and nuanced benchmarks that emphasize comparability across various scales and distributions. Moving forward might involve developing simulation methods that reflect the variability and non-ideal conditions found in real-world data, paired with analytical tools to better assess the robustness of causal discovery methods.