Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy To Game (2102.13647v3)

Published 26 Feb 2021 in stat.ML, cs.LG, and stat.ME

Abstract: Simulated DAG models may exhibit properties that, perhaps inadvertently, render their structure identifiable and unexpectedly affect structure learning algorithms. Here, we show that marginal variance tends to increase along the causal order for generically sampled additive noise models. We introduce varsortability as a measure of the agreement between the order of increasing marginal variance and the causal order. For commonly sampled graphs and model parameters, we show that the remarkable performance of some continuous structure learning algorithms can be explained by high varsortability and matched by a simple baseline method. Yet, this performance may not transfer to real-world data where varsortability may be moderate or dependent on the choice of measurement scales. On standardized data, the same algorithms fail to identify the ground-truth DAG or its Markov equivalence class. While standardization removes the pattern in marginal variance, we show that data generating processes that incur high varsortability also leave a distinct covariance pattern that may be exploited even after standardization. Our findings challenge the significance of generic benchmarks with independently drawn parameters. The code is available at https://github.com/Scriddie/Varsortability.

Citations (113)

Summary

  • The paper demonstrates that simulated DAG benchmarks exhibit varsortability, biasing the evaluation of causal discovery methods.
  • The authors reveal that algorithms leveraging marginal variance patterns perform well on synthetic data yet may fail on real-world datasets.
  • A simple baseline combining variable sorting and sparse regression underscores vulnerabilities in current causal discovery benchmarking practices.

Expert Analysis: The Challenges of Causal Discovery Benchmarks with Simulated DAGs

The paper "Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy To Game" by Reisach et al. examines the limitations of using simulated Directed Acyclic Graphs (DAGs) as benchmarks for causal discovery algorithms. The authors argue that simulation procedures often unintentionally introduce patterns that can be exploited by structure learning algorithms, thereby misleading the assessment of their capabilities.

The authors introduce the concept of "varsortability," which quantifies the likelihood that the marginal variances of variables increase along the causal order in a simulated dataset. They demonstrate that for commonly sampled graphs and model parameters, a high degree of varsortability is observed. This property can explain the superior performance of certain continuous structure learning algorithms in synthetic settings. However, this performance does not necessarily transfer to real-world data, where the same varsortability cannot be assumed unless the data scales are appropriately known or adjusted.
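The paper defines varsortability via counts of directed paths along which variance increases. The following is a minimal sketch of that idea, assuming a weighted count over powers of the adjacency matrix as described in the paper; the function name, tie tolerance, and path-counting details here are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def varsortability(X, A, tol=1e-9):
    """Fraction of directed paths along which marginal variance increases.

    X: (n, d) data matrix.
    A: (d, d) binary adjacency matrix with A[i, j] = 1 for an edge i -> j.
    Ties within `tol` count as 1/2, following the spirit of the paper's measure.
    """
    d = A.shape[0]
    var = X.var(axis=0)
    Ak = np.eye(d)
    agree, total = 0.0, 0.0
    for _ in range(d - 1):
        Ak = Ak @ A                       # entries of A^k count directed paths of length k
        src, dst = np.nonzero(Ak)
        n_paths = Ak[src, dst]
        diff = var[dst] - var[src]
        agree += np.sum(n_paths * ((diff > tol) + 0.5 * (np.abs(diff) <= tol)))
        total += np.sum(n_paths)
    return agree / total if total > 0 else np.nan
```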

Importantly, the paper shows that when data is standardized, causing the marginal variance pattern to disappear, these algorithms often fail to accurately reconstruct the underlying causal structure or even its Markov equivalence class. The paper underscores that many benchmarks may inadvertently be gamed because of the regularity patterns in synthetic data, particularly when edge weights and noise parameters are independently and identically drawn.
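To illustrate how such a pattern arises, here is a small, hypothetical linear additive noise simulation with independently drawn edge weights; the graph density, weight range, and sample size are arbitrary assumptions rather than the paper's benchmark settings. Raw marginal variances tend to grow along the causal order, whereas standardization flattens them, removing the cue that variance-sorting methods exploit.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000

# Upper-triangular DAG (columns are already in causal order), weights drawn iid
A = np.triu(rng.random((d, d)) < 0.3, k=1).astype(float)
W = A * rng.uniform(0.5, 2.0, size=(d, d)) * rng.choice([-1.0, 1.0], size=(d, d))

# Linear additive noise model: X = X @ W + noise  =>  X = noise @ (I - W)^{-1}
noise = rng.normal(size=(n, d))
X = noise @ np.linalg.inv(np.eye(d) - W)

print(X.var(axis=0))                 # variances tend to increase along the causal order
X_std = (X - X.mean(0)) / X.std(0)   # standardization removes the marginal variance pattern
print(X_std.var(axis=0))             # all ones: sorting by variance is now uninformative
```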

The authors put forth a simple baseline method, termed "sortnregress", that sorts variables by their marginal variances and then applies sparse regression for edge selection. This minimalist approach exemplifies the extent to which current benchmarks might be biased in favor of methods that exploit marginal variance properties, and its strong performance suggests that benchmarks which ostensibly demand sophisticated causal inference can be matched without it and are therefore not robust.
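A minimal sketch of such a baseline is shown below. It uses Lasso with BIC-based model selection from scikit-learn as one plausible instantiation of the sparse regression step; the implementation in the linked repository may differ in these details.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

def sortnregress(X):
    """Baseline sketch: order nodes by marginal variance, then pick parents
    among lower-variance nodes via sparse regression (Lasso with BIC)."""
    n, d = X.shape
    W = np.zeros((d, d))
    order = np.argsort(X.var(axis=0))      # candidate causal order: increasing variance
    for pos, j in enumerate(order):
        candidates = order[:pos]           # nodes earlier in the variance order
        if len(candidates) == 0:
            continue
        lasso = LassoLarsIC(criterion="bic")
        lasso.fit(X[:, candidates], X[:, j])
        W[candidates, j] = lasso.coef_     # nonzero coefficients are predicted edges
    return W
```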

A notable strength of the paper lies in its quantitative exploration of how the performance of causal discovery algorithms varies with data standardization and measurement scales. In particular, the continuous structure learning algorithms perform remarkably well on high-varsortability datasets, suggesting that their optimization dynamics are dominated by marginal variance patterns.

While the paper primarily addresses linear additive noise models, similar concerns about varsortability extend to non-linear settings, motivating the authors' call to empirically evaluate and report varsortability for synthetic DAG simulations.

The implications of this research are twofold. Practically, it advises caution against over-reliance on simulated benchmarks for evaluating causal discovery algorithms without accounting for the effects of varsortability. Theoretically, it challenges the assumption that results on generically simulated causal models carry over to real-world data unless accompanied by appropriate checks against simulation biases.

In conclusion, while the paper provides a critical lens on how structure learning algorithms are evaluated, it also paves the way for more robust and nuanced benchmarks that emphasize comparability across scales and distributions. Moving forward, this might involve developing simulation methods that reflect the variability and non-ideal conditions found in real-world data, paired with analytical tools to better assess the robustness of causal discovery methods.
