$\texttt{causalAssembly}$: Generating Realistic Production Data for Benchmarking Causal Discovery (2306.10816v2)

Published 19 Jun 2023 in stat.ML, cs.LG, and stat.ME

Abstract: Algorithms for causal discovery have recently undergone rapid advances and increasingly draw on flexible nonparametric methods to process complex data. With these advances comes a need for adequate empirical validation of the causal relationships learned by different algorithms. However, for most real data sources true causal relations remain unknown. This issue is further compounded by privacy concerns surrounding the release of suitable high-quality data. To help address these challenges, we gather a complex dataset comprising measurements from an assembly line in a manufacturing context. This line consists of numerous physical processes for which we are able to provide ground truth causal relationships on the basis of a detailed study of the underlying physics. We use the assembly line data and associated ground truth information to build a system for generation of semisynthetic manufacturing data that supports benchmarking of causal discovery methods. To accomplish this, we employ distributional random forests in order to flexibly estimate and represent conditional distributions that may be combined into joint distributions that strictly adhere to a causal model over the observed variables. The estimated conditionals and tools for data generation are made available in our Python library $\texttt{causalAssembly}$. Using the library, we showcase how to benchmark several well-known causal discovery algorithms.
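The generation scheme the abstract describes admits a compact summary: given the ground-truth DAG, fit one conditional model per variable on its parents, then sample variables in topological order so the synthetic joint distribution factorizes exactly according to the causal graph. The sketch below illustrates this idea only; it is not the $\texttt{causalAssembly}$ API, all names in it are hypothetical, and it substitutes an ordinary random forest with bootstrapped residuals for the distributional random forests the paper actually employs.

```python
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestRegressor


def fit_conditionals(dag, data):
    """Fit one forest per node, regressing it on its DAG parents.

    Stand-in for the distributional random forests (DRFs) in the paper:
    each conditional here is modeled as mean prediction plus a
    bootstrapped residual, a simplification of the full conditional
    distribution a DRF would represent.
    """
    models = {}
    for node in dag.nodes:
        parents = sorted(dag.predecessors(node))
        if not parents:  # source node: keep its observed marginal sample
            models[node] = (parents, None, data[node])
            continue
        X = np.column_stack([data[p] for p in parents])
        rf = RandomForestRegressor(n_estimators=200, random_state=0)
        rf.fit(X, data[node])
        models[node] = (parents, rf, data[node] - rf.predict(X))
    return models


def sample_semisynthetic(dag, models, n, seed=0):
    """Draw n rows whose joint distribution factorizes along the DAG."""
    rng = np.random.default_rng(seed)
    out = {}
    for node in nx.topological_sort(dag):  # parents are always sampled first
        parents, rf, resid = models[node]
        if not parents:
            out[node] = rng.choice(resid, size=n)  # bootstrap the marginal
        else:
            X = np.column_stack([out[p] for p in parents])
            out[node] = rf.predict(X) + rng.choice(resid, size=n)
    return out
```

In the library itself, distributional random forests represent the full conditional distribution of each node given its parents rather than only the conditional mean, which is what allows the semisynthetic data to reproduce the complex, non-Gaussian behavior of the real assembly-line measurements while still adhering strictly to the ground-truth causal model.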
