CoLiDE: Concomitant Linear DAG Estimation (2310.02895v2)
Abstract: We deal with the combinatorial problem of learning directed acyclic graph (DAG) structure from observational data adhering to a linear structural equation model (SEM). Leveraging advances in differentiable, nonconvex characterizations of acyclicity, recent efforts have advocated a continuous constrained optimization paradigm to efficiently explore the space of DAGs. Most existing methods employ lasso-type score functions to guide this search, which (i) require expensive penalty parameter retuning when the *unknown* SEM noise variances change across problem instances; and (ii) implicitly rely on limiting homoscedasticity assumptions. In this work, we propose a new convex score function for sparsity-aware learning of linear DAGs, which incorporates concomitant estimation of scale and thus effectively decouples the sparsity parameter from the exogenous noise levels. Regularization via a smooth, nonconvex acyclicity penalty term yields CoLiDE (**Co**ncomitant **Li**near **D**AG **E**stimation), a regression-based criterion amenable to efficient gradient computation and closed-form estimation of noise variances in heteroscedastic scenarios. Our algorithm outperforms state-of-the-art methods without incurring added complexity, especially when the DAGs are larger and the noise level profile is heterogeneous. We also find that CoLiDE exhibits enhanced stability, manifested via reduced standard deviations in several domain-specific metrics, underscoring the robustness of our novel linear DAG estimator.
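To make the abstract's key idea concrete, the sketch below illustrates a concomitant (scale-estimating) score of the kind CoLiDE uses in the equal-variance case, paired with a DAGMA-style log-determinant acyclicity penalty. This is a minimal sketch, assuming the equal-variance objective $(1/(2n\sigma))\lVert X - XW\rVert_F^2 + d\sigma/2 + \lambda\lVert W\rVert_1$; the function names, the toy data, and the choice of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def logdet_acyclicity(W, s=1.0):
    """DAGMA-style penalty h(W) = -logdet(s*I - W∘W) + d*log(s).
    h(W) = 0 exactly when W is the weighted adjacency matrix of a DAG
    (within the admissible M-matrix domain)."""
    d = W.shape[0]
    sign, logabsdet = np.linalg.slogdet(s * np.eye(d) - W * W)
    return -logabsdet + d * np.log(s)

def colide_ev_score(X, W, sigma, lam):
    """Concomitant score for the equal-variance (EV) case:
    (1/(2*n*sigma))*||X - XW||_F^2 + d*sigma/2 + lam*||W||_1.
    Estimating sigma jointly decouples lam from the noise level."""
    n, d = X.shape
    R = X - X @ W
    return np.sum(R**2) / (2 * n * sigma) + d * sigma / 2 + lam * np.abs(W).sum()

def sigma_closed_form(X, W):
    """Closed-form minimizer of the EV score in sigma:
    sigma* = ||X - XW||_F / sqrt(n*d)."""
    n, d = X.shape
    R = X - X @ W
    return np.sqrt(np.sum(R**2) / (n * d))

# Toy usage on a 3-node chain X1 -> X2 -> X3 with unit-variance noise.
rng = np.random.default_rng(0)
W_true = np.array([[0., 1., 0.],
                   [0., 0., 1.],
                   [0., 0., 0.]])
n = 500
N = rng.normal(size=(n, 3))
X = N @ np.linalg.inv(np.eye(3) - W_true)  # X = X W_true + N

sigma = sigma_closed_form(X, W_true)       # recovers roughly the true scale (~1)
print(colide_ev_score(X, W_true, sigma, lam=0.1))
print(logdet_acyclicity(W_true))           # 0 for a DAG
```

Note how the decoupling works: rescaling the noise rescales the closed-form sigma, which rescales the data-fit term accordingly, so the relative weight of the lam*||W||_1 penalty is left largely intact and lam need not be retuned per instance.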