Inference for Regression with Variables Generated by AI or Machine Learning (2402.15585v5)
Published 23 Feb 2024 in econ.EM and stat.ML
Abstract: Researchers now routinely use AI or other machine learning methods to estimate latent variables of economic interest, then plug in the estimates as covariates in a regression. We show both theoretically and empirically that naively treating AI/ML-generated variables as "data" leads to biased estimates and invalid inference. To restore valid inference, we propose two methods: (1) an explicit bias correction with bias-corrected confidence intervals, and (2) joint estimation of the regression parameters and latent variables. We illustrate these ideas through applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.
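To make the abstract's central claim concrete, the following is a minimal simulation sketch of how plugging a noisy estimate of a latent variable into a regression can bias the slope. It is not the paper's model or its proposed corrections: it assumes a textbook classical measurement-error setup, and the names `simulate`, `ols_slope`, `noise_sd`, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))


def simulate(n=200_000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare regressing on the true latent variable vs. a noisy estimate.

    Assumption (illustrative only): the generated variable equals the true
    latent variable plus independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n)                 # latent variable, variance 1
    y = beta * theta + rng.normal(size=n)      # outcome depends on the true latent
    theta_hat = theta + noise_sd * rng.normal(size=n)  # stand-in for an ML estimate
    return ols_slope(theta, y), ols_slope(theta_hat, y)


oracle, naive = simulate()
print(f"slope using true latent variable: {oracle:.3f}")
print(f"slope using noisy estimate:       {naive:.3f}")
```

Under these assumptions the naive slope converges to `beta / (1 + noise_sd**2)` rather than `beta`, so with the defaults it sits near 1 while the infeasible regression on the true latent variable recovers a value near 2. Because conventional standard errors are centered on the biased estimate, confidence intervals built from them would also miss the true coefficient in large samples.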