Inference for Regression with Variables Generated by AI or Machine Learning (2402.15585v5)

Published 23 Feb 2024 in econ.EM and stat.ML

Abstract: Researchers now routinely use AI or other machine learning methods to estimate latent variables of economic interest, then plug in the estimates as covariates in a regression. We show both theoretically and empirically that naively treating AI/ML-generated variables as "data" leads to biased estimates and invalid inference. To restore valid inference, we propose two methods: (1) an explicit bias correction with bias-corrected confidence intervals, and (2) joint estimation of the regression parameters and latent variables. We illustrate these ideas through applications involving label imputation, dimensionality reduction, and index construction via classification and aggregation.

Summary

  • The paper demonstrates that the traditional two-step approach leads to biased regression estimates due to measurement error in AI-generated variables.
  • It employs Hamiltonian Monte Carlo to efficiently perform high-dimensional integration and jointly estimate both information retrieval and econometric models.
  • Empirical tests on CEO behavior confirm that the one-step strategy produces less biased coefficient estimates even with limited unstructured data.

Addressing Bias in Analyzing Unstructured Data with a One-Step Inference Strategy

Analyzing unstructured data, such as text, images, and audio recordings, is becoming increasingly important in empirical research, particularly in economics. Typically, this analysis follows a two-step strategy: first, quantitative representations are derived from the unstructured data using information retrieval models; second, these representations are treated as data in downstream econometric models. While pragmatic, this approach suffers from measurement error, which the paper examines rigorously. The paper shows that the conventional two-step strategy leads to biased inference on downstream regression coefficients because of this measurement error. The magnitude of the bias depends on the relative sizes of measurement error and sampling error, so that under some conditions it leads to incorrect empirical conclusions.
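The interplay between measurement error and sampling error can be illustrated with a small simulation. The sketch below is not the paper's design: it assumes a textbook classical measurement error setup in which a standard normal latent variable is estimated with independent noise of variance 1/m, where m stands in for the amount of unstructured data per observation. The function name `two_step_slope` and all parameter values are hypothetical.

```python
import numpy as np

def two_step_slope(n, m, beta=1.0, seed=0):
    """Regress y on a noisy first-stage estimate of a latent variable.

    Assumed toy model (not taken from the paper):
      theta_i   ~ N(0, 1)                  latent variable
      theta_hat = theta + N(0, 1/m)        first-stage estimate
      y_i       = beta * theta_i + N(0, 1) outcome
    Returns the OLS slope on theta_hat and its conventional standard error.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n)
    theta_hat = theta + rng.normal(size=n) / np.sqrt(m)
    y = beta * theta + rng.normal(size=n)

    X = np.column_stack([np.ones(n), theta_hat])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1], se

if __name__ == "__main__":
    n = 20_000
    for m in (1, 10, 100):
        slope, se = two_step_slope(n=n, m=m)
        # Under classical measurement error the slope converges to
        # beta * Var(theta) / (Var(theta) + 1/m) = m / (m + 1) when beta = 1.
        print(f"m={m:>3}  slope={slope:.3f}  se={se:.3f}  limit={m / (m + 1):.3f}")
```

In this toy setting the standard error shrinks with the sample size n while the bias depends only on m, so with many observations and little data per observation the naive confidence interval can sit tightly around the wrong value.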

Theoretical Insights and Practical Solutions

Through a detailed examination, the paper provides a comprehensive theoretical framework that illustrates why and how the two-step strategy can lead to biased inference. This issue arises because the estimated latent variables inherent in the representations of unstructured data are treated as observed variables in subsequent econometric analysis, overlooking the measurement error introduced in the first step. As an alternative, the paper proposes a robust one-step strategy for valid inference that jointly estimates both the information retrieval and econometric models, thereby accommodating the measurement error directly.
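The contrast between the two strategies can be written schematically. The notation below is an assumption made for illustration and is not taken from the paper: $y_i$ is the outcome, $x_i$ the unstructured data for observation $i$, $\theta_i$ the latent variable, $\beta$ the regression parameters, and $\phi$ the parameters of the information retrieval model.

```latex
% Two-step: a first-stage estimate \hat{\theta}_i is plugged in as if it were observed.
\begin{align}
\hat{\beta}_{\text{two-step}}
  &= \arg\max_{\beta} \sum_{i=1}^{n} \log p\bigl(y_i \mid \hat{\theta}_i, \beta\bigr)
\end{align}

% One-step: the latent \theta_i is integrated out of the joint model for (y_i, x_i).
\begin{align}
(\hat{\beta}, \hat{\phi})
  &= \arg\max_{\beta,\, \phi} \sum_{i=1}^{n} \log \int
     p\bigl(y_i \mid \theta_i, \beta\bigr)\,
     p\bigl(x_i \mid \theta_i, \phi\bigr)\,
     p(\theta_i)\, d\theta_i
\end{align}
```

The first expression ignores the gap between $\hat{\theta}_i$ and $\theta_i$, while the second carries the uncertainty about $\theta_i$ into the estimate of $\beta$. A Bayesian version would place priors on $\beta$ and $\phi$ and sample from the joint posterior rather than maximize.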

Computational Methodology: A Novel Inference Approach

Implementing the one-step strategy is computationally demanding because it requires high-dimensional numerical integration. The paper addresses this with Hamiltonian Monte Carlo (HMC), a Markov chain Monte Carlo algorithm well suited to sampling from complex, high-dimensional distributions. Combined with modern probabilistic programming languages, HMC makes inference scalable and efficient on large datasets. Simulation exercises and an empirical application analyzing CEO behavior illustrate the approach and show that it outperforms the conventional two-step method.
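To make the role of HMC concrete, here is a minimal hand-written sampler for a toy joint model. It is a sketch under stated assumptions, not the paper's implementation, which relies on probabilistic programming languages. The model, priors, step size, and the names `log_post`, `grad_log_post`, and `hmc_step` are all hypothetical: each latent variable has a standard normal prior, one noisy measurement with known noise scale `tau`, and an outcome that is linear in the latent variable with unit error variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data; every value here is an illustrative assumption.
n, beta_true, tau = 500, 1.0, 1.0
theta_true = rng.normal(size=n)
x = theta_true + tau * rng.normal(size=n)         # noisy measurement of theta
y = beta_true * theta_true + rng.normal(size=n)   # outcome

def log_post(q):
    """Joint log posterior of q = (beta, theta_1..theta_n), up to a constant."""
    beta, theta = q[0], q[1:]
    return (-0.5 * beta**2 / 100.0                     # beta ~ N(0, 10^2)
            - 0.5 * np.sum(theta**2)                   # theta_i ~ N(0, 1)
            - 0.5 * np.sum((x - theta)**2) / tau**2    # measurement model
            - 0.5 * np.sum((y - beta * theta)**2))     # regression model

def grad_log_post(q):
    beta, theta = q[0], q[1:]
    resid = y - beta * theta
    g = np.empty_like(q)
    g[0] = -beta / 100.0 + np.sum(theta * resid)
    g[1:] = -theta + (x - theta) / tau**2 + beta * resid
    return g

def hmc_step(q, eps, n_leapfrog):
    """One HMC transition: leapfrog integration plus a Metropolis correction."""
    p0 = rng.normal(size=q.size)                   # resample momentum
    q_new = q.copy()
    p = p0 + 0.5 * eps * grad_log_post(q_new)      # initial half step
    for step in range(n_leapfrog):
        q_new = q_new + eps * p
        g = grad_log_post(q_new)
        # Full momentum steps inside the trajectory, half step at the end.
        p = p + (eps if step < n_leapfrog - 1 else 0.5 * eps) * g
    h_old = -log_post(q) + 0.5 * (p0 @ p0)
    h_new = -log_post(q_new) + 0.5 * (p @ p)
    if np.log(rng.uniform()) < h_old - h_new:
        return q_new, True
    return q, False

# Two-step benchmark: treat the noisy measurement x as if it were theta.
beta_naive = (x @ y) / (x @ x)

# One-step: sample beta and all theta_i jointly, starting at the plug-in values.
q = np.concatenate([[beta_naive], x])
n_iter, n_warmup = 1200, 300
draws, n_accept = [], 0
for it in range(n_iter):
    q, accepted = hmc_step(q, eps=0.025, n_leapfrog=40)
    if it >= n_warmup:
        draws.append(q[0])
        n_accept += accepted

beta_post = float(np.mean(draws))
accept_rate = n_accept / (n_iter - n_warmup)
print(f"two-step slope: {beta_naive:.3f}")
print(f"one-step posterior mean of beta: {beta_post:.3f} (acceptance {accept_rate:.2f})")
```

The sampler moves through all 501 dimensions at once, which is how the integration over the latent variables is carried out in practice: averaging the draws of `beta` marginalizes over the `theta` draws. A production analysis would use a tuned sampler such as NUTS in Stan or NumPyro rather than a fixed step size and trajectory length.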

Empirical Validation and Insights

The empirical analysis revisits a study of CEO time use, contrasting findings from the one-step and two-step strategies. Notably, when the amount of unstructured data per observation is limited, the one-step strategy yields considerably less biased estimates of regression coefficients relating CEO behavior to firm performance. These outcomes are consistent across the simulation settings and the empirical application, underscoring the importance of the proposed methodology in addressing measurement error.

Towards Robust Analysis of Unstructured Data

This research advances the empirical analysis of unstructured data by providing a rigorous methodological foundation for dealing with measurement error. By integrating the information retrieval and econometric models, the one-step strategy offers a more accurate and theoretically sound approach to analyzing unstructured data. As the volume of such data continues to grow, this methodology is likely to become an important tool for researchers who use it in empirical analysis.

Future Directions and Scalability

Looking ahead, the paper acknowledges the scalability limitations of HMC and suggests that alternative methods, such as variational inference, may offer viable solutions for analyzing massive datasets. The paper's insights and methodologies not only have immediate practical applications but also open avenues for future research in developing scalable and statistically robust tools for analyzing unstructured data.

In conclusion, this paper makes a pivotal contribution to the literature, challenging the prevailing two-step strategy and providing a compelling alternative that mitigates bias through a sophisticated computational approach. Its ramifications extend beyond economics, offering a valuable framework for any field grappling with the analysis of unstructured data.