Machine Learning Fund Categorizations (2006.00123v1)

Published 29 May 2020 in q-fin.ST, cs.LG, q-fin.CP, and stat.ML

Abstract: Given the surge in popularity of mutual funds (including exchange-traded funds (ETFs)) as a diversified financial investment, a vast variety of mutual funds from various investment management firms and diversification strategies have become available in the market. Identifying similar mutual funds among such a wide landscape of mutual funds has become more important than ever because of many applications ranging from sales and marketing to portfolio replication, portfolio diversification and tax loss harvesting. The current best method is data-vendor provided categorization, which usually relies on curation by human experts with the help of available data. In this work, we establish that an industry-wide, well-regarded categorization system is learnable using machine learning and largely reproducible, which in turn allows constructing a truly data-driven categorization. We discuss the intellectual challenges in learning this man-made system, our results and their implications.


Summary

The paper demonstrates that expert-driven mutual fund categorizations can be learned effectively using supervised machine learning on a large annotated dataset.
The paper finds that Deep Neural Networks and Random Forests outperform Decision Trees, and that equity industry breakdown and benchmark indices are among the most influential features for fund categorization.
The paper identifies misclassification challenges with 'Miscellaneous' funds, suggesting that enhanced feature engineering could further improve categorization accuracy.

Leveraging Machine Learning for Mutual Fund Categorization

Introduction to Mutual Fund Categorization Challenges

Mutual funds, including exchange-traded funds (ETFs), offer a diversified investment strategy across various asset classes, sectors, and geographical locations. As the variety of mutual funds grows, identifying similar funds for purposes such as portfolio diversification and tax loss harvesting has become increasingly important. Traditional approaches to fund categorization often rely on expert-driven systems such as those provided by data vendors like Morningstar and Lipper, which employ committees of experts to classify funds based on both qualitative and quantitative assessments. These methods have limitations, including potential biases and the subjective nature of qualitative analysis.

Previous Efforts and Limitations

Prior studies have employed data-driven unsupervised clustering approaches to categorize mutual funds based on selected variables. Such studies have revealed discrepancies between clustering results and vendor-provided categorizations, suggesting the latter may not always align with purely quantitative analyses. These investigations underline the complexity of quantifying fund similarity, reflecting challenges in removing biases and emotional aspects from the process, identifying similarity metrics, and determining the nonlinear relationships among features.

Our Approach to Mutual Fund Categorization

This paper introduces a novel approach to mutual fund categorization by framing the problem as a supervised multi-class classification task aimed at learning the Morningstar categorization system. Utilizing a dataset of 10,300 mutual funds from Morningstar Direct, annotated with labels from Morningstar's Global categories, we deploy Decision Tree, Random Forest, and Deep Neural Network models to predict fund categorizations based on a variety of features including asset allocation, equity industry breakdown, fixed income sector information, and benchmarks.
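The supervised multi-class setup described above can be sketched in a few lines. This is a minimal illustration only, assuming scikit-learn and a small synthetic dataset: the feature names, the four category profiles, and all hyperparameters are hypothetical placeholders, not the Morningstar Direct data or the configuration used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical aggregate-level features (NOT the paper's feature set).
feature_names = ["pct_equity", "pct_bond", "pct_cash", "pct_tech", "pct_financials"]

# Hypothetical mean profile for each of four made-up fund categories.
centers = np.array([
    [0.90, 0.05, 0.05, 0.50, 0.10],  # technology-heavy equity
    [0.90, 0.05, 0.05, 0.10, 0.50],  # financials-heavy equity
    [0.10, 0.85, 0.05, 0.00, 0.00],  # fixed income
    [0.50, 0.40, 0.10, 0.15, 0.15],  # mixed allocation
])
funds_per_category = 150

# Synthetic funds: each one is its category profile plus Gaussian noise.
X = np.vstack([
    center + rng.normal(scale=0.08, size=(funds_per_category, len(feature_names)))
    for center in centers
])
y = np.repeat(np.arange(len(centers)), funds_per_category)

# Hold out a stratified test set so every category is represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The three model families named in the text, with illustrative settings.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural_network": MLPClassifier(
        hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0
    ),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)  # mean accuracy on held-out funds
    print(f"{name}: held-out accuracy = {accuracy[name]:.3f}")
```

Because the synthetic categories are well separated, all three models score highly here; the ranking among models on real fund data is an empirical question that this toy example cannot answer.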

Key Insights and Results

Our research demonstrates that the Morningstar fund categorization system is learnable and largely reproducible through machine learning techniques. This finding is significant because it supports the hypothesis that expert-driven fund categorization can be mirrored by algorithms that use only aggregate-level holdings information.

Model Performance: Among the models evaluated, Deep Neural Networks achieved the highest accuracy and overall performance, closely followed by Random Forest. Both models significantly outperformed the baseline established by the Decision Tree model.
Feature Importance: Random Forest results highlighted the most influential features in fund categorization, with Equity Industry, S&P Dow Jones Benchmark, and FTSE/Russell Benchmark emerging as critical predictors. This insight suggests an emphasis on industry breakdown and benchmark comparisons in the categorization process.
Misclassifications: Analysis of misclassifications shed light on areas where more information or alternative features could enhance categorization accuracy. In particular, funds with 'Miscellaneous' in their category name posed challenges, potentially due to a lack of distinguishing benchmarks or intrinsic features.
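The feature-importance and misclassification analyses summarized above can be illustrated with a small sketch. It assumes scikit-learn, uses synthetic data with hypothetical feature and category names, and relies on scikit-learn's default impurity-based importance, which may differ from the measure used in the paper. The made-up 'Miscellaneous' category is deliberately given a broad, overlapping profile so that it is harder to separate.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical names, for illustration only.
feature_names = ["pct_technology", "pct_financials", "random_noise"]
categories = ["Technology Equity", "Financials Equity", "Miscellaneous"]
n = 200  # synthetic funds per category

# Two tight clusters plus a broad 'Miscellaneous' cluster that overlaps both.
tech = rng.normal([0.7, 0.1], 0.08, size=(n, 2))
fin = rng.normal([0.1, 0.7], 0.08, size=(n, 2))
misc = rng.normal([0.4, 0.4], 0.20, size=(n, 2))

# A third column of pure noise that carries no category information.
noise = rng.normal(size=(3 * n, 1))
X = np.hstack([np.vstack([tech, fin, misc]), noise])
y = np.repeat(np.arange(3), n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances (scikit-learn's default); they sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")

# Rows are true categories, columns are predicted categories.
cm = confusion_matrix(y_test, forest.predict(X_test), labels=[0, 1, 2])
recall = cm.diagonal() / cm.sum(axis=1)

for i, true_name in enumerate(categories):
    print(f"{true_name}: recall = {recall[i]:.2f}")
    for j, pred_name in enumerate(categories):
        if i != j and cm[i, j] > 0:
            print(f"  {cm[i, j]} fund(s) predicted as {pred_name}")
```

Reading the confusion matrix row by row shows which true categories lose funds to which predicted categories, which is one way to spot categories that may need additional distinguishing features.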

Future Directions and Conclusion

The ability to algorithmically learn and reproduce industry-standard fund categorizations opens doors to several applications, including automated categorization of new funds and detailed analysis of categorization discrepancies. Future research could explore more granular categorizations, incorporate additional variables, or apply more advanced machine learning models to further dissect and understand the landscape of mutual funds.

In sum, this paper establishes a foundational step towards constructing a truly data-driven categorization of mutual funds, offering both a validation of the learnability of expert-driven systems and a blueprint for future advancements in financial product categorization algorithms.
