Develop oversampling methodology for high-dimensional text embeddings
Develop a principled adaptation of synthetic minority oversampling techniques such as SMOTE for the high-dimensional text embedding features used in text-based forecasting, so that class imbalance can be addressed without distorting the embedding space or introducing artifacts. Identify an approach that is compatible with the sentence- or document-level embeddings employed in credit rating forecasting.
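For orientation, the kind of baseline this problem starts from can be sketched as follows. This is a minimal illustration of plain SMOTE-style interpolation applied to embedding vectors, not a proposed solution and not the method of the cited paper. The function name `smote_embeddings`, the use of cosine similarity for neighbor search, and the re-normalization of synthetic points to unit length are all assumptions made for illustration; whether such choices preserve the structure of the embedding space is exactly what remains unresolved.

```python
import numpy as np


def smote_embeddings(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style oversampling of minority-class embeddings.

    X_min : array of shape (n_samples, dim), minority-class embeddings
            (assumed to contain no all-zero rows).
    n_new : number of synthetic embeddings to generate.
    k     : number of nearest neighbors considered per sample.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_min, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Assumption: normalize rows so that dot products are cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Pairwise cosine similarity; exclude each point from its own neighbors.
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)
    neighbors = np.argsort(-sim, axis=1)[:, :k]  # k most similar per row

    # Standard SMOTE step: pick a base point, pick one of its neighbors,
    # and interpolate linearly with a random weight in [0, 1).
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    synthetic = X[base] + lam * (X[chosen] - X[base])

    # Assumption: project back onto the unit sphere, since linear
    # interpolation between unit vectors shortens the result.
    synthetic /= np.linalg.norm(synthetic, axis=1, keepdims=True)
    return synthetic


if __name__ == "__main__":
    # Random vectors stand in for real sentence or document embeddings.
    demo = np.random.default_rng(1).normal(size=(20, 32))
    print(smote_embeddings(demo, n_new=10).shape)
```

The sketch makes the open questions concrete: in high dimensions nearest neighbors become less meaningful, and a point interpolated between two embeddings need not correspond to any plausible text, so the distance metric, the interpolation path, and the projection step would each need justification.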
References
There is no consensus on how SMOTE could be applied to high-dimensional text embeddings.
— Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs
(2407.17624 - Drinkall et al., 24 Jul 2024) in Section 3.2 Dataset Construction