Develop oversampling methodology for high-dimensional text embeddings

Develop a principled adaptation of synthetic minority oversampling techniques, such as SMOTE, for the high-dimensional text embedding features used in text-based forecasting tasks, so that class imbalance can be addressed without distorting the embedding space or introducing artifacts. Identify an approach that is compatible with the sentence- or document-level embeddings employed in credit rating forecasting.

Background

The credit rating forecasting task exhibits severe class imbalance, with the vast majority of credit ratings remaining unchanged. While oversampling methods like SMOTE are commonly adopted in credit risk prediction with numerical features, the authors note that there is no agreed approach for applying them to high-dimensional text embeddings.

Because of this methodological gap, the authors avoided oversampling for text features and instead balanced classes by downsampling, reducing dataset size and potentially limiting model choices and performance. A robust oversampling technique for embeddings would enable more faithful use of textual signals without sacrificing sample size.
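For orientation, the interpolation step at the core of standard SMOTE can be sketched as below, applied naively to embedding vectors. This is not the authors' method and not a proposed solution to the open problem; it only shows what a direct transfer from numerical features would look like. The function name `smote_naive`, its parameters, and the toy 4-dimensional vectors are hypothetical, and the code uses only the Python standard library to stay self-contained.

```python
import math
import random

def smote_naive(minority, n_new, k=5, seed=0):
    """Create n_new synthetic vectors by interpolating between a randomly
    chosen minority vector and one of its k nearest minority neighbours
    (Euclidean distance). Hypothetical helper for illustration only."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank all other minority samples by distance to the base vector.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy unit-norm "embeddings" (assumed data); real sentence or document
# embeddings typically have hundreds of dimensions.
minority = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
new_points = smote_naive(minority, n_new=4, k=2)
```

Even in this toy case, each synthetic point lies on a straight line between two unit-norm vectors, so its norm is at most 1 and generally below it. If the embeddings are normalized, the synthetic samples therefore leave the surface the real samples occupy, which is one concrete form the distortion concern could take.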

References

There is no consensus on how SMOTE could be applied to high-dimensional text embeddings.

Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs (2407.17624 - Drinkall et al., 24 Jul 2024) in Section 3.2 Dataset Construction