
LLM-Select: Feature Selection with Large Language Models (2407.02694v2)

Published 2 Jul 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: In this paper, we demonstrate a surprising capability of LLMs: given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., "blood pressure") in predicting an outcome of interest (e.g., "heart failure"), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could benefit practitioners in domains like healthcare and the social sciences, where collecting high-quality data comes at a high cost.

Citations (6)

Summary

  • The paper introduces three LLM-based methods (LLM-Score, LLM-Rank, and LLM-Seq) to quantify feature importance without direct data access.
  • The proposed techniques consistently match or surpass traditional methods like LASSO and MRMR across multiple diverse datasets.
  • LLM-Select demonstrates that LLMs can leverage generalized world knowledge to revolutionize feature selection in data-restricted environments.

Insightful Overview of "LLM-Select: Feature Selection with LLMs"

In the paper titled "LLM-Select: Feature Selection with LLMs," the authors present a novel approach to feature selection in supervised learning tasks using LLMs such as GPT-4. They demonstrate that these models can identify the most predictive features in a dataset without direct access to the data itself. The paper explores this intriguing capability, highlighting how LLMs can act as domain-agnostic feature selectors, a role traditionally dominated by data-driven statistical methods.

Methodology and Key Contributions

The paper introduces three distinct approaches for employing LLMs in feature selection: (i) LLM-Score, (ii) LLM-Rank, and (iii) LLM-Seq. Each method leverages the intrinsic knowledge captured by LLMs to estimate the importance or ranking of features based on their names and the prediction task at hand. Specifically:

  1. LLM-Score prompts the LLM to output a numerical importance score for each feature independently (a minimal prompting sketch follows this list).
  2. LLM-Rank directly elicits a ranked ordering of all features by predictive utility.
  3. LLM-Seq mirrors sequential feature selection strategies, iteratively adding features based on LLM recommendations to maximize predictive power.
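
The following is a minimal sketch of the LLM-Score pattern, assuming an OpenAI-style chat-completions client; the prompt wording, model name, and score parsing are illustrative stand-ins rather than the authors' exact implementation.

```python
import re

from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY env var

client = OpenAI()

def llm_score(feature: str, task: str, model: str = "gpt-4") -> float:
    """Query the LLM for a 0-1 importance score for a single feature."""
    prompt = (
        f'On a scale from 0 to 1, rate the importance of the feature "{feature}" '
        f"for predicting {task}. Respond with a single number only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation in the scores
    )
    reply = response.choices[0].message.content
    match = re.search(r"\d*\.?\d+", reply)  # extract the first number in the reply
    return float(match.group()) if match else 0.0

# Score each candidate feature independently, then keep the top-k.
features = ["blood pressure", "cholesterol", "favorite color"]
scores = {f: llm_score(f, "heart failure") for f in features}
print(sorted(scores, key=scores.get, reverse=True)[:2])
```

LLM-Rank and LLM-Seq follow the same querying pattern but instead ask for a full ranking of the features, or iteratively ask which feature to add next, respectively.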

A noteworthy aspect is how far these methods depart from traditional data-driven processes: they perform reliably across a variety of datasets without ever examining individual data instances.

Empirical Validation

The efficacy of these methods is substantiated by extensive experiments on datasets spanning multiple domains, including healthcare, finance, and publicly available repositories published after the LLMs' training cut-off dates. A consistent theme across the results is that LLM-based feature selection is competitive with established methods like LASSO and MRMR. In particular, larger models such as GPT-4 exhibit robust feature selection across query mechanisms and prompting strategies, supporting the hypothesis that selection quality improves with model scale.
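
The comparison protocol reduces to selecting the top-k features under each method and training the same downstream model on each subset. Below is a minimal sketch of that protocol, assuming scikit-learn; the dataset is an illustrative stand-in, and the random LLM scores are a placeholder for real LLM-Score outputs.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative stand-in dataset; the paper evaluates on many real-world tables.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
k = 10

# Data-driven baseline: rank features by |LASSO coefficient|.
lasso = LassoCV(cv=5).fit(X_train, y_train)
lasso_top = np.argsort(-np.abs(lasso.coef_))[:k]

# LLM-based selection: in practice these scores come from LLM-Score queries
# over the feature names; random values are used here purely as a placeholder.
rng = np.random.default_rng(0)
llm_scores = rng.random(X.shape[1])
llm_top = np.argsort(-llm_scores)[:k]

# Train the same downstream model on each feature subset and compare accuracy.
for name, idx in [("lasso", lasso_top), ("llm", llm_top)]:
    clf = LogisticRegression(max_iter=5000).fit(X_train[:, idx], y_train)
    print(name, clf.score(X_test[:, idx], y_test))
```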

Theoretical and Practical Implications

The research carries significant theoretical implications. It suggests that LLMs encode a form of generalized knowledge about the world that can be harnessed for predictive modeling tasks, challenging the conventional belief that direct data analysis is necessary for effective feature selection. Practically, this could revolutionize feature selection in scenarios where data access is restricted due to privacy or computational constraints, such as in sensitive healthcare environments.

Future Directions

The authors highlight several areas for future exploration, such as mitigating inherent biases in pre-trained LLM outputs, especially for bias-sensitive applications like healthcare or legal analysis. Furthermore, the integration of LLM-driven and data-driven feature selection methods could enhance prediction robustness while leveraging the intuitive knowledge encapsulated in LLMs.
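
One simple way to realize that integration is rank aggregation: order the features under the LLM-based scores and the data-driven scores separately, then select by average rank. The sketch below is a hypothetical illustration of this idea, not a method from the paper.

```python
import numpy as np

def combined_ranking(llm_scores: np.ndarray, data_scores: np.ndarray) -> np.ndarray:
    """Order feature indices by the average of two rankings (hypothetical scheme)."""
    llm_rank = np.argsort(np.argsort(-llm_scores))   # rank 0 = most important
    data_rank = np.argsort(np.argsort(-data_scores))
    return np.argsort((llm_rank + data_rank) / 2.0)  # ascending mean rank

# Example: five features scored by an LLM and by |LASSO coefficient|.
order = combined_ranking(np.array([0.9, 0.2, 0.7, 0.4, 0.1]),
                         np.array([0.5, 0.3, 0.8, 0.1, 0.2]))
print(order)  # indices from most to least important under the blend
```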

In conclusion, "LLM-Select: Feature Selection with LLMs" represents a significant contribution to the machine learning community, showcasing the versatile applications of LLMs beyond natural language processing. As LLMs continue to evolve, they may redefine approaches to feature selection, offering a blend of theoretical elegance and practical efficacy.
