
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists (2410.23331v1)

Published 30 Oct 2024 in cs.CL

Abstract: We present a benchmark for LLMs designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. By an extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that the FeatEng of our proposal can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to the existing methods.


Summary

  • The paper introduces FeatEng, a benchmark that evaluates LLM-generated feature engineering code on practical data science tasks.
  • It assesses four key dimensions: practical usability, world knowledge application, complex skill integration, and resistance to exploitation.
  • The experimental analysis shows that models like o1-preview and Gemini outperform traditional baselines, highlighting significant advances in automated feature engineering.

Evaluating LLMs as Data Scientists: An In-Depth Overview

The paper "Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists" introduces a new benchmark named FeatEng, specifically designed to assess the data engineering capabilities of LLMs. The benchmark focuses on a particularly intricate aspect of data science: feature engineering. The core of the evaluation rests on a model's ability to generate feature engineering code, which demands a robust understanding of both the domain-specific data and the overall intricacies of machine learning.

Key Contributions

The authors present a novel benchmarking approach built around four criteria: practical usability, world knowledge application, complex skill integration, and resistance to exploitation. They argue these are the crucial properties a benchmark must have to thoroughly evaluate the problem-solving capabilities of LLMs in real-world scenarios.

  1. Practical Usability: The tasks in FeatEng are anchored in real-world data science problems, ensuring that any improvement in the model’s output translates into tangible functional value. By scoring the improvement of an XGBoost model’s performance on data modified by the LLM-generated code (see the sketch after this list), the benchmark embodies the pragmatist view of knowledge as instrumentally valuable.
  2. World Knowledge Application: The benchmark requires LLMs to dynamically employ their extensive knowledge bases in context. Models must not only possess broad knowledge but also apply it effectively to the given datasets, mirroring real-world problem-solving.
  3. Complex Skill Integration: FeatEng evaluates LLMs on their ability to integrate multiple skills — such as data interpretation, reasoning, and code generation — to complete multifaceted tasks. This requirement is reflective of the need for models to combine various data processing functions seamlessly.
  4. Resistance to Exploitation: The benchmark’s design resists exploitation through memorization or reliance on dataset-specific biases: because scores are tied to genuine improvements in downstream predictive performance, models must generalize across datasets rather than recall memorized answers.
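The improvement-based scoring described in point 1 can be pictured with a small harness: fit the same XGBoost model on the original and on the transformed data and reward the relative reduction in error. The cross-validation setup, metric, and normalization below are simplified assumptions; the benchmark's exact protocol may differ.

```python
# Simplified sketch of an improvement-based score; FeatEng's exact protocol may differ.
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def improvement_score(df: pd.DataFrame, target: str, transform) -> float:
    """Relative error reduction of XGBoost after applying an LLM-generated transform."""

    def cv_error(data: pd.DataFrame) -> float:
        # Categorical handling omitted for brevity; only numeric features are used.
        X = data.drop(columns=[target]).select_dtypes(include="number")
        y = data[target]
        model = XGBRegressor(n_estimators=200)
        scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
        return -scores.mean()  # flip sign back to a plain mean squared error

    baseline = cv_error(df)
    modified = transform(df.copy())
    modified[target] = df[target].values  # ensure the target column survives the transform
    return (baseline - cv_error(modified)) / baseline  # > 0 means the features helped
```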

Experimental Insights and Comparisons

The paper evaluates a broad range of state-of-the-art LLMs on FeatEng and finds substantial variability in performance. Models such as o1-preview and Gemini score highest, representing significant improvements over traditional baselines, while several open-source models remain competitive on feature engineering tasks.

A noteworthy aspect of the paper is its comparison of FeatEng to established benchmarks such as MMLU and HumanEval. The authors argue that while these benchmarks target individual competencies, they fall short of evaluating a model’s ability to synthesize such skills within a realistic problem-solving context.

Implications and Future Directions

From a practical standpoint, FeatEng presents a comprehensive tool for assessing LLM capabilities in data science, with potential applications in automated machine learning (AutoML) systems. The ability to autonomously generate sophisticated feature transformations could streamline the feature engineering process, saving time and enhancing the overall efficiency of data science workflows.
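To make this concrete, the snippet below shows the kind of transform such a system would be expected to produce: features that encode domain knowledge (depreciation, usage intensity, brand effects) rather than generic preprocessing. It is a hypothetical model output for the used-car example sketched earlier, not a response reported in the paper.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical LLM-generated feature engineering for used-car price prediction."""
    out = df.copy()
    # Domain knowledge: depreciation is driven by vehicle age and usage intensity.
    out["age"] = pd.Timestamp.now().year - out["year"]
    out["km_per_year"] = out["mileage"] / out["age"].clip(lower=1)
    # Premium brands tend to hold their value; encode that as a simple indicator.
    premium = {"bmw", "audi", "mercedes", "lexus", "porsche"}
    out["is_premium"] = out["make"].str.lower().isin(premium).astype(int)
    return out
```

Under an improvement-based harness like the one above, such a transform earns a positive score only if these features actually reduce the XGBoost model's error.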

From a theoretical perspective, the FeatEng benchmark advances the discourse on LLM evaluation by emphasizing the integration of multiple skills in a practical context. Such an approach acknowledges the complexity of real-world data science tasks and sets a higher bar for what LLMs need to achieve to be considered genuinely capable.

Looking forward, the research suggests several avenues for future exploration. Enhancing the robustness of the benchmark by incorporating a larger and more diverse set of datasets could reduce potential biases. Additionally, there is scope for refining the evaluation metrics to offer an even more nuanced understanding of the models' performance.

In summary, the FeatEng benchmark represents a significant advancement in the domain of LLM evaluation, striving to provide a holistic view of a model's ability to perform realistic data science tasks. As LLMs continue to evolve, tools like FeatEng will be critical in developing models that are not only knowledgeable but also practically effective in real-world applications.