Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists

Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki

2024-10-31
Summary

This paper introduces a new benchmark for evaluating large language models (LLMs) in the context of data science, specifically focusing on their ability to write feature engineering code.

What's the problem?

Writing feature engineering code is a complex task that requires not only an understanding of the data but also domain knowledge. Existing methods for evaluating LLMs in this area are limited and rarely measure how well these models can generate genuinely useful code for transforming datasets. This gap makes it difficult to assess the true capabilities of LLMs in data science applications.

What's the solution?

The authors propose a benchmark called FeatEng, in which LLMs are asked to generate Python functions that transform a dataset based on its description. The quality of the generated code is evaluated by measuring how much it improves the performance of an XGBoost model trained on the transformed dataset compared to one trained on the original. This approach provides a more comprehensive assessment of LLMs' feature engineering abilities and enables better comparisons across models; a minimal sketch of the scoring idea is shown below.
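To make the scoring idea concrete, here is a minimal sketch, assuming a classification task and a simple train/test split. The function name `transform_features`, the split, and the hyperparameters are illustrative assumptions, not the authors' actual evaluation harness.

```python
# Sketch of FeatEng-style scoring (assumptions, not the official harness):
# an LLM-written transform_features() is applied to a dataset, and the score
# is the accuracy gain of XGBoost on the transformed data over the raw data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier


def evaluate_transform(df: pd.DataFrame, target: str, transform_features) -> float:
    """Return the accuracy improvement achieved by the LLM-generated transform."""

    def fit_and_score(data: pd.DataFrame) -> float:
        X, y = data.drop(columns=[target]), data[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0
        )
        model = XGBClassifier(n_estimators=100, verbosity=0)
        model.fit(X_train, y_train)
        return accuracy_score(y_test, model.predict(X_test))

    baseline = fit_and_score(df)                                 # XGBoost on the original data
    improved = fit_and_score(transform_features(df.copy()))      # XGBoost on the modified data
    return improved - baseline                                   # positive = useful features
```

The key design choice this illustrates is that code quality is judged by downstream model improvement rather than by code similarity or human judgment.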

Why it matters?

This research is significant because it helps improve our understanding of how well LLMs can assist in data science tasks. By creating a standardized way to evaluate these models, the study can lead to better tools and techniques for data scientists, ultimately making it easier to analyze and work with complex datasets.

Abstract

We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science: writing feature engineering code, which requires domain knowledge in addition to a deep understanding of the underlying problem and data structure. The model is provided with a dataset description in a prompt and asked to generate code transforming it. The evaluation score is derived from the improvement achieved by an XGBoost model fit on the modified dataset compared to the original data. Through an extensive evaluation of state-of-the-art models and comparison to well-established benchmarks, we demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs, in contrast to existing methods.