Large Scale Transfer Learning for Tabular Data via Language Modeling
Josh Gardner, Juan C. Perdomo, Ludwig Schmidt
2024-06-19

Summary
This paper introduces TabuLa-8B, a new language model designed specifically for making predictions on tabular data, which is data organized in rows and columns like a spreadsheet. The authors aim to improve how AI models work with this type of data by fine-tuning a large language model on a large, carefully filtered corpus of tables.
What's the problem?
Tabular data is commonly used in many fields, but AI progress there has lagged behind areas like language processing and image recognition. In those domains, foundation models can transfer what they have learned to new tasks with little or no task-specific training; for tabular data, by contrast, state-of-the-art predictors still have to be trained from scratch on labeled examples for every new table. This means that many models remain limited in their ability to make accurate predictions on structured data they have not seen before.
What's the solution?
To address this issue, the authors developed TabuLa-8B, which fine-tunes a large language model called Llama 3-8B specifically for tabular prediction tasks (classification and binned regression). They created a high-quality training dataset from the TabLib corpus, comprising over 1.6 billion rows from 3.1 million unique tables, and introduced a novel packing and attention scheme so the model can train efficiently on many rows and tables per sequence. The results showed that TabuLa-8B makes accurate predictions on entirely unseen tables: its zero-shot accuracy is more than 15 percentage points above random guessing, something existing tabular prediction models cannot do at all. In the few-shot setting (1 to 32 labeled examples), and without any fine-tuning on the target datasets, it is 5 to 15 percentage points more accurate than leading models like XGBoost and TabPFN, even when those models are trained on equal or up to 16 times more data. A sketch of the underlying idea appears below.
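To make this concrete, here is a minimal sketch of the core idea: table rows are serialized into text so that a language model can read a handful of labeled example rows in its prompt and then complete the label of a new row. The helper names and the exact serialization format below are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of few-shot tabular prediction via text serialization.
# TabuLa-8B's real serialization format may differ; this only illustrates the
# general approach of turning rows into text that an LLM can complete.

def serialize_row(row: dict, target: str) -> str:
    """Render one row as 'The <column> is <value>' text, ending at the target column."""
    features = ". ".join(f"The {k} is {v}" for k, v in row.items() if k != target)
    return f"{features}. The {target} is"

def build_few_shot_prompt(examples: list[dict], query: dict, target: str) -> str:
    """Concatenate labeled example rows, then the unlabeled query row."""
    shots = "\n".join(serialize_row(r, target) + f" {r[target]}." for r in examples)
    return shots + "\n" + serialize_row(query, target)

examples = [
    {"age": 39, "education": "Bachelors", "hours_per_week": 40, "income": ">50K"},
    {"age": 23, "education": "HS-grad", "hours_per_week": 20, "income": "<=50K"},
]
query = {"age": 45, "education": "Masters", "hours_per_week": 50, "income": None}
print(build_few_shot_prompt(examples, query, "income"))
```

The model's next-token continuation of the final line (here, ">50K" or "<=50K") serves as its prediction, which is why no gradient updates on the target table are needed.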
Why it matters?
This research is important because it represents a significant step forward in using AI for tabular data analysis. By improving how models can learn from structured data, TabuLa-8B could help make better predictions in various applications, such as finance, healthcare, and marketing. This advancement could lead to more effective decision-making tools that utilize the vast amounts of tabular data available in many industries.
Abstract
Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. We release our model, code, and data along with the publication of this paper.
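For readers curious what the "packing and attention scheme" might look like mechanically, here is a minimal sketch. It assumes that packing means placing rows from multiple tables into a single training sequence, and that the attention mask combines the usual causal mask with a same-table constraint so tokens never attend across table boundaries; the paper's actual scheme may differ in its details.

```python
import numpy as np

def packed_causal_mask(table_ids: np.ndarray) -> np.ndarray:
    """Boolean [T, T] mask: True where query token i may attend to key token j.

    Assumption: standard causal masking (j <= i) intersected with a block mask
    that restricts attention to tokens from the same packed table.
    """
    T = len(table_ids)
    causal = np.tril(np.ones((T, T), dtype=bool))          # token j precedes token i
    same_table = table_ids[:, None] == table_ids[None, :]  # both tokens share a table
    return causal & same_table

# Example: two tables packed into one 6-token sequence.
ids = np.array([0, 0, 0, 1, 1, 1])
print(packed_causal_mask(ids).astype(int))
# Tokens of the second table (positions 3-5) cannot attend back into the first.
```

A mask like this lets many short tables share one long sequence without padding waste, while preventing the model from mixing information between unrelated tables during training.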