Tabby: Tabular Data Synthesis with Language Models
Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, Frederic Sala
2025-03-05

Summary
This paper introduces Tabby, a method for generating realistic synthetic tabular data, such as spreadsheets or database tables, using large language models.
What's the problem?
Large language models have become very good at generating text, but generating realistic tabular data has received far less attention, even though it matters for tasks like data analysis and simulation. Existing methods often fall short of the quality of real data and are not designed specifically for tables.
What's the solution?
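The researchers developed Tabby, a post-training modification to the standard Transformer language model that adapts it to tabular data. It uses a technique called Gated Mixture-of-Experts, which gives each column its own set of parameters so the model can represent differences between columns. They also created a training technique for tables called Plain. Tabby produces synthetic data that is close to or on par with real data in quality, and pairing it with Plain improves quality by up to 44% over previous methods. The approach also extends beyond tables to other structured data, such as nested JSON.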
Why does it matter?
This matters because it makes it easier to generate high-quality fake data for research or testing without needing access to sensitive real-world datasets. Tabby could help fields like healthcare, finance, and science by providing realistic data while protecting privacy.
Abstract
While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby enables the representation of differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. By pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset as well.
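To make the idea of "column-specific sets of parameters" concrete, here is a minimal sketch of a hard-gated mixture of experts in which the gate routes each hidden vector to the expert for the column being generated. It is an illustration under stated assumptions, not the authors' implementation: the class names `ColumnExpert` and `ColumnGatedMoE`, the column names, and the dimensions are all hypothetical, and a plain linear layer stands in for whatever Transformer block the experts replace in Tabby.

```python
import random


class ColumnExpert:
    """One expert: a small linear layer owned by a single column (hypothetical)."""

    def __init__(self, dim_in, dim_out, rng):
        # Each expert gets its own randomly initialised weights.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
            for _ in range(dim_out)
        ]
        self.bias = [0.0] * dim_out

    def forward(self, hidden):
        # Plain matrix-vector product plus bias.
        return [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.weights, self.bias)
        ]


class ColumnGatedMoE:
    """Routes a hidden vector to the expert of the current column.

    Assumption: the gate is a hard, one-hot choice keyed by column name.
    """

    def __init__(self, column_names, dim_in, dim_out, seed=0):
        rng = random.Random(seed)
        self.experts = {
            name: ColumnExpert(dim_in, dim_out, rng) for name in column_names
        }

    def forward(self, hidden, column):
        # Only the selected column's parameters are used.
        return self.experts[column].forward(hidden)


moe = ColumnGatedMoE(["age", "income", "label"], dim_in=4, dim_out=3)
hidden = [1.0, 0.5, -0.5, 2.0]

age_out = moe.forward(hidden, "age")
income_out = moe.forward(hidden, "income")
```

The same hidden vector yields different outputs for `"age"` and `"income"` because each column is served by separate parameters, which is the property the abstract attributes to Tabby's column-specific experts.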