Text2SQL is Not Enough: Unifying AI and Databases with TAG

Asim Biswal, Liana Patel, Siddarth Jha, Amog Kamsetty, Shu Liu, Joseph E. Gonzalez, Carlos Guestrin, Matei Zaharia

2024-08-28

Summary

This paper introduces TAG, a new approach for combining artificial intelligence (AI) with databases to answer natural language questions more effectively than existing methods.

What's the problem?

Current systems that translate natural language questions into database queries, such as Text2SQL, only handle questions that can be expressed in relational algebra, which is a small subset of what real users want to ask. Retrieval-Augmented Generation (RAG) is limited in a different way: it only covers questions that can be answered by looking up one or a few records in the database. Both approaches are therefore less useful for real-world applications where users want to ask a wide variety of questions about their data.

What's the solution?

The authors propose Table-Augmented Generation (TAG), a unified, general-purpose paradigm that allows for more complex interactions between AI language models and databases. TAG expands the types of questions that can be answered by combining the language model's world knowledge and reasoning with the database's computation over the data. They also created benchmarks to evaluate how well standard methods handle TAG-style questions, finding that those methods answer no more than 20% of queries correctly.

Why does it matter?

This research is important because it aims to enhance how we interact with databases using natural language, making it easier for people to get the information they need without having technical knowledge. By improving the capabilities of AI systems in this area, TAG could lead to better decision-making and insights across various fields, from business analytics to scientific research.

Abstract

AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited subset of queries that can be answered with point lookups to one or a few data records within the database. We propose Table-Augmented Generation (TAG), a unified and general-purpose paradigm for answering natural language questions over databases. The TAG model represents a wide range of interactions between the LM and database that have been previously unexplored and creates exciting research opportunities for leveraging the world knowledge and reasoning capabilities of LMs over data. We systematically develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly, confirming the need for further research in this area. We release code for the benchmark at https://github.com/TAG-Research/TAG-Bench.
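To make the gap concrete, here is a minimal sketch of a question that needs both database computation and world knowledge, under several assumptions. The split into query synthesis, query execution, and answer generation is used here for illustration only; the `stores` table, its toy values, and every function name are hypothetical, and the "LM" is a hard-coded lookup standing in for a real language model call. The table has no continent column, so the question cannot be expressed as SQL over this schema alone, and because the answer aggregates over several rows, a point lookup does not suffice either.

```python
import sqlite3

# Hypothetical stand-in for an LM's world knowledge (assumption: a real
# system would call a language model instead of this hard-coded lookup).
FAKE_LM_KNOWLEDGE = {
    "Paris": "Europe",
    "Berlin": "Europe",
    "Tokyo": "Asia",
    "Lima": "South America",
}


def lm_continent_of(city: str) -> str:
    """Return the continent the stub 'LM' associates with a city."""
    return FAKE_LM_KNOWLEDGE.get(city, "unknown")


def build_db() -> sqlite3.Connection:
    """Create an in-memory table of toy data; note there is no continent column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stores (city TEXT, sales INTEGER)")
    conn.executemany(
        "INSERT INTO stores VALUES (?, ?)",
        [("Paris", 10), ("Berlin", 20), ("Tokyo", 30), ("Lima", 40)],
    )
    return conn


def synthesize_query(request: str) -> str:
    """Step 1 (query synthesis): fixed here; an LM would generate this."""
    return "SELECT city, sales FROM stores"


def execute_query(conn: sqlite3.Connection, query: str) -> list:
    """Step 2 (query execution): the database computes over the stored data."""
    return conn.execute(query).fetchall()


def generate_answer(request: str, rows: list) -> str:
    """Step 3 (answer generation): combine the rows with the stub LM's knowledge."""
    total = sum(sales for city, sales in rows if lm_continent_of(city) == "Europe")
    return f"Total sales in European cities: {total}"


def answer(request: str) -> str:
    """Run the three illustrative steps end to end on the toy database."""
    conn = build_db()
    try:
        rows = execute_query(conn, synthesize_query(request))
        return generate_answer(request, rows)
    finally:
        conn.close()


if __name__ == "__main__":
    print(answer("What are the total sales of stores in European cities?"))
```

In this sketch the knowledge-dependent filter runs after the SQL query, but that placement is only one possible design; the same filter could in principle be applied during query execution instead.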
