LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL
Dzmitry Pihulski, Karol Charchut, Viktoria Novogrodskaia, Jan Kocoń
2025-10-07
Summary
This paper introduces LLMSQL, a revamped version of the older WikiSQL dataset, specifically designed to better test and train modern AI models that convert everyday questions into SQL database queries.
What's the problem?
The original WikiSQL dataset, while helpful in the past, has several issues that make it hard to use with today’s powerful language models. These include inconsistent capitalization, mismatched data types, syntax errors in the SQL code itself, and questions that don't actually have answers within the database. As a result, scores on this messy dataset don't accurately reflect how well these AI models are *really* doing.
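To make two of these error types concrete, here is a small sketch in Python using SQLite. The table, its rows, and the queries are invented for illustration and are not taken from WikiSQL; they only show how a capitalization mismatch or a number stored as text can make a reasonable-looking query return the wrong rows.

```python
import sqlite3

# Hypothetical table: "points" is (wrongly) declared as TEXT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, position TEXT, points TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ann Lee", "Guard", "21"), ("Bo Kim", "Center", "9")],
)


def run(sql):
    """Execute a query and return the rows as a sorted list."""
    return sorted(conn.execute(sql).fetchall())


# Capitalization mismatch: the query says 'guard', the table stores 'Guard'.
# SQLite compares text case-sensitively by default, so nothing matches.
case_bad = run("SELECT name FROM players WHERE position = 'guard'")
case_ok = run("SELECT name FROM players WHERE position = 'guard' COLLATE NOCASE")

# Data type mismatch: numbers stored as text are compared as strings,
# so '9' sorts after '10' and the low scorer slips through the filter.
type_bad = run("SELECT name FROM players WHERE points > 10")
type_ok = run("SELECT name FROM players WHERE CAST(points AS REAL) > 10")

print(case_bad, case_ok)
print(type_bad, type_ok)
```

In both cases the query runs without error, which is why problems of this kind can silently distort benchmark scores.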
What's the solution?
The researchers systematically went through the WikiSQL dataset, classified the types of errors they found, and then used automated methods to clean up the data and re-annotate it. They didn't just update WikiSQL; they created LLMSQL to be a better benchmark for evaluating large language models. LLMSQL presents questions and full SQL queries as plain text, making it easier for AI to learn to *generate* SQL, rather than just select tokens from the input as older pointer-network models did.
Why does it matter?
This is important because it provides a reliable way to measure the progress of AI in understanding and interacting with databases. A good benchmark like LLMSQL helps researchers develop better AI models that let people without coding knowledge get information from databases using just plain English.
Abstract
Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.
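Because questions and queries are plain text, a generated query can be scored by running it against the database and comparing its result with that of the reference query. The sketch below illustrates this idea in Python with SQLite. It is an assumption about how such an evaluation might look, not the paper's actual protocol; the function name `execution_match`, the `cities` table, and the example queries are all hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as incorrect.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Counter ignores row order but keeps duplicate rows distinct.
    return Counter(predicted_rows) == Counter(gold_rows)


# Hypothetical toy database standing in for one benchmark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Wroclaw", "Poland", 670000), ("Krakow", "Poland", 800000),
     ("Lyon", "France", 520000)],
)

gold = "SELECT name FROM cities WHERE country = 'Poland'"
predictions = [
    "SELECT name FROM cities WHERE country = 'Poland' ORDER BY name",  # same rows
    "SELECT name FROM cities WHERE population > 600000 AND country = 'France'",  # wrong rows
    "SELECT nam FROM cities",  # invalid column, fails to execute
]

results = [execution_match(conn, p, gold) for p in predictions]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Comparing result sets rather than query strings means that differently worded but equivalent SQL is still counted as correct, which suits models that generate whole queries freely.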