
BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan

2025-10-08

Summary

This paper introduces a new way to test how well computer programs can turn natural language questions into commands for databases, one that focuses on realistic, back-and-forth conversations instead of single, standalone questions.

What's the problem?

Current tests for these programs don't accurately reflect how databases are used in the real world. Real database interactions aren't one question at a time; they involve clarifying ambiguous requests, fixing errors that come up when commands are run, and adapting to changing needs. Existing tests either treat the past conversation as static background information or only let programs *read* data, never change it, which doesn't capture what a helpful database assistant actually has to do.

What's the solution?

The researchers created a new benchmark called BIRD-INTERACT. It simulates a realistic environment where programs can ask for clarification, look up information about the database, and recover from errors on their own, without a human needing to step in. It includes two testing modes: c-Interact, where the program follows a set script of questions and answers, and a-Interact, where the program itself decides when to ask questions or explore the database (sketched below). They also created a large set of tasks covering all the basic database operations, that is, creating, reading, updating, and deleting data (CRUD), each guarded by executable tests that automatically check whether the program's answers are correct.
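To make the a-Interact setting concrete, here is a minimal sketch of such an agentic loop in Python. Everything in it is assumed for illustration: the `UserSimulator`, `ask_model`, and `run_episode` names and the turn-budget shape are hypothetical stand-ins, not BIRD-INTERACT's actual interface.

```python
# A minimal sketch of an a-Interact-style agent loop. The class and function
# names (UserSimulator, ask_model, run_episode) are hypothetical stand-ins,
# not BIRD-INTERACT's real API; they only illustrate an agent that decides
# for itself when to clarify, when to execute, and when to submit.
import sqlite3


class UserSimulator:
    """Stand-in for the paper's function-driven user simulator."""

    def clarify(self, question: str) -> str:
        # The real simulator would answer from the task's hidden intent;
        # here we return a canned clarification.
        return "Only count orders placed in 2024."


def ask_model(history: list[str]) -> dict:
    # Placeholder for an LLM call: a real agent would send `history`
    # to a model and parse its chosen action from the reply.
    if not any(msg.startswith("user:") for msg in history):
        return {"action": "ask_user",
                "text": "Which year should I filter on?"}
    return {"action": "submit",
            "sql": ("SELECT COUNT(*) FROM orders "
                    "WHERE placed_at LIKE '2024%'")}


def run_episode(task: str, db: sqlite3.Connection,
                user: UserSimulator, budget: int = 5):
    """Interact until the agent submits an answer or the turn budget runs out."""
    history = [f"task: {task}"]
    for _ in range(budget):
        step = ask_model(history)
        if step["action"] == "ask_user":
            # Clarifying spends a turn, so the agent must judge when
            # a question is worth the cost.
            history.append("user: " + user.clarify(step["text"]))
        else:  # "submit"
            try:
                return db.execute(step["sql"]).fetchall()
            except sqlite3.Error as err:
                # Execution errors feed back into the history so the
                # agent can try to recover on a later turn.
                history.append(f"error: {err}")
    return None  # Budget exhausted without a successful submission.


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, placed_at TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, "2024-03-01"), (2, "2023-12-30")])
    print(run_episode("Count the orders.", conn, UserSimulator()))  # [(1,)]
```

In a setup like this, clarification and exploration consume the same turn budget as submission attempts, so the agent has to judge whether an interaction is worth its cost; that trade-off is what separates the agentic setting from the scripted one.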

Why it matters?

This work is important because it shows that even state-of-the-art language models struggle with these more complex, interactive database tasks: GPT-5 completes only 8.67% of tasks in the conversational (c-Interact) setting and 17.00% in the agentic (a-Interact) setting. The results emphasize that simply being good at answering single questions isn't enough; programs need to be able to effectively *interact* with users and the database to be truly useful in real-world applications.

Abstract

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT's difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.
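The abstract says each task is "guarded by executable test cases" but doesn't spell out their form. As a loose illustration, assuming an invented schema and checker shape, a test case for a write (UPDATE) task might replay the model's SQL on a fresh copy of the database and assert the resulting state:

```python
# A loose illustration of an executable test case for a write (UPDATE) task.
# The employees schema, the expected post-state, and the checker shape are
# all invented for this sketch; the paper only states that tasks are
# guarded by executable test cases.
import sqlite3


def fresh_db() -> sqlite3.Connection:
    """Rebuild a tiny throwaway database so every check starts from a known state."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (id INTEGER, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)",
                     [(1, 50000), (2, 60000)])
    return conn


def check_update_task(predicted_sql: str) -> bool:
    """Pass iff the predicted SQL leaves the database in the expected state."""
    conn = fresh_db()
    try:
        conn.executescript(predicted_sql)  # Apply the model's write.
    except sqlite3.Error:
        return False  # SQL that fails to execute fails the test case.
    rows = conn.execute(
        "SELECT id, salary FROM employees ORDER BY id").fetchall()
    # Expected post-state for this (invented) task: a flat 5,000 raise.
    return rows == [(1, 55000), (2, 65000)]


if __name__ == "__main__":
    print(check_update_task("UPDATE employees SET salary = salary + 5000"))  # True
    print(check_update_task("DELETE FROM employees"))                        # False
```

Checking the post-state rather than the SQL string is the usual rationale for execution-based evaluation: two syntactically different statements that produce the same database state both pass.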