Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration

Abstract

Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark's difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench.
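The three-party interaction described in the abstract (simulated user, model under test, executable database) can be pictured with a minimal Python sketch. This is not the benchmark's actual harness: `run_dialogue`, `toy_model`, the scripted user turns, the `sales` table, and the final-result comparison are all hypothetical stand-ins chosen for illustration. The LLM-simulated user is replaced by a fixed list of utterances, and SQLite stands in for the executable database.

```python
import sqlite3

def run_dialogue(conn, user_turns, model, gold_sql):
    """Feed each user turn to `model`, execute its SQL, and check the final result."""
    history = []  # (utterance, sql, feedback) triples the model can condition on
    rows = []
    for utterance in user_turns:
        sql = model(utterance, history)
        try:
            rows = conn.execute(sql).fetchall()
            feedback = rows
        except sqlite3.Error as exc:
            # Execution errors are fed back instead of aborting the dialogue.
            rows, feedback = [], f"error: {exc}"
        history.append((utterance, sql, feedback))
    gold = conn.execute(gold_sql).fetchall()
    # Assumed success criterion: the last result matches the gold result.
    return sorted(rows) == sorted(gold), history

def make_demo_db():
    """Hypothetical toy database, not taken from BIRD or Spider 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("EU", 2023, 100.0), ("EU", 2024, 150.0), ("US", 2024, 200.0)],
    )
    return conn

def toy_model(utterance, history):
    """Rule-based stand-in for the model under test."""
    if "2024" in utterance:
        return ("SELECT region, SUM(amount) FROM sales "
                "WHERE year = 2024 GROUP BY region")
    return "SELECT region, SUM(amount) FROM sales GROUP BY region"

conn = make_demo_db()
turns = ["Total sales per region?", "Only for 2024, please."]  # the intent changes in turn 2
gold = "SELECT region, SUM(amount) FROM sales WHERE year = 2024 GROUP BY region"
passed, history = run_dialogue(conn, turns, toy_model, gold)
print(passed)
```

In this sketch the second turn narrows the first query to a single year, which mirrors the kind of constraint adjustment the abstract mentions. A real harness would replace `toy_model` with an LLM call and generate the user turns dynamically.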

Authors (9)

Linzhuang Sun

Tianyu Guo

Hao Liang

Yuying Li

Qifeng Cai

Jingxuan Wei

+3 more

Submitted

October 30, 2025

arXiv Category

cs.DB

Key Contributions

This paper introduces DySQL-Bench, a novel benchmark for evaluating dynamic, multi-turn Text-to-SQL capabilities, addressing the limitations of static, single-turn systems in real-world interactive scenarios. The benchmark is built using an automated pipeline for task synthesis and verification, ensuring correctness and relevance for assessing evolving user intents in database exploration.

Business Value

Enables more intuitive and efficient data exploration for business users by allowing them to interact with databases using natural language, iteratively refining their queries based on intermediate results. This can lead to faster insights and better decision-making in finance and business analytics.

Paper Metadata

Innovation Type

Benchmark Creation

Deployment Feasibility

High, as it focuses on improving the interaction layer for existing database systems and LLMs, making it adaptable to current infrastructure.

Limitations Addressed

Static, single-turn Text-to-SQL systems that fail in real-world interactive scenarios where user intents evolve and queries need refinement over multiple turns.

Technical Tags

Text-to-SQL, SQL Generation, Multi-turn Interaction, Database Exploration, LLM Benchmarking, Task Synthesis, Natural Language Interface, Interactive Systems

Research Topics

Natural Language Processing, Database Management, Artificial Intelligence, Machine Learning, Human-Computer Interaction

Methods & Architectures

LLM-based task generation, Two-stage pipeline (task synthesis and verification), Tree representations, Human evaluation, Large Language Models (LLMs)
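The paper summary does not spell out how the tree representations are derived from raw tables. One plausible reading, shown here purely as an assumption, is that a parent row becomes the root and its related rows from child tables are nested beneath it via foreign keys. The function `build_tree`, the `customers` and `orders` tables, and the column names are hypothetical.

```python
import sqlite3

def build_tree(conn, root_table, key_col, key_value, child_fks):
    """Return one root row as a dict, with related child rows nested under it.

    `child_fks` maps a child table name to its foreign-key column.
    Table and column names are interpolated directly, so they must be trusted.
    """
    cur = conn.execute(
        f"SELECT * FROM {root_table} WHERE {key_col} = ?", (key_value,)
    )
    cols = [d[0] for d in cur.description]
    row = cur.fetchone()
    if row is None:
        return None
    node = dict(zip(cols, row))
    for child_table, fk_col in child_fks.items():
        cur = conn.execute(
            f"SELECT * FROM {child_table} WHERE {fk_col} = ? ORDER BY rowid",
            (key_value,),
        )
        child_cols = [d[0] for d in cur.description]
        node[child_table] = [dict(zip(child_cols, r)) for r in cur.fetchall()]
    return node

# Hypothetical two-table schema used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 40.0)])

tree = build_tree(conn, "customers", "id", 1, {"orders": "customer_id"})
print(tree)
```

A nested record like this could then be serialized into a prompt so that a task generator sees one entity together with its related rows, rather than disconnected flat tables.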

Applications & Tasks

Finance, Business Analytics, Database Querying, Interactive Query Refinement, Evolving User Intents, Real-world Database Exploration, Text-to-SQL, Multi-turn SQL Interaction

Datasets & Benchmarks

Datasets

DySQL-Bench

Correctness (human evaluation)

Related Fields

Database Systems, Natural Language Understanding, Human-Computer Interaction, Information Retrieval

Keywords

Text-to-SQL, SQL Generation, Multi-turn Dialogue, Interactive Querying, Database Exploration, LLM Benchmarking, Natural Language Interface, Dynamic Systems, Task Synthesis, User Intent Evolution, Business Intelligence, Financial Analytics

Academic Context

#Natural Language Processing, #Database Management, #Artificial Intelligence, #Machine Learning, #Human-Computer Interaction

Commercial Potential

Potential Products

Interactive SQL query builders, Natural language database interfaces, Business intelligence tools

Target Industries

Finance, Business Analytics, E-commerce, Healthcare

Use Case Examples

A business analyst iteratively exploring sales data by asking follow-up questions to refine a report. A financial analyst querying a database to understand market trends over multiple steps.

Competitive Edge

Positions itself as a more realistic evaluation framework for Text-to-SQL systems compared to existing static benchmarks, focusing on the interactive and evolving nature of real-world database exploration.

Market Opportunity

Large, as efficient database interaction is crucial across many industries.

Revenue Models

Indirectly through improved tools and services for data analysis.

Resource Requirements

Compute Needs

Likely moderate to high for training/fine-tuning LLMs, but inference requirements depend on the specific LLM used.

Data Requirements

Requires structured database schemas and natural language queries, with a focus on multi-turn interactions.

Deployment Constraints

Integration with existing database systems and user interfaces.

Scalability

Scalability depends on the underlying LLM and the complexity of the database schemas and query interactions.

Regulatory Considerations

Data privacy and security when accessing sensitive databases.

Production Readiness

Maturity Level

Research/Development

Time to Market

N/A (benchmark)

Patent Potential

Low, as it is primarily a benchmark and methodology.
