
Falcon: A Comprehensive Chinese Text-to-SQL Benchmark for Enterprise-Grade Evaluation


📄 Abstract

We introduce Falcon, a cross-domain Chinese text-to-SQL benchmark grounded in an enterprise-compatible dialect (MaxCompute/Hive). It contains 600 Chinese questions over 28 databases; 77% require multi-table reasoning and over half touch more than four tables. Each example is annotated along SQL-computation features and Chinese semantics. For evaluation, we release a robust execution comparator and an automated evaluation pipeline, under which all current state-of-the-art large-scale models (including Deepseek) achieve accuracies of at most 50%. Major errors originate from two sources: (1) schema linking in large enterprise landscapes, with hundreds of tables, denormalized fields, ambiguous column names, implicit foreign-key relations and domain-specific synonyms that make correct join/column selection difficult; and (2) mapping concise, colloquial Chinese into the exact operators and predicates required for analytics, e.g., choosing the correct aggregation and group-by keys, expressing time windows and granularities, applying unit conversions, handling NULLs and data-quality rules, and formulating nested or windowed subqueries. Falcon therefore targets Chinese-specific semantics and enterprise dialects (abbreviations, business jargon, fuzzy entity references) and provides a reproducible middle ground before full production deployment by using realistic enterprise schemas, query templates, an execution comparator, and an automated evaluation pipeline for end-to-end validation.
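
The abstract mentions an execution comparator but does not describe how it works. As a rough illustration of the general idea only (not Falcon's released comparator), the sketch below runs a predicted query and a gold query and compares their result sets as multisets. It assumes SQLite as a stand-in for MaxCompute/Hive, and the tables, queries, and function names are all hypothetical.

```python
import sqlite3
from collections import Counter


def _normalize(value, ndigits=6):
    """Round floats so tiny numeric differences do not cause a mismatch."""
    if isinstance(value, float):
        return round(value, ndigits)
    return value


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset of normalized tuples."""
    rows = conn.execute(sql).fetchall()
    return Counter(tuple(_normalize(v) for v in row) for row in rows)


def results_match(conn, predicted_sql, gold_sql):
    """Order-insensitive comparison; a failing predicted query counts as wrong."""
    try:
        predicted = run_query(conn, predicted_sql)
    except sqlite3.Error:
        return False
    return predicted == run_query(conn, gold_sql)


# Hypothetical two-table schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 20.0), (2, 10, 5.5), (3, 11, 7.0);
    INSERT INTO customers VALUES (10, 'east'), (11, 'west');
""")

gold_sql = """
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
"""
# Same answer, written differently and returned in a different row order.
pred_reordered = """
    SELECT c.region, SUM(o.amount) AS total
    FROM customers c JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.region ORDER BY total DESC
"""
# Wrong aggregation: counts orders instead of summing amounts.
pred_wrong_agg = """
    SELECT c.region, COUNT(*)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
"""

print(results_match(conn, pred_reordered, gold_sql))
print(results_match(conn, pred_wrong_agg, gold_sql))
```

Comparing multisets ignores row order, which suits aggregate questions but would be too lenient for questions that explicitly ask for a ranking; a fuller comparator would need an order-sensitive mode for those cases.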

Authors (9)

Wenzhen Luo

Wei Guan

Yifan Yao

Yimin Pan

Feng Wang

Zhipeng Yu

+3 more

Submitted

October 23, 2025

arXiv Category

cs.CL


Key Contributions

This paper introduces Falcon, a comprehensive Chinese text-to-SQL benchmark specifically designed for enterprise environments, featuring complex queries requiring multi-table reasoning. It also provides a robust evaluation framework and highlights the significant performance gap of current state-of-the-art models on such challenging enterprise data.

Business Value

Supports the development of text-to-SQL systems that give Chinese-speaking enterprise users more accurate and efficient data access, potentially democratizing data analysis and reducing reliance on specialized SQL developers.

Paper Metadata

Innovation Type

Benchmark and Evaluation Framework

Deployment Feasibility

High for the benchmark itself. Deploying models that perform well on it remains challenging, since current state-of-the-art models reach at most 50% accuracy.

Limitations Addressed

Lack of realistic, enterprise-grade Chinese text-to-SQL benchmarks, difficulty in evaluating LLMs on complex, multi-table enterprise data, and the limitations of current models in schema linking and natural language to SQL mapping.
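
To make the schema-linking difficulty concrete, here is a deliberately naive sketch, not a method from the paper. It matches business terms found in a Chinese question against column names through a synonym dictionary and reports which terms map to more than one column. The schema, the synonym dictionary, the question, and the function name `link_schema` are all hypothetical.

```python
def link_schema(question, schema, synonyms):
    """Return {term: ["table.column", ...]} for every known term in the question."""
    candidates = {}
    for term, column_names in synonyms.items():
        # Substring matching avoids the need for Chinese word segmentation.
        if term not in question:
            continue
        hits = [
            f"{table}.{column}"
            for table, columns in schema.items()
            for column in columns
            if column in column_names
        ]
        if hits:
            candidates[term] = hits
    return candidates


# Hypothetical schema with a denormalized `region` field on both tables.
schema = {
    "orders": ["order_id", "customer_id", "amount", "region"],
    "customers": ["customer_id", "region"],
}
# Hypothetical business-term dictionary.
synonyms = {
    "销售额": {"amount"},  # "sales revenue"
    "地区": {"region"},    # "region"
}
question = "各地区的销售额是多少"  # "What is the sales revenue for each region?"

links = link_schema(question, schema, synonyms)
ambiguous = {term: cols for term, cols in links.items() if len(cols) > 1}
print(links)
print(ambiguous)
```

Even in this two-table toy case the term for "region" resolves to two columns, so the linker alone cannot decide which table to group by. With hundreds of tables and implicit foreign keys, that ambiguity grows quickly.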

Performance Gains

None claimed. The benchmark instead exposes a performance gap: all evaluated state-of-the-art models reach at most 50% accuracy, indicating substantial room for improvement.

Technical Tags

text-to-SQL, Chinese NLP, benchmark, enterprise data, multi-table reasoning, MaxCompute, Hive, LLM evaluation

Research Topics

Natural Language Processing, Databases, Artificial Intelligence, Machine Learning, Benchmarking, Enterprise AI

Methods & Architectures

benchmark creation, execution comparator, automated evaluation pipeline, multi-table reasoning evaluation, Large Language Models (LLMs)

Applications & Tasks

enterprise data management, business intelligence, database querying, text-to-SQL conversion, complex query generation, enterprise data access, Chinese text-to-SQL, enterprise-grade evaluation, multi-table reasoning

Datasets & Benchmarks

Datasets

Falcon

Benchmarks

Falcon benchmark (600 Chinese questions over 28 databases)

Accuracy (at most 50% for current SOTA models)
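
The accuracy figure can be read as execution accuracy: the share of questions whose predicted query returns the same result as the gold query. With 77% of the 600 questions requiring multi-table reasoning (about 462 questions), a per-group breakdown is often more informative than a single number. The sketch below is a minimal illustration of that aggregation; the `records` data is made up for the example and does not represent Falcon results.

```python
from collections import defaultdict


def execution_accuracy(records):
    """Compute overall and per-group accuracy.

    records: iterable of (group_label, is_correct) pairs, one per question.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    per_group = {group: correct[group] / totals[group] for group in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_group


# Made-up outcomes for six questions, for illustration only.
records = [
    ("single-table", True),
    ("single-table", True),
    ("multi-table", True),
    ("multi-table", False),
    ("multi-table", False),
    ("multi-table", False),
]

overall, per_group = execution_accuracy(records)
print(f"overall: {overall:.2f}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.2f}")
```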

Related Fields

Database Systems, Computational Linguistics, Data Engineering, AI Evaluation

Keywords

text-to-SQL, Chinese, benchmark, enterprise, MaxCompute, Hive, multi-table reasoning, LLM, evaluation, natural language, database, querying, Deepseek

Academic Context

#Natural Language Processing, #Databases, #Artificial Intelligence, #Machine Learning, #Benchmarking, #Enterprise AI

Companies & Organizations

Companies Mentioned

Deepseek

Commercial Potential

Potential Products

Intelligent data querying tools for enterprises; natural language interfaces for databases; AI-powered business intelligence platforms

Target Industries

Technology, Finance, E-commerce, any industry with large enterprise databases

Use Case Examples

Allowing business users to query sales data using natural Chinese questions; automating report generation from complex enterprise databases; enabling faster data exploration for analysts

Competitive Edge

Establishes a new, challenging benchmark for text-to-SQL in the enterprise Chinese domain, highlighting the limitations of existing models and setting a new standard for evaluation.

Market Opportunity

Significant market for enterprise data management and business intelligence tools.

Revenue Models

N/A (benchmark); potential revenue from AI solutions that excel on this benchmark.

Resource Requirements

Compute Needs

High for training LLMs to perform well on this benchmark. Evaluation requires standard compute resources.

Data Requirements

The Falcon benchmark dataset itself.

Deployment Constraints

Current LLMs struggle to achieve high performance on this benchmark, limiting immediate practical deployment for complex enterprise queries.

Scalability

The benchmark is designed to be scalable to larger enterprise datasets and more complex query scenarios.

Regulatory Considerations

Data privacy and security when dealing with enterprise data.

Production Readiness

Maturity Level

Benchmark/Evaluation Standard

Time to Market

N/A (benchmark); an estimated 2-4 years before models consistently achieve high performance on it.

Patent Potential

Low for the benchmark itself, but potential for patents on novel methods developed to address the benchmark's challenges.
