
DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model


Abstract

Text-to-SQL performance has improved markedly since the release of ChatGPT, and agent-based frameworks have become widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks. It includes adaptive data repair, which automatically finds and fixes errors in the training dataset, and error data augmentation, which specifically diffuses and enhances the erroneous data predicted by the initially trained models. Because the capability of a single fine-tuned model has been found to be very limited, we also propose a multi-model collaboration training schema that trains multiple models on different augmented data, so that they acquire distinct capabilities and complement one another. Furthermore, we use an ensemble strategy that integrates the capabilities of the multiple models to solve a multiple-choice question, aiming to further improve text-to-SQL accuracy. Experimental results and an ablation study demonstrate the effectiveness of the data-centric pipeline and the Multi-Model (MM) interactive iterative strategies, achieving first place among lightweight text-to-SQL models (within 70B).
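The abstract does not say how the multiple-choice question over candidate queries is solved. As a rough illustration of combining several models' outputs, the sketch below substitutes a common stand-in technique, execution-result majority voting, which may differ from the authors' selection step. The function names, the toy `singer` table, and the candidate queries are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def execute(conn, sql):
    """Run a query; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def pick_by_execution_vote(conn, candidates):
    """Return the first candidate whose execution result is shared by the most candidates.

    Assumption: this voting rule is a stand-in, not the paper's selection method.
    """
    groups = defaultdict(list)
    for sql in candidates:
        result = execute(conn, sql)
        if result is not None:  # drop candidates that fail to execute
            groups[result].append(sql)
    if not groups:
        return None
    # On ties, max() keeps the group that was seen first.
    best_group = max(groups.values(), key=len)
    return best_group[0]


# Toy database and hypothetical candidates, one per imagined model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bo', 41), (3, 'Cy', 25);
""")
candidates = [
    "SELECT name FROM singer WHERE age > 28",
    "SELECT name FROM singer WHERE age >= 30",
    "SELECT name FROM singer WHERE age > 40",
    "SELECT nam FROM singer",  # invalid column, ignored by the vote
]
chosen = pick_by_execution_vote(conn, candidates)
print(chosen)
```

Here the first two candidates return the same rows, so they outvote the third, and the fourth is discarded because it does not execute. Voting on execution results rather than on query text lets syntactically different but equivalent queries support each other.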

Authors (8)

Yuanzhen Xie

Liu Ye

Jiqun Chu

Mochi Gao

Hehuan Liu

Yunzhi Tan

+2 more

Submitted

October 27, 2025

arXiv Category

cs.CL


Key Contributions

Proposes a fully automated data-centric pipeline for Text-to-SQL tasks, including adaptive data repair and error data augmentation. Introduces a multi-model collaboration training schema and ensemble strategy to improve model capabilities and complement individual model limitations.
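This summary does not describe how adaptive data repair locates errors. One plausible first step, shown here only as a sketch under that assumption, is to flag training pairs whose reference SQL does not execute against the schema. The `find_broken_pairs` helper, the `orders` table, and the sample pairs are hypothetical, and the actual fixing step is not shown.

```python
import sqlite3


def find_broken_pairs(conn, pairs):
    """Return (index, error message) for each pair whose SQL fails to execute."""
    broken = []
    for i, (question, sql) in enumerate(pairs):
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            broken.append((i, str(exc)))
    return broken


# Toy schema and hypothetical (question, SQL) training pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
pairs = [
    ("How many orders are there?", "SELECT COUNT(*) FROM orders"),
    ("What is the largest order total?", "SELECT MAX(totl) FROM orders"),  # typo in column
]
broken = find_broken_pairs(conn, pairs)
print(broken)
```

A check like this only catches queries that fail outright. Pairs whose SQL runs but answers a different question would need a semantic check, which is outside this sketch.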

Business Value

Automates and improves the quality of data used for training Text-to-SQL models, leading to more accurate and reliable natural language interfaces for databases, thus democratizing data access.

Paper Metadata

Innovation Type

Methodology and Framework

Deployment Feasibility

High, as it focuses on improving the training pipeline rather than a specific deployment architecture.

Limitations Addressed

The limited exploration of data-centric strategies in Text-to-SQL and the restricted capabilities of single fine-tuned models.

Technical Tags

Text-to-SQL, Data-centric AI, Automated data repair, Error data augmentation, Multi-model collaboration, Ensemble learning, Agent-based frameworks, ChatGPT, Fine-tuning, Data pipeline

Research Topics

Data-Centric AI, Text-to-SQL Generation, Model Collaboration, Automated Data Management, LLM Training Strategies

Methods & Architectures

Adaptive data repair, Error data augmentation, Multi-model collaboration training, Ensemble strategy, Agent-based frameworks, Large Language Models (LLMs)

Applications & Tasks

Database Interaction, Business Intelligence, Data Analysis, Data Quality Improvement, Model Robustness, Task Performance Enhancement, Text-to-SQL generation, Automated data pipeline construction, Multi-model training

Related Fields

Natural Language Processing, Database Systems, Machine Learning, Data Engineering, AI Ethics

Keywords

Text-to-SQL, data-centric AI, data repair, data augmentation, multi-model training, ensemble learning, LLM, ChatGPT, automated pipeline, database query, natural language interface

Academic Context

#Data-Centric AI, #Text-to-SQL Generation, #Model Collaboration, #Automated Data Management, #LLM Training Strategies

Companies & Organizations

Companies Mentioned

OpenAI

Commercial Potential

Potential Products

Automated data cleaning and augmentation tools for NLP
Enhanced Text-to-SQL generation systems
Frameworks for collaborative model training

Target Industries

Technology, Finance, Healthcare, E-commerce

Use Case Examples

Building a system that allows users to query company databases using natural language
Automating the process of preparing training data for SQL generation models
Developing a collaborative training framework for multiple specialized AI models

Competitive Edge

Addresses the under-explored area of data-centric strategies in Text-to-SQL, offering a novel automated pipeline and multi-model training approach.

Resource Requirements

Compute Needs

Moderate to High (depending on model size and data volume)

Data Requirements

Text-SQL pairs, potentially with identified errors.
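To make "pairs with identified errors" concrete, the sketch below shows one way such a set might be collected: run each reference query and the corresponding model prediction, and keep the examples where the prediction fails or returns different rows. This is an assumption about how the error set that seeds augmentation could be built, not the paper's procedure. The `Example` record, `collect_errors`, and the `city` table are hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    gold_sql: str
    predicted_sql: str  # output of a hypothetical initially trained model


def run(conn, sql):
    """Execute a query and return its rows in a stable order, or None on error."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def collect_errors(conn, examples):
    """Keep examples whose prediction fails or disagrees with the gold result."""
    errors = []
    for ex in examples:
        gold = run(conn, ex.gold_sql)
        pred = run(conn, ex.predicted_sql)
        if pred is None or pred != gold:
            errors.append(ex)
    return errors


# Toy database and hypothetical predictions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city (name TEXT, population INTEGER);
INSERT INTO city VALUES ('Lyon', 500000), ('Paris', 2100000);
""")
examples = [
    Example("Which cities have over a million people?",
            "SELECT name FROM city WHERE population > 1000000",
            "SELECT name FROM city WHERE population >= 1000000"),
    Example("How many cities are there?",
            "SELECT COUNT(*) FROM city",
            "SELECT COUNT(name) FROM city WHERE population > 1000000"),
]
errors = collect_errors(conn, examples)
print([ex.question for ex in errors])
```

Comparing execution results instead of query strings avoids flagging predictions that are written differently from the reference but return the same rows on this database.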

Deployment Constraints

Requires robust data quality checks and careful management of multiple models.

Scalability

The data pipeline and multi-model training approach are designed for scalability.

Production Readiness

Maturity Level

Research
