DCMM-SQL: Automated Data-Centric Pipeline and Multi-Model Collaboration Training for Text-to-SQL Model

Abstract

Text-to-SQL tasks have improved substantially since the release of ChatGPT, and agent-based frameworks have been widely used in this field. However, the impact of data-centric strategies on text-to-SQL tasks has rarely been explored. In this paper, we systematically design a fully automated data-centric pipeline for text-to-SQL tasks. It includes adaptive data repair, which automatically finds and fixes errors in the training dataset, and error data augmentation, which diffuses and enhances the erroneous data predicted by the initially trained models. We also propose a Multi-Model (MM) collaboration training schema that trains multiple models on different augmented data, so that they acquire distinct capabilities and complement each other; the motivation is that the capability of a single fine-tuned model has been found to be very limited. Furthermore, we use an ensemble strategy that integrates the capabilities of the multiple models by posing the final selection as a multiple-choice question, aiming to further improve the accuracy of text-to-SQL tasks. Experimental results and an ablation study demonstrate the effectiveness of the data-centric pipeline and the Multi-Model interactive iterative strategies, which achieve first place among lightweight text-to-SQL models (within 70B parameters).
Authors (8)
Yuanzhen Xie
Liu Ye
Jiqun Chu
Mochi Gao
Hehuan Liu
Yunzhi Tan
+2 more
Submitted
October 27, 2025
arXiv Category
cs.CL

Key Contributions

Proposes a fully automated data-centric pipeline for Text-to-SQL tasks, including adaptive data repair and error data augmentation. Introduces a multi-model collaboration training schema and ensemble strategy to improve model capabilities and complement individual model limitations.
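
To make the idea of combining several models' outputs concrete, here is a minimal sketch. It does not reproduce the paper's method: the abstract describes the ensemble as solving a multiple-choice question, while this sketch swaps in a simpler and commonly used stand-in, majority voting over execution results. The function names, the toy `singer` table, and the candidate queries are all hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import defaultdict


def run_query(conn, sql):
    """Execute sql and return an order-insensitive, hashable result.

    Returns None when the query fails, so invalid candidates can be skipped.
    """
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows containing NULLs or mixed types remain comparable.
    return tuple(sorted(rows, key=repr))


def vote_on_candidates(conn, candidates):
    """Return the candidate whose execution result is shared by the most candidates.

    Assumption: each candidate comes from a different fine-tuned model.
    Ties go to the result group that appeared first.
    """
    groups = defaultdict(list)
    for sql in candidates:
        result = run_query(conn, sql)
        if result is not None:
            groups[result].append(sql)
    if not groups:
        return None
    # dicts preserve insertion order and max() keeps the first maximal item.
    best_group = max(groups.values(), key=len)
    return best_group[0]


def demo():
    """Build a toy database and vote over four hypothetical model outputs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 31), ("Bo", 45), ("Cy", 28)],
    )
    candidates = [
        "SELECT name FROM singer WHERE age > 30",   # hypothetical model A
        "SELECT name FROM singer WHERE age >= 31",  # hypothetical model B
        "SELECT name FROM singer WHERE age > 40",   # hypothetical model C
        "SELECT nme FROM singer",                   # invalid SQL, skipped
    ]
    return vote_on_candidates(conn, candidates)
```

Calling `demo()` returns the first query, because two of the three executable candidates agree on the same result set. Voting on execution results rather than on SQL text lets syntactically different but equivalent queries support each other.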

Business Value

Automates and improves the quality of data used for training Text-to-SQL models, leading to more accurate and reliable natural language interfaces for databases, thus democratizing data access.