Expertini Research Research
Artificial Intelligence And Data Science PDF Available Non-peer-reviewed Preprint

Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui  ·  Published 2025-11-13

Abstract

The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), emphasizing the role of high-quality training data. This shift is especially critical in the Text-to-SQL task, where the scarcity, limited diversity, and structural simplicity of existing datasets constrain model performance. To address these challenges, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that systematically generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data. Our framework spans six augmentation dimensions and integrates an end-to-end pipeline with auxiliary database selection, SQL executability verification, natural language (NL) question generation, NL-SQL correspondence verification, and chain-of-thought (CoT) reasoning trace generation. Leveraging this framework, we construct SQLFlow, a high-quality dataset comprising 75,386 annotated examples. We demonstrate the utility of SQLFlow in both fine-tuning and prompt-based settings. (1) For open-source large language models (LLMs), fine-tuning with SQLFlow improves problem-solving ability, delivering competitive gains across multiple benchmarks under the same data budget. (2) For closed-source LLMs, we propose a masked alignment retrieval method that uses SQLFlow as both a knowledge base and training data for the retrieval model, enabling structure-aware example matching via fine-grained NL-SQL alignments. Experiments show that our retrieval strategy outperforms existing example retrieval methods, highlighting the combined value of SQLFlow's data quality and our retrieval technique. Overall, our work provides a scalable, data-centric foundation for advancing Text-to-SQL systems and underscores the importance of structured, high-fidelity data in modern AI development. Our code is available at https://github.com/TechNomad-ds/Text2SQL-Flow.
📄 Full Paper Available as PDF
This paper is available as a downloadable PDF.
📄 Download PDF

✨ AI Plain-English Summary

Get a plain-English summary of this paper generated by AI (5 free per day).

Comments (0)

No comments yet. Be the first to comment.

Related Papers

Artificial Intelligence And Data Science PDF

Sparse matrix-variate Gaussian process blockmodels for network modeling

2012
Artificial Intelligence And Data Science PDF

Hierarchical Maximum Margin Learning for Multi-Class Classification

2012
Artificial Intelligence And Data Science PDF

Tightening MRF Relaxations with Planar Subproblems

2012
Artificial Intelligence And Data Science PDF

Rank/Norm Regularization with Closed-Form Solutions: Application to ...

2012