Abstract
Post-training of Large Language Models (LLMs) is crucial for unlocking their
task generalization potential and domain-specific capabilities. However, the
current LLM post-training paradigm faces significant data challenges, including
the high cost of manual annotation and the diminishing marginal returns of
ever-larger datasets. Achieving data-efficient post-training has therefore become a key
research question. In this paper, we present the first systematic survey of
data-efficient LLM post-training from a data-centric perspective. We propose a
taxonomy of data-efficient LLM post-training methods, covering data selection,
data quality enhancement, synthetic data generation, data distillation and
compression, and self-evolving data ecosystems. We summarize representative
approaches in each category and outline future research directions. By
examining the challenges in data-efficient LLM post-training, we highlight open
problems and propose potential research avenues. We hope our work inspires
further exploration into maximizing the potential of data utilization in
large-scale model training.
Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
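To make the first taxonomy category concrete, below is a minimal sketch of budgeted data selection: rank candidate post-training examples with a quality score and keep only the top fraction. The scoring heuristic here is a toy placeholder of our own, not a method from the paper; approaches surveyed in this space typically substitute signals such as proxy-model perplexity, gradient-influence estimates, or LLM-as-judge ratings.

```python
# Illustrative sketch of budgeted data selection (not the paper's method).
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str


def quality_score(ex: Example) -> float:
    # Toy placeholder heuristic: reward longer, lexically diverse responses.
    tokens = ex.response.split()
    if not tokens:
        return 0.0
    diversity = len(set(tokens)) / len(tokens)  # type-token ratio
    return diversity * min(len(tokens), 200)    # cap the length contribution


def select_top_fraction(pool: list[Example], keep: float = 0.1) -> list[Example]:
    # Budgeted selection: keep the highest-scoring `keep` fraction of the pool.
    ranked = sorted(pool, key=quality_score, reverse=True)
    k = max(1, int(len(ranked) * keep))
    return ranked[:k]


if __name__ == "__main__":
    pool = [
        Example("Explain TCP.", "TCP is a reliable, ordered, connection-oriented transport protocol."),
        Example("Explain TCP.", "tcp tcp tcp tcp"),
    ]
    for ex in select_top_fraction(pool, keep=0.5):
        print(ex.prompt, "->", ex.response)
```

The design point is that selection decouples the scoring signal from the budgeting rule: swapping in a stronger scorer leaves `select_top_fraction` unchanged.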
Authors (11)
Junyu Luo
Bohan Wu
Xiao Luo
Zhiping Xiao
Yiqiao Jin
Rong-Cheng Tu
and 5 more
Submitted
October 29, 2025
Key Contributions
This paper presents the first systematic survey of data-efficient LLM post-training from a data-centric perspective. It proposes a taxonomy spanning data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems, aiming to address the high annotation costs and diminishing returns of current LLM post-training paradigms.
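As a second concrete illustration, the sketch below shows the shape of a self-instruct-style synthetic data generation loop: sample seed tasks, prompt a teacher model for a new task, and deduplicate. The prompt template, the `teacher` callable, and the stub teacher are assumptions for illustration only, not methods from the paper.

```python
# Illustrative sketch of a synthetic data generation loop (assumptions only).
import random
from typing import Callable

# Hypothetical prompt template; real pipelines tune this carefully.
PROMPT_TEMPLATE = (
    "Here are example tasks:\n{seeds}\n"
    "Write one new task in a similar style, different from the examples:"
)


def generate_synthetic_tasks(
    seed_tasks: list[str],
    teacher: Callable[[str], str],
    n_new: int = 3,
    n_seeds_per_prompt: int = 2,
) -> list[str]:
    # Sample seed tasks, ask the teacher model for a new one, and drop exact
    # duplicates; a capped attempt count avoids looping forever if the
    # teacher keeps repeating itself.
    seen = {t.strip().lower() for t in seed_tasks}
    out: list[str] = []
    attempts = 0
    while len(out) < n_new and attempts < 10 * n_new:
        attempts += 1
        seeds = "\n".join(random.sample(seed_tasks, k=n_seeds_per_prompt))
        candidate = teacher(PROMPT_TEMPLATE.format(seeds=seeds)).strip()
        if candidate and candidate.lower() not in seen:
            seen.add(candidate.lower())
            out.append(candidate)
    return out


if __name__ == "__main__":
    # Stub teacher for demonstration only; swap in a real LLM call here.
    def stub_teacher(prompt: str) -> str:
        return f"Summarize a news article in one sentence (variant {random.randint(0, 999)})."

    seeds = ["Translate this English sentence to French.", "Write a haiku about autumn."]
    print(generate_synthetic_tasks(seeds, stub_teacher))
```

In practice, generation loops of this kind are paired with the selection and quality-enhancement stages above, so that only filtered synthetic examples reach training.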
Business Value
By providing a structured overview of data-efficient LLM training, this research can help organizations reduce the significant costs associated with manual data annotation and improve the effectiveness of LLM deployment for specific tasks and domains.