Abstract
Post-training of Large Language Models (LLMs) is crucial for unlocking their
task generalization potential and domain-specific capabilities. However, the
current LLM post-training paradigm faces significant data challenges, including
the high cost of manual annotation and the diminishing marginal returns of
ever-larger datasets. Achieving data-efficient post-training has therefore become a key
research question. In this paper, we present the first systematic survey of
data-efficient LLM post-training from a data-centric perspective. We propose a
taxonomy of data-efficient LLM post-training methods, covering data selection,
data quality enhancement, synthetic data generation, data distillation and
compression, and self-evolving data ecosystems. We summarize representative
approaches in each category and outline future research directions. By
examining the challenges in data-efficient LLM post-training, we highlight open
problems and propose potential research avenues. We hope our work inspires
further exploration into maximizing the potential of data utilization in
large-scale model training.
Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
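To make the first taxonomy category concrete, below is a minimal sketch of budgeted data selection: rank candidate post-training examples with a quality score and keep only the top fraction. The scoring heuristic here is a toy placeholder of our own, not a method from the paper; approaches surveyed in this space typically substitute signals such as proxy-model perplexity, gradient-influence estimates, or LLM-as-judge ratings.

```python
# Illustrative sketch of budgeted data selection (not the paper's method).
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str


def quality_score(ex: Example) -> float:
    # Toy placeholder heuristic: reward longer, lexically diverse responses.
    tokens = ex.response.split()
    if not tokens:
        return 0.0
    diversity = len(set(tokens)) / len(tokens)  # type-token ratio
    return diversity * min(len(tokens), 200)    # cap the length contribution


def select_top_fraction(pool: list[Example], keep: float = 0.1) -> list[Example]:
    # Budgeted selection: keep the highest-scoring `keep` fraction of the pool.
    ranked = sorted(pool, key=quality_score, reverse=True)
    k = max(1, int(len(ranked) * keep))
    return ranked[:k]


if __name__ == "__main__":
    pool = [
        Example("Explain TCP.", "TCP is a reliable, ordered, connection-oriented transport protocol."),
        Example("Explain TCP.", "tcp tcp tcp tcp"),
    ]
    for ex in select_top_fraction(pool, keep=0.5):
        print(ex.prompt, "->", ex.response)
```

The design point is that selection decouples the scoring signal from the budgeting rule: swapping in a stronger scorer leaves `select_top_fraction` unchanged.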
Authors (11)
Junyu Luo
Bohan Wu
Xiao Luo
Zhiping Xiao
Yiqiao Jin
Rong-Cheng Tu
and 5 more
Submitted
October 29, 2025
Key Contributions
This paper presents the first systematic survey of data-efficient LLM post-training from a data-centric perspective. It proposes a taxonomy spanning data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems, aiming to address the high annotation costs and diminishing returns of current LLM post-training paradigms.
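As a second concrete illustration, the sketch below shows the shape of a self-instruct-style synthetic data generation loop: sample seed tasks, prompt a teacher model for a new task, and deduplicate. The prompt template, the `teacher` callable, and the stub teacher are assumptions for illustration only, not methods from the paper.

```python
# Illustrative sketch of a synthetic data generation loop (assumptions only).
import random
from typing import Callable

# Hypothetical prompt template; real pipelines tune this carefully.
PROMPT_TEMPLATE = (
    "Here are example tasks:\n{seeds}\n"
    "Write one new task in a similar style, different from the examples:"
)


def generate_synthetic_tasks(
    seed_tasks: list[str],
    teacher: Callable[[str], str],
    n_new: int = 3,
    n_seeds_per_prompt: int = 2,
) -> list[str]:
    # Sample seed tasks, ask the teacher model for a new one, and drop exact
    # duplicates; a capped attempt count avoids looping forever if the
    # teacher keeps repeating itself.
    seen = {t.strip().lower() for t in seed_tasks}
    out: list[str] = []
    attempts = 0
    while len(out) < n_new and attempts < 10 * n_new:
        attempts += 1
        seeds = "\n".join(random.sample(seed_tasks, k=n_seeds_per_prompt))
        candidate = teacher(PROMPT_TEMPLATE.format(seeds=seeds)).strip()
        if candidate and candidate.lower() not in seen:
            seen.add(candidate.lower())
            out.append(candidate)
    return out


if __name__ == "__main__":
    # Stub teacher for demonstration only; swap in a real LLM call here.
    def stub_teacher(prompt: str) -> str:
        return f"Summarize a news article in one sentence (variant {random.randint(0, 999)})."

    seeds = ["Translate this English sentence to French.", "Write a haiku about autumn."]
    print(generate_synthetic_tasks(seeds, stub_teacher))
```

In practice, generation loops of this kind are paired with the selection and quality-enhancement stages above, so that only filtered synthetic examples reach training.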
Business Value
By providing a structured overview of data-efficient LLM training, this research can help organizations reduce the significant costs associated with manual data annotation and improve the effectiveness of LLM deployment for specific tasks and domains.