Abstract
Long-context language models unlock advanced capabilities in reasoning, code
generation, and document summarization by leveraging dependencies across
extended spans of text. However, a significant portion of readily available
long-text data lacks meaningful long-distance dependencies; most spans can be
predicted using only local context. Training on such data is inefficient,
making careful data selection crucial. Therefore, we introduce LongFilter, a
framework for curating training data tailored to long-context pretraining.
LongFilter measures the information gain provided by extended context by
contrasting model predictions under long-context versus short-context settings,
thereby identifying samples where long-range dependencies are essential.
Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show
that LongFilter efficiently selects high-quality data and yields substantial
improvements on benchmarks such as HELMET, LongBench, and RULER.
Authors (7)
Haoran Deng
Yingyu Lin
Zhenghao Lin
Xiao Liu
Yizhou Sun
Yi-An Ma
Submitted
October 29, 2025
Key Contributions
This paper introduces LongFilter, a framework for curating training data specifically for long-context LLM pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B show that LongFilter efficiently selects high-quality data, yielding substantial improvements on long-context benchmarks.
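To make the selection idea concrete, below is a minimal sketch of contrasting long-context and short-context predictions to score a document. It assumes a Hugging Face causal LM; the model name, window sizes, target-span length, and the exact information-gain formula (difference in mean log-likelihood) are illustrative assumptions, not the paper's precise recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative configuration (assumptions, not the paper's exact settings).
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"   # scoring model
SHORT_CTX = 2048                            # tokens of local context
LONG_CTX = 32768                            # tokens of extended context

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def span_log_likelihood(context_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Average log-likelihood of target_ids conditioned on context_ids."""
    input_ids = torch.cat([context_ids, target_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]
    # logits[i] predicts token i+1, so predictions for the target span
    # start at the last context position.
    start = context_ids.size(0) - 1
    log_probs = torch.log_softmax(logits[start:start + target_ids.size(0)], dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()

def longfilter_score(document: str, target_len: int = 512) -> float:
    """Information gain of extended context: long-context LL minus short-context LL."""
    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    target = ids[-target_len:]
    long_ctx = ids[-(LONG_CTX + target_len):-target_len]
    short_ctx = ids[-(SHORT_CTX + target_len):-target_len]
    return span_log_likelihood(long_ctx, target) - span_log_likelihood(short_ctx, target)

# Documents with a high score depend on long-range context and would be kept for
# long-context pretraining; near-zero scores indicate locally predictable text.
```

In this reading, a large positive score means the extended context substantially improves prediction of the target span, i.e. the sample exercises long-range dependencies; how such scores are thresholded or ranked for the final training mix is left to the paper.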
Business Value
Enables the development of more capable LLMs that can process and understand longer documents and complex information, leading to better performance in tasks like summarization, code generation, and complex reasoning.