Abstract
Long-context language models unlock advanced capabilities in reasoning, code
generation, and document summarization by leveraging dependencies across
extended spans of text. However, a significant portion of readily available
long-text data lacks meaningful long-distance dependencies; most spans can be
predicted using only local context. Training on such data is inefficient,
making careful data selection crucial. Therefore, we introduce LongFilter, a
framework for curating training data tailored to long-context pretraining.
LongFilter measures the information gain provided by extended context by
contrasting model predictions under long-context versus short-context settings,
thereby identifying samples where long-range dependencies are essential.
Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show
that LongFilter efficiently selects high-quality data and yields substantial
improvements on benchmarks such as HELMET, LongBench, and RULER.
Authors (7)
Haoran Deng
Yingyu Lin
Zhenghao Lin
Xiao Liu
Yizhou Sun
Yi-An Ma
Submitted
October 29, 2025
Key Contributions
This paper introduces LongFilter, a framework for curating training data specifically for long-context LLM pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B show that LongFilter efficiently selects high-quality data, yielding substantial improvements on long-context benchmarks.
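To make the selection idea concrete, below is a minimal sketch of contrasting long-context and short-context predictions to score a document. It assumes a Hugging Face causal LM; the model name, window sizes, target-span length, and the exact information-gain formula (difference in mean log-likelihood) are illustrative assumptions, not the paper's precise recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative configuration (assumptions, not the paper's exact settings).
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"   # scoring model
SHORT_CTX = 2048                            # tokens of local context
LONG_CTX = 32768                            # tokens of extended context

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def span_log_likelihood(context_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Average log-likelihood of target_ids conditioned on context_ids."""
    input_ids = torch.cat([context_ids, target_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]
    # logits[i] predicts token i+1, so predictions for the target span
    # start at the last context position.
    start = context_ids.size(0) - 1
    log_probs = torch.log_softmax(logits[start:start + target_ids.size(0)], dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return token_ll.mean().item()

def longfilter_score(document: str, target_len: int = 512) -> float:
    """Information gain of extended context: long-context LL minus short-context LL."""
    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    target = ids[-target_len:]
    long_ctx = ids[-(LONG_CTX + target_len):-target_len]
    short_ctx = ids[-(SHORT_CTX + target_len):-target_len]
    return span_log_likelihood(long_ctx, target) - span_log_likelihood(short_ctx, target)

# Documents with a high score depend on long-range context and would be kept for
# long-context pretraining; near-zero scores indicate locally predictable text.
```

In this reading, a large positive score means the extended context substantially improves prediction of the target span, i.e. the sample exercises long-range dependencies; how such scores are thresholded or ranked for the final training mix is left to the paper.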
Business Value
Enables the development of more capable LLMs that can process and understand longer documents and complex information, leading to better performance in tasks like summarization, code generation, and complex reasoning.