Abstract
Large language models (LLMs) have demonstrated impressive reasoning
capabilities, but scaling their performance often relies on massive reasoning
datasets that are computationally expensive to train on. Existing data
selection methods aim to curate smaller, high-quality subsets but often rely on
costly external models or opaque heuristics. In this work, we shift the focus
from external heuristics to the model's internal mechanisms. We find that
complex reasoning tasks consistently activate a sparse, specialized subset of
attention heads, forming core reasoning circuits. Building on this insight, we
propose CircuitSeer, a novel data selection method that quantifies the
reasoning complexity of data by measuring its influence on these crucial
circuits. Extensive experiments on 4 models and 9 datasets demonstrate
CircuitSeer's superiority. Notably, fine-tuning Qwen2.5-Math-7B on just 10% of
data selected by our method achieves a 1.4-point gain in average Pass@1 over
training on the full dataset, highlighting its efficiency and effectiveness.
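The procedure the abstract describes can be pictured in two steps: probe the model with known reasoning prompts to find the sparse set of attention heads that activate consistently (the "reasoning circuits"), then score each candidate training example by how strongly it drives those heads. The sketch below, written against the Hugging Face transformers API, illustrates that idea and is not the authors' implementation; the model name, the mean off-diagonal attention mass used as an activation proxy, and the 5% head-selection quantile are all assumptions.

```python
# A minimal sketch of the CircuitSeer idea (not the released implementation):
#   1) find attention heads that activate strongly on known reasoning prompts,
#   2) score candidate training examples by how much they drive those heads.
# Model name, activation proxy, and the 5% quantile are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-7B"  # any causal LM that can return attentions works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_attentions=True, torch_dtype=torch.bfloat16
).eval()


@torch.no_grad()
def head_activation_map(text: str) -> torch.Tensor:
    """Per-head activation strength for one input, as a (layers, heads) tensor.

    Uses mean attention mass off the diagonal as a cheap proxy for how busy each
    head is on this input; the paper's actual influence measure may differ.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    attentions = model(**inputs).attentions            # tuple of (1, heads, seq, seq)
    per_layer = []
    for layer_attn in attentions:
        attn = layer_attn[0].float()                   # (heads, seq, seq)
        diag = torch.diag_embed(torch.diagonal(attn, dim1=-2, dim2=-1))
        per_layer.append((attn - diag).abs().mean(dim=(-2, -1)))  # (heads,)
    return torch.stack(per_layer)                      # (layers, heads)


def identify_circuit_heads(reasoning_probes: list[str], top_frac: float = 0.05) -> torch.Tensor:
    """Boolean (layers, heads) mask of the heads most active on reasoning probes."""
    mean_map = torch.stack([head_activation_map(p) for p in reasoning_probes]).mean(0)
    threshold = torch.quantile(mean_map.flatten(), 1 - top_frac)
    return mean_map >= threshold


def circuit_score(example: str, circuit_mask: torch.Tensor) -> float:
    """Score a candidate training example by its mean activation on circuit heads."""
    return head_activation_map(example)[circuit_mask].mean().item()
```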
Authors (6)
Shaobo Wang
Yongliang Miao
Yuancheng Liu
Qianli Ma
Ning Liao
Linfeng Zhang
Submitted
October 21, 2025
Key Contributions
CircuitSeer is a data selection method that leverages the internal mechanisms of LLMs, identifying the 'reasoning circuits' formed by a sparse set of specialized attention heads. By quantifying each example's influence on these circuits, it selects high-quality training data more effectively, yielding significant performance gains from substantially less data.
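Continuing the hedged sketch above, selection then reduces to ranking a candidate pool by circuit score and keeping the top fraction; the probe prompts and pool entries below are placeholders, and the 10% keep ratio mirrors the abstract's experiment.

```python
# Continuing the sketch above: rank a candidate pool by circuit score and
# keep the top 10% for fine-tuning. Probes and pool entries are placeholders.
reasoning_probes = [
    "Prove that the sum of two odd integers is even.",
    "Solve for x: 3x + 7 = 22, showing each step.",
]
candidate_pool = [
    "Compute 17 * 23.",                                                 # shallow
    "Show that there are infinitely many primes of the form 4k + 3.",   # deeper reasoning
]

circuit_mask = identify_circuit_heads(reasoning_probes)
ranked = sorted(candidate_pool, key=lambda ex: circuit_score(ex, circuit_mask), reverse=True)
selected_subset = ranked[: max(1, len(ranked) // 10)]   # top 10% by circuit influence
print(selected_subset)
```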
Business Value
This method can drastically reduce the cost and time required to train high-performing LLMs, making advanced AI capabilities more accessible and accelerating development cycles in various industries.