Abstract
Large language models (LLMs) have demonstrated impressive reasoning
capabilities, but scaling their performance often relies on massive reasoning
datasets that are computationally expensive to train on. Existing data
selection methods aim to curate smaller, high-quality subsets but often rely on
costly external models or opaque heuristics. In this work, we shift the focus
from external heuristics to the model's internal mechanisms. We find that
complex reasoning tasks consistently activate a sparse, specialized subset of
attention heads, forming core reasoning circuits. Building on this insight, we
propose CircuitSeer, a novel data selection method that quantifies the
reasoning complexity of data by measuring its influence on these crucial
circuits. Extensive experiments on 4 models and 9 datasets demonstrate
CircuitSeer's superiority. Notably, fine-tuning Qwen2.5-Math-7B on just 10% of
data selected by our method achieves a 1.4-point gain in average Pass@1 over
training on the full dataset, highlighting its efficiency and effectiveness.
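The procedure the abstract describes can be pictured in two steps: probe the model with known reasoning prompts to find the sparse set of attention heads that activate consistently (the "reasoning circuits"), then score each candidate training example by how strongly it drives those heads. The sketch below, written against the Hugging Face transformers API, illustrates that idea and is not the authors' implementation; the model name, the mean off-diagonal attention mass used as an activation proxy, and the 5% head-selection quantile are all assumptions.

```python
# A minimal sketch of the CircuitSeer idea (not the released implementation):
#   1) find attention heads that activate strongly on known reasoning prompts,
#   2) score candidate training examples by how much they drive those heads.
# Model name, activation proxy, and the 5% quantile are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Math-7B"  # any causal LM that can return attentions works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_attentions=True, torch_dtype=torch.bfloat16
).eval()


@torch.no_grad()
def head_activation_map(text: str) -> torch.Tensor:
    """Per-head activation strength for one input, as a (layers, heads) tensor.

    Uses mean attention mass off the diagonal as a cheap proxy for how busy each
    head is on this input; the paper's actual influence measure may differ.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    attentions = model(**inputs).attentions            # tuple of (1, heads, seq, seq)
    per_layer = []
    for layer_attn in attentions:
        attn = layer_attn[0].float()                   # (heads, seq, seq)
        diag = torch.diag_embed(torch.diagonal(attn, dim1=-2, dim2=-1))
        per_layer.append((attn - diag).abs().mean(dim=(-2, -1)))  # (heads,)
    return torch.stack(per_layer)                      # (layers, heads)


def identify_circuit_heads(reasoning_probes: list[str], top_frac: float = 0.05) -> torch.Tensor:
    """Boolean (layers, heads) mask of the heads most active on reasoning probes."""
    mean_map = torch.stack([head_activation_map(p) for p in reasoning_probes]).mean(0)
    threshold = torch.quantile(mean_map.flatten(), 1 - top_frac)
    return mean_map >= threshold


def circuit_score(example: str, circuit_mask: torch.Tensor) -> float:
    """Score a candidate training example by its mean activation on circuit heads."""
    return head_activation_map(example)[circuit_mask].mean().item()
```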
Authors (6)
Shaobo Wang
Yongliang Miao
Yuancheng Liu
Qianli Ma
Ning Liao
Linfeng Zhang
Submitted
October 21, 2025
Key Contributions
CircuitSeer is a data selection method that leverages the internal mechanisms of LLMs, identifying the 'reasoning circuits' formed by a sparse set of specialized attention heads. By quantifying each example's influence on these circuits, it selects high-quality training data more effectively, yielding significant performance gains from substantially less data.
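Continuing the hedged sketch above, selection then reduces to ranking a candidate pool by circuit score and keeping the top fraction; the probe prompts and pool entries below are placeholders, and the 10% keep ratio mirrors the abstract's experiment.

```python
# Continuing the sketch above: rank a candidate pool by circuit score and
# keep the top 10% for fine-tuning. Probes and pool entries are placeholders.
reasoning_probes = [
    "Prove that the sum of two odd integers is even.",
    "Solve for x: 3x + 7 = 22, showing each step.",
]
candidate_pool = [
    "Compute 17 * 23.",                                                 # shallow
    "Show that there are infinitely many primes of the form 4k + 3.",   # deeper reasoning
]

circuit_mask = identify_circuit_heads(reasoning_probes)
ranked = sorted(candidate_pool, key=lambda ex: circuit_score(ex, circuit_mask), reverse=True)
selected_subset = ranked[: max(1, len(ranked) // 10)]   # top 10% by circuit influence
print(selected_subset)
```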
Business Value
This method can drastically reduce the cost and time required to train high-performing LLMs, making advanced AI capabilities more accessible and accelerating development cycles in various industries.