
Code Execution as Grounded Supervision for LLM Reasoning

Abstract

Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We propose a scalable method for generating a high-quality CoT supervision dataset by leveraging the determinism of program execution. Unlike existing reasoning dataset generation methods that rely on costly human annotations or error-prone LLM-generated CoT, our approach extracts verifiable, step-by-step reasoning traces from code execution and transforms them into natural language CoT reasoning. Experiments on reasoning benchmarks across various domains show that our method effectively equips LLMs with transferable reasoning abilities across diverse tasks. Furthermore, ablation studies confirm that our method produces highly accurate reasoning data and reduces overall token length during inference by curbing meaningless repetition and overthinking.
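
The core mechanism the abstract describes, deterministic execution as a source of verifiable supervision, is easy to picture with a small sketch. The harness below is not the authors' pipeline; it is a minimal illustration assuming Python's standard tracer hook, and the names `trace_execution` and `sum_of_squares` are hypothetical.

```python
# Illustrative only: not the authors' released pipeline. A minimal way to
# harvest deterministic, step-by-step traces in Python is to hook the
# interpreter's tracer and snapshot local variables at each executed line.
import sys

def trace_execution(func, *args):
    """Run func(*args), recording (line_number, locals) before each line."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, trace

def sum_of_squares(n):
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

result, steps = trace_execution(sum_of_squares, 3)
print(result)  # 14 -- rerunning always yields the same trace
for lineno, local_vars in steps:
    print(lineno, local_vars)
```

Because execution is deterministic, every recorded (line, state) pair can be re-verified by simply rerunning the program, which is what makes such traces attractive as reasoning supervision.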
Authors: Dongwon Jung, Wenxuan Zhou, Muhao Chen
Submitted: June 12, 2025
arXiv Category: cs.CL

Key Contributions

Proposes a scalable method for generating high-quality chain-of-thought (CoT) supervision data by leveraging the determinism of program execution. The approach extracts verifiable, step-by-step reasoning traces from code execution and converts them into natural-language CoT, yielding LLMs with improved and transferable reasoning abilities.
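
The second half of this contribution, turning raw traces into natural-language CoT, could look something like the sketch below. The `verbalize` function and its English templates are my own illustration, assuming trace records shaped like those produced by the tracer sketch above; the paper's actual verbalization scheme may differ.

```python
# Hypothetical verbalization step: turn (line_number, locals) snapshots,
# e.g. from the tracer sketch above, into plain-English CoT sentences.
# The templates are illustrative, not the paper's actual scheme.
def verbalize(steps):
    """Emit one sentence per observed variable change across the trace."""
    sentences = []
    prev = {}
    for lineno, local_vars in steps:
        for name, value in local_vars.items():
            if prev.get(name) != value:  # only narrate state changes
                sentences.append(f"At line {lineno}, {name} = {value!r}.")
        prev = dict(local_vars)
    return " ".join(sentences)

# A tiny hand-written trace of a loop body `total += i * i` over i = 1, 2:
example = [(3, {"i": 1, "total": 1}), (3, {"i": 2, "total": 5})]
print(verbalize(example))
# -> At line 3, i = 1. At line 3, total = 1. At line 3, i = 2. At line 3, total = 5.
```

Since every sentence is derived from an actually observed program state, CoT produced this way is correct by construction, in contrast to LLM-generated rationales that must be verified after the fact.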

Business Value

Enables more efficient and effective training of LLMs for complex reasoning tasks, potentially leading to more capable and reliable AI systems in areas requiring logical deduction.