
Code-enabled language models can outperform reasoning models on diverse tasks

Abstract

Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
Authors (5)
Cedegao E. Zhang
CΓ©dric Colas
Gabriel Poesia
Joshua B. Tenenbaum
Jacob Andreas
Submitted
October 23, 2025
arXiv Category
cs.CL

Key Contributions

This paper demonstrates that standard instruct LMs, when combined with the CodeAdapt recipe (CodeAct-style interleaving of natural language reasoning with code execution, plus few-shot bootstrapped in-context learning), can match or exceed the reasoning capabilities of dedicated reasoning models without any further fine-tuning. This approach substantially reduces the computation and training data required for strong reasoning performance.
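The core mechanism can be illustrated with a minimal sketch of a CodeAct-style loop: the model emits free-form reasoning that may contain code blocks, each block is executed, and the execution output is fed back as an observation until the model commits to a final answer. Everything below is a hypothetical illustration, not the paper's implementation: the `fake_model` stub, the `FINAL:` answer convention, and the message format are all assumptions made for the sake of a self-contained example.

```python
import re
import io
import contextlib

def fake_model(messages):
    # Stand-in for an instruct-LM API call (hypothetical). A real setup
    # would query a model such as DeepSeek V3 with the few-shot bootstrap
    # examples prepended to the conversation.
    if not any(m["role"] == "tool" for m in messages):
        return ("Let me compute this with code.\n"
                "```python\nprint(sum(range(1, 101)))\n```")
    return "The numbers 1..100 sum to 5050. FINAL: 5050"

def codeact_loop(model, problem, max_steps=5):
    """Minimal CodeAct-style loop: the LM interleaves natural language
    reasoning with code blocks; each block is executed and its stdout is
    fed back as an observation until the LM emits a final answer."""
    messages = [{"role": "user", "content": problem}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "FINAL:" in reply:
            return reply.split("FINAL:", 1)[1].strip()
        # Extract and run every fenced Python block in the reply.
        blocks = re.findall(r"```python\n(.*?)```", reply, re.DOTALL)
        observations = []
        for code in blocks:
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf):
                exec(code, {})  # NOTE: no sandboxing here, for brevity only
            observations.append(buf.getvalue())
        messages.append({"role": "tool", "content": "\n".join(observations)})
    return None

print(codeact_loop(fake_model, "What is the sum of 1..100?"))
```

In this toy run the loop takes exactly one reason-execute round trip before the stub model answers; the multi-step structure is what lets a real LM iterate, inspect intermediate results, and revise its approach.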

Business Value

Enables more efficient and cost-effective deployment of powerful reasoning capabilities in LLM-based applications, potentially reducing infrastructure costs and improving user experience for tasks requiring complex reasoning.