Abstract
Large language models (LLMs) excel at zero-shot inference but continue to
struggle with complex, multi-step reasoning. Recent methods that augment LLMs
with intermediate reasoning steps such as Chain of Thought (CoT) and Program of
Thought (PoT) improve performance but often produce undesirable solutions,
especially in algorithmic domains. We introduce Per-Instance Program Synthesis
(PIPS), a method that generates and refines programs at the instance level
using structural feedback without relying on task-specific guidance or explicit
test cases. To further improve performance, PIPS incorporates a confidence
metric that dynamically chooses between direct inference and program synthesis
on a per-instance basis. Experiments across three frontier LLMs and 30
benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question
answering tasks, relational reasoning tasks, and mathematical reasoning tasks
show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and
9.4% compared to PoT and CoT respectively, and reduces undesirable program
generations by 65.1% on the algorithmic tasks compared to PoT with
Gemini-2.0-Flash.
Authors
Adam Stein
Neelay Velingker
Mayur Naik
Eric Wong
Submitted
October 26, 2025
Key Contributions
Introduces Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback, without task-specific guidance. It incorporates a confidence metric to dynamically choose between direct inference and program synthesis, significantly improving accuracy on complex reasoning tasks.
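
To make the per-instance dispatch concrete, the sketch below gates between direct inference and program synthesis using a confidence threshold, then refines the synthesized program in a small structural-feedback loop. It is a minimal illustration only: the names (`llm`, `confidence_of`, `structural_feedback`, `run_program`), the 0.8 threshold, and the syntax-only structural check are assumptions for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: every name below (llm, confidence_of,
# structural_feedback, run_program) and the 0.8 threshold are placeholder
# assumptions, not the PIPS implementation described in the paper.
from typing import Callable, Optional


def pips_answer(
    question: str,
    llm: Callable[[str], str],               # hypothetical LLM completion call
    confidence_of: Callable[[str], float],   # hypothetical per-instance confidence metric
    threshold: float = 0.8,
    max_refinements: int = 3,
) -> str:
    """Dispatch between direct inference and program synthesis per instance."""
    # 1. High confidence: answer directly (CoT-style inference).
    if confidence_of(question) >= threshold:
        return llm(f"Answer step by step:\n{question}")

    # 2. Low confidence: synthesize a program for this specific instance.
    program = llm(f"Write a Python program whose `result` variable answers:\n{question}")

    # 3. Refine with structural feedback only (no task-specific test cases).
    for _ in range(max_refinements):
        feedback = structural_feedback(program)
        if feedback is None:
            break
        program = llm(f"Revise the program.\n{program}\nFeedback: {feedback}")

    return run_program(program)


def structural_feedback(program: str) -> Optional[str]:
    """Toy structural check: report syntax errors; None means no issue found."""
    try:
        compile(program, "<candidate>", "exec")
    except SyntaxError as err:
        return f"SyntaxError: {err.msg} on line {err.lineno}"
    return None


def run_program(program: str) -> str:
    """Execute the candidate program and return its `result` variable."""
    scope: dict = {}
    exec(program, scope)  # note: sandbox untrusted code in real use
    return str(scope.get("result", ""))
```

In this toy version, the structural check stands in for the paper's richer structural feedback and could be replaced by whatever program-level signals are available, while the confidence gate determines per instance whether program synthesis is attempted at all.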
Business Value
Enhances the capability of AI systems to solve complex problems, leading to more powerful AI assistants, automated code generation, and improved performance in scientific and engineering domains.