Abstract
Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such
as Group Relative Policy Optimization (GRPO), have achieved remarkable progress
in improving the reasoning capabilities of Large Reasoning Models (LRMs).
However, they exhibit limited exploration because on-policy rollouts are
confined to the current policy's distribution, resulting in narrow
trajectory diversity. Recent approaches attempt to expand policy coverage by
incorporating trajectories generated from stronger expert models, yet this
reliance increases computational cost, and such advanced models are often
inaccessible. To address these issues, we propose In-Context Steered Policy
Optimization (ICPO), a unified framework that leverages the inherent in-context
learning capability of LRMs to provide expert guidance using existing datasets.
ICPO introduces Mixed-Policy GRPO with Implicit Expert Forcing, which expands
exploration beyond the current policy distribution without requiring advanced
LRM trajectories. To further stabilize optimization, ICPO integrates Expert
Region Reject Sampling to filter unreliable off-policy trajectories and
Annealed Expert-Bonus Reward Shaping to balance early expert guidance with
later autonomous improvement. Results demonstrate that ICPO consistently
enhances reinforcement learning performance and training stability on
mathematical reasoning benchmarks, revealing a scalable and effective RLVR
paradigm for LRMs.
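Since the abstract only names ICPO's components, the following is a minimal sketch of how the Annealed Expert-Bonus Reward Shaping and group-relative advantage computation might fit together. All function names, the group size, and the cosine annealing schedule are illustrative assumptions, not the authors' implementation.

```python
import math

def expert_bonus(step: int, total_steps: int, b0: float = 0.5) -> float:
    """Annealed expert bonus: large early in training to encourage following
    in-context expert guidance, decaying toward zero so the policy improves
    autonomously later on (a cosine schedule is assumed here)."""
    frac = min(step / total_steps, 1.0)
    return b0 * 0.5 * (1.0 + math.cos(math.pi * frac))

def shaped_rewards(verifier_rewards, is_expert_guided, step, total_steps):
    """Add the annealed bonus only to trajectories sampled under in-context
    expert guidance (the off-policy part of the rollout group)."""
    b = expert_bonus(step, total_steps)
    return [r + (b if guided else 0.0)
            for r, guided in zip(verifier_rewards, is_expert_guided)]

def grpo_advantages(rewards, eps: float = 1e-6):
    """Standard GRPO: normalize rewards within a rollout group so each
    trajectory's advantage is relative to its group mates."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (std + eps) for r in rewards]

# Example: a group of 6 rollouts, 2 expert-guided, early in training.
rewards = shaped_rewards([1, 0, 0, 1, 1, 0],
                         [True, True, False, False, False, False],
                         step=100, total_steps=10_000)
print(grpo_advantages(rewards))
```

Under these assumptions, the bonus lets expert-guided rollouts dominate the group advantage early on, while the decay hands control back to verifier rewards alone as training progresses.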
Authors
Hsiu-Yuan Huang
Chenming Tang
Weijie Liu
Saiyong Yang
Yunfang Wu
Submitted
October 30, 2025
Key Contributions
ICPO is a unified framework that enhances RLVR by leveraging the in-context learning capability of Large Reasoning Models (LRMs) to provide expert guidance using existing datasets. It introduces Mixed-Policy GRPO with Implicit Expert Forcing, expanding exploration beyond the current policy distribution without requiring external expert models or significantly increasing computational cost, as sketched below.
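To make the mixed-policy idea concrete, here is a hedged sketch of how a rollout group might combine on-policy samples with in-context expert-forced samples, followed by an Expert Region Reject Sampling filter. The `model.generate` interface, the prompt template, the group sizes, and the acceptance rule are all hypothetical placeholders for illustration only.

```python
def build_mixed_group(model, question, expert_demo, group_size=8, num_guided=2):
    """Assemble a GRPO rollout group: most samples come from the plain prompt
    (on-policy), while a few are generated with an existing expert
    demonstration prepended in-context (implicit expert forcing)."""
    plain_prompt = question
    guided_prompt = f"Example solution:\n{expert_demo}\n\nQuestion:\n{question}"
    group = [(model.generate(plain_prompt), False)
             for _ in range(group_size - num_guided)]
    group += [(model.generate(guided_prompt), True)
              for _ in range(num_guided)]
    return group  # (trajectory, is_expert_guided) pairs

def reject_unreliable(group, policy_logprob, verifier, min_logprob=-80.0):
    """Expert Region Reject Sampling (sketch): keep an expert-guided,
    off-policy trajectory only if the verifier confirms it and it is not
    vanishingly unlikely under the current policy; on-policy samples pass
    through unchanged. Threshold and rule are assumptions."""
    return [(traj, guided) for traj, guided in group
            if not guided
            or (verifier(traj) and policy_logprob(traj) >= min_logprob)]
```

The filtering step reflects the abstract's stated motivation: off-policy trajectories far outside the current policy's support can destabilize optimization, so only reliable ones enter the GRPO group.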
Business Value
Enables the development of more capable and robust reasoning models, leading to better AI assistants, more reliable content generation, and improved performance in complex decision-making tasks, with potentially lower training costs.