Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

📄 Abstract

Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as the cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve reasoning performance comparable to verifiable reward-based methods within only 30% of the steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning. Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
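The contrast between the two credit-assignment forms can be sketched in a few lines. The snippet below is an illustrative sketch, not the released implementation: the function names, the use of NumPy, and treating each reasoning step as one reward entry are assumptions; only the two formulas (gamma-discounted sum of future rewards vs. minimum of future rewards) come from the abstract.

```python
import numpy as np

def sum_form_returns(rewards, gamma=1.0):
    """Canonical summation-form credit assignment: the value at step t is
    the gamma-discounted sum of future process rewards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def min_form_returns(rewards):
    """Min-form credit assignment (as described in the abstract): the value
    at step t is the minimum of future process rewards, so it can never
    exceed the range of the per-step rewards themselves."""
    returns = np.zeros(len(rewards))
    running = np.inf
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns
```

Because the min-form return is bounded by the smallest future reward, inflating a single step's reward cannot raise the values assigned to earlier steps, which is the mechanism the abstract credits for alleviating reward hacking.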
Authors (8)
Jie Cheng
Gang Xiong
Ruixi Qiao
Lijun Li
Chao Guo
Junle Wang
+2 more
Submitted
April 21, 2025
arXiv Category
cs.AI

Key Contributions

The paper identifies the canonical summation-form credit assignment as the main cause of reward hacking in PRM-based reinforcement fine-tuning of LLMs and proposes PURE, which replaces it with a min-form credit assignment. By defining the value as the minimum of future rewards, PURE bounds the value function's range and distributes advantages more reasonably, reaching reasoning performance comparable to verifiable-reward baselines while substantially reducing reward hacking.
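A tiny worked example (hypothetical reward values computed with NumPy; nothing here comes from the paper's experiments) makes the bounded-value argument concrete: an inflated reward at one step raises every earlier value under sum-form credit assignment but leaves the min-form values untouched.

```python
import numpy as np

# Hypothetical per-step process rewards; step 1 (reward 0.9) plays the
# role of a "hacked" step with an inflated reward.
r = np.array([0.2, 0.9, 0.1, 0.8])

# Sum-form value with gamma = 1: cumulative future reward at each step.
sum_form = np.cumsum(r[::-1])[::-1]              # -> [2.0, 1.8, 0.9, 0.8]

# Min-form value: minimum over future rewards at each step.
min_form = np.minimum.accumulate(r[::-1])[::-1]  # -> [0.1, 0.1, 0.1, 0.8]
```

Under the sum form, the inflated 0.9 reward propagates into the values of steps 0 and 1, so a policy can profit from hacking that single step; under the min form, the trajectory's value is pinned to its weakest step (0.1), so the inflated reward buys nothing.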

Business Value

Enables more reliable and robust fine-tuning of LLMs for complex reasoning tasks, reducing the risk of unintended behaviors caused by reward hacking and leading to more trustworthy AI systems.