
Diversity-Aware Policy Optimization for Large Language Model Reasoning


Abstract

The reasoning capabilities of large language models (LLMs) have advanced rapidly, particularly following the release of DeepSeek R1, which has inspired a surge of research into data quality and reinforcement learning (RL) algorithms. Despite the pivotal role diversity plays in RL, its influence on LLM reasoning remains largely underexplored. To bridge this gap, this work presents a systematic investigation into the impact of diversity in RL-based training for LLM reasoning, and proposes a novel diversity-aware policy optimization method. Across evaluations on 12 LLMs, we observe a strong positive correlation between the solution diversity and Potential at k (a novel metric quantifying an LLM's reasoning potential) in high-performing models. This finding motivates our method to explicitly promote diversity during RL training. Specifically, we design a token-level diversity and reformulate it into a practical objective, then we selectively apply it to positive samples. Integrated into the R1-zero training framework, our method achieves a 3.5 percent average improvement across four mathematical reasoning benchmarks, while generating more diverse and robust solutions.
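The abstract does not spell out the token-level diversity objective, so the following is only a minimal sketch of the general idea. It assumes per-token Shannon entropy as the diversity measure, which may differ from the paper's actual formulation, and applies the bonus only to samples with positive reward. The function names, the `coeff` weight, and the `reward > 0` test for a "positive sample" are all hypothetical choices made for illustration.

```python
import math
from typing import List

# One sampled solution = a list of next-token probability distributions,
# one distribution per generated token (hypothetical representation).
Solution = List[List[float]]


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(solution: Solution) -> float:
    """Average per-token entropy over one sampled solution."""
    if not solution:
        return 0.0
    return sum(token_entropy(p) for p in solution) / len(solution)


def diversity_bonus(samples: List[Solution],
                    rewards: List[float],
                    coeff: float = 0.01) -> float:
    """Entropy bonus averaged over positive-reward samples only.

    Assumption: a sample counts as "positive" when its reward is > 0.
    Samples with non-positive reward contribute nothing to the bonus.
    """
    positive = [mean_token_entropy(s)
                for s, r in zip(samples, rewards) if r > 0]
    if not positive:
        return 0.0
    return coeff * sum(positive) / len(positive)


if __name__ == "__main__":
    uncertain = [[0.5, 0.5], [0.25, 0.75]]   # spread-out token choices
    confident = [[1.0, 0.0], [1.0, 0.0]]     # deterministic token choices
    # Only the first sample is rewarded, so only it contributes.
    print(diversity_bonus([uncertain, confident], [1.0, 0.0], coeff=1.0))
```

In an RL training loop, a term like this would typically be added to the policy objective (or subtracted from the loss), so that correct solutions are reinforced without collapsing onto a single way of writing them.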

Authors (5)

Jian Yao

Ran Cheng

Xingyu Wu

Jibin Wu

Kay Chen Tan

Submitted

May 29, 2025

arXiv Category

cs.LG

Key Contributions

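This paper systematically investigates how diversity affects RL-based training for LLM reasoning and proposes a diversity-aware policy optimization method. It introduces a new metric, Potential at k, which quantifies an LLM's reasoning potential, and reports a strong positive correlation between solution diversity and Potential at k among high-performing models in an evaluation of 12 LLMs. Building on that finding, it defines a token-level diversity objective and applies it selectively to positive samples. Integrated into the R1-zero training framework, the method yields a 3.5 percent average improvement across four mathematical reasoning benchmarks.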

Business Value

Enhances the reliability and capability of LLMs for complex reasoning tasks, leading to more robust AI assistants, better content generation, and improved problem-solving tools.

Paper Metadata

Innovation Type

Algorithmic Improvement

Deployment Feasibility

High, as it's a training methodology that can be applied to existing LLMs.

Limitations Addressed

Underexplored influence of diversity on LLM reasoning, Need for improved LLM reasoning capabilities, Limitations in current RL training for LLMs

Technical Tags

LLM reasoning, reinforcement learning (RL), diversity, policy optimization, data quality, Potential at k, token-level diversity, LLM training, DeepSeek R1, solution diversity

Research Topics

LLM Reasoning, Reinforcement Learning for LLMs, AI Alignment, Model Training Strategies, Diversity in AI

Methods & Architectures

Diversity-aware policy optimization, Token-level diversity metric, Selective application to positive samples, Potential at k metric, Large Language Models (LLMs)

Applications & Tasks

Natural Language Processing, Artificial Intelligence Research, Improving LLM reasoning capabilities, Understanding the role of diversity in RL for LLMs, Enhancing LLM performance through training, LLM reasoning, Policy optimization, Improving model performance

Related Fields

Large Language Models, Reinforcement Learning, Artificial Intelligence, Natural Language Processing, Machine Learning

Keywords

LLM, reasoning, diversity, reinforcement learning, RL, policy optimization, language models, training, performance, potential, metric, token, DeepSeek, AI, NLP

Academic Context

#LLM Reasoning, #Reinforcement Learning for LLMs, #AI Alignment, #Model Training Strategies, #Diversity in AI

Commercial Potential

Potential Products

More capable LLM reasoning engines, Specialized LLMs for complex problem-solving, Tools for evaluating and improving LLM reasoning

Target Industries

Technology, Software Development, Research, Education

Use Case Examples

Developing LLMs that can solve complex mathematical problems, Creating AI assistants that can perform multi-step reasoning, Improving the factual accuracy and logical consistency of LLM outputs

Competitive Edge

Introduces a novel approach to enhance LLM reasoning by explicitly incorporating diversity into the RL training process, addressing a gap in current research.

Market Opportunity

Very large, driven by the rapid growth and application of LLMs.

Revenue Models

Licensing of improved LLMs, development of specialized AI reasoning services.

Resource Requirements

Compute Needs

Requires significant compute for RL training of LLMs.

Data Requirements

Requires diverse datasets and reasoning tasks for training and evaluation.

Deployment Constraints

Training complexity and computational cost.

Scalability

Scales with the size of the LLM and the complexity of the reasoning tasks.

Production Readiness

Maturity Level

Research

Time to Market

Medium term, requires integration into LLM training pipelines.
