arxiv_cv · Research Paper · Audience: Reinforcement Learning Researchers, AI Researchers, Robotics Engineers, Developers of advanced AI agents

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

📄 Abstract

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
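The noise-injected exploration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform sampling of the noise level, the clamping to [0, 1], and the representation of an image as nested lists of floats are all assumptions made for illustration.

```python
import random


def inject_gaussian_noise(pixels, noise_level, rng):
    """Return a copy of `pixels` (floats in [0, 1]) with Gaussian noise added.

    `noise_level` is used as the standard deviation of the noise (an assumption);
    a level of 0.0 leaves the image unchanged.
    """
    noisy = []
    for row in pixels:
        # Add zero-mean noise to each pixel, then clamp back into [0, 1].
        noisy.append([min(1.0, max(0.0, p + rng.gauss(0.0, noise_level))) for p in row])
    return noisy


def sample_noisy_rollout_inputs(pixels, group_size, max_noise, seed=0):
    """Build one perturbed copy of the image per rollout in a group.

    Each rollout gets its own noise level, drawn uniformly from [0, max_noise]
    (hypothetical schedule). The level is returned alongside the image so it
    can later inform advantage estimation.
    """
    rng = random.Random(seed)
    inputs = []
    for _ in range(group_size):
        level = rng.uniform(0.0, max_noise)
        inputs.append((level, inject_gaussian_noise(pixels, level, rng)))
    return inputs


if __name__ == "__main__":
    image = [[0.0, 0.5], [1.0, 0.25]]
    for level, noisy in sample_noisy_rollout_inputs(image, group_size=4, max_noise=0.3):
        print(round(level, 3), noisy)
```

Keeping the sampled noise level next to each perturbed input matters because, per the abstract, that level later serves as the prior in advantage estimation.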

Authors (4)

Longtian Qiu

Shan Ning

Jiaxuan Sun

Xuming He

Submitted

October 24, 2025

arXiv Category

cs.CV

Key Contributions

NoisyGRPO is a multimodal RL framework for improving Chain-of-Thought reasoning in MLLMs. It perturbs visual inputs with controllable Gaussian noise to broaden exploration, and it estimates trajectory advantage by Bayesian inference, using the injected noise level as the prior and the observed trajectory reward as the likelihood. The authors report improved generalization beyond the training distribution, especially for small-scale MLLMs such as Qwen2.5-VL 3B.
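One way to picture how a noise-level prior and a reward likelihood might be fused is the sketch below. The summary does not give the paper's actual formulation, so everything here is an assumption: the Gaussian prior and likelihood, the precision-weighted average, the group-wise z-score normalization (borrowed from standard GRPO), and the names `posterior_advantages`, `prior_var`, and `reward_var`.

```python
import statistics


def group_normalize(values):
    """Z-score values within a rollout group; return all zeros if they are identical."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def posterior_advantages(noise_levels, rewards, prior_var=1.0, reward_var=1.0):
    """Fuse a noise-based prior with observed rewards (illustrative only).

    Prior: rollouts with less injected noise are expected to do better, so the
    prior mean is the group-normalized negative noise level.
    Likelihood: the group-normalized observed reward.
    Posterior mean: precision-weighted average of the two, which is the
    standard result for a Gaussian prior combined with a Gaussian likelihood.
    """
    prior = group_normalize([-n for n in noise_levels])
    observed = group_normalize(rewards)
    w_prior = 1.0 / prior_var    # precision of the prior
    w_obs = 1.0 / reward_var     # precision of the observation
    return [
        (w_prior * p + w_obs * r) / (w_prior + w_obs)
        for p, r in zip(prior, observed)
    ]


if __name__ == "__main__":
    levels = [0.0, 0.1, 0.2, 0.3]
    rewards = [1.0, 1.0, 0.0, 0.0]
    print(posterior_advantages(levels, rewards))
```

In this sketch, two rollouts with the same reward receive different advantages when their noise levels differ, so the cleaner input is preferred. That mirrors the stated aim of favoring visually grounded trajectories over noisy ones. Raising `reward_var` shifts weight toward the prior, and raising `prior_var` shifts it toward the observed reward.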

Business Value

Leads to more reliable and adaptable AI systems that can reason effectively in diverse situations, crucial for applications requiring complex decision-making and understanding, such as advanced robotics and AI assistants.

Paper Metadata

Innovation Type

Reinforcement Learning Algorithm

Deployment Feasibility

Moderate. RL training can be complex and computationally intensive. The Bayesian approach adds complexity but potentially improves robustness.

Limitations Addressed

Struggle of existing RL frameworks to generalize beyond the training distribution for CoT reasoning, limited exploration in multimodal RL, instability in advantage estimation

Technical Tags

Reinforcement Learning (RL), Multimodal CoT Reasoning, Noise Injection, Bayesian Estimation, MLLMs, Exploration Policy, Advantage Estimation, Generalization, Visual Inputs, Chain-of-Thought

Research Topics

Reinforcement Learning, Multimodal Reasoning, AI Alignment, Model Generalization, Explainable AI

Methods & Architectures

Noise-Injected Exploration Policy, Bayesian Advantage Estimation, Gaussian Noise Perturbation, Principled Bayesian Inference, Multimodal Large Language Models (MLLMs)

Applications & Tasks

AI Reasoning, Robotics, Autonomous Systems, Natural Language Understanding, Improving generalization of RL for CoT reasoning, Enhancing exploration in multimodal RL, Robust advantage estimation, Handling distribution shifts, Enhancing Chain-of-Thought reasoning in MLLMs, Improving RL training stability and generalization, Developing more robust reasoning agents

Related Fields

Reinforcement Learning, Artificial Intelligence, Computer Vision, Natural Language Processing, Robotics

Keywords

Reinforcement Learning, Multimodal Reasoning, Chain-of-Thought, Noise Injection, Bayesian Estimation, MLLM, Exploration, Advantage Estimation, Generalization, AI Safety, Robotics, Perception

Academic Context

#Reinforcement Learning, #Multimodal Reasoning, #AI Alignment, #Model Generalization, #Explainable AI

Commercial Potential

Potential Products

AI agents with improved reasoning and decision-making capabilities, robotics control systems, advanced dialogue systems

Target Industries

Robotics, Autonomous Systems, Gaming, Customer Service, AI Research

Use Case Examples

Robots learning complex manipulation tasks in varied environments, AI assistants that can follow multi-step instructions, improving the robustness of AI models to noisy sensor data

Competitive Edge

Addresses the critical issue of generalization in multimodal RL for reasoning tasks by combining noise injection for exploration with a principled Bayesian approach to advantage estimation, offering a more robust training methodology.

Market Opportunity

Growing market for advanced AI reasoning and decision-making systems.

Revenue Models

Licensing of RL algorithms, development of AI agent platforms.

Resource Requirements

Compute Needs

High, typical for RL training on complex multimodal tasks.

Data Requirements

Requires environments and reward signals suitable for RL training, often involving simulated or real-world interaction data.

Deployment Constraints

Complexity of RL training, potential for unexpected behaviors, computational resources for inference.

Scalability

Scalability depends on the efficiency of the RL algorithm and the underlying MLLM.

Production Readiness

Maturity Level

Research Algorithm

Time to Market

2-4 years

Patent Potential

Moderate, for the specific combination of noise injection and Bayesian advantage estimation in multimodal RL.
