Abstract
Reinforcement learning (RL) is widely used to produce robust robotic
manipulation policies, but fine-tuning vision-language-action (VLA) models with
RL can be unstable due to inaccurate value estimates and sparse supervision at
intermediate steps. In contrast, imitation learning (IL) is easy to train but
often underperforms due to its offline nature. In this paper, we propose
Hi-ORS, a simple yet effective post-training method that utilizes rejection
sampling to achieve both training stability and high robustness. Hi-ORS
stabilizes value estimation by filtering out negatively rewarded samples during
online fine-tuning, and adopts a reward-weighted supervised training objective
to provide dense intermediate-step supervision. For systematic study, we
develop an asynchronous inference-training framework that supports flexible
online human-in-the-loop corrections, which serve as explicit guidance for
learning error-recovery behaviors. Across three real-world tasks and two
embodiments, Hi-ORS fine-tunes a π-base policy to master contact-rich
manipulation in just 1.5 hours of real-world training, outperforming RL and IL
baselines by a substantial margin in both effectiveness and efficiency.
Notably, the fine-tuned policy exhibits strong test-time scalability by
reliably executing complex error-recovery behaviors to achieve better
performance.
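
To make the core recipe concrete, below is a minimal sketch of the two mechanisms the abstract describes: rejection sampling that drops negatively rewarded samples, and a reward-weighted supervised objective over the retained ones. Everything here is an illustrative assumption rather than the authors' implementation: the Policy class, the hi_ors_step function, the REWARD_THRESHOLD cutoff, and the MSE behavior-cloning loss (a real VLA head may use a different action loss) are all hypothetical.

```python
# Sketch of Hi-ORS-style fine-tuning as described in the abstract:
# filter out negatively rewarded samples, then apply a reward-weighted
# supervised update. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn

REWARD_THRESHOLD = 0.0  # assumed cutoff: reject samples with non-positive reward

class Policy(nn.Module):
    """Stand-in for a VLA policy head mapping observations to actions."""
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)

def hi_ors_step(policy, optimizer, batch):
    """One reward-weighted supervised update on reward-filtered samples.

    batch: dict with 'obs' (B, obs_dim), 'action' (B, act_dim), and
    'reward' (B,) per-step rewards (the dense intermediate supervision).
    """
    keep = batch["reward"] > REWARD_THRESHOLD            # rejection sampling
    if keep.sum() == 0:
        return None                                      # no usable samples
    obs, act, rew = batch["obs"][keep], batch["action"][keep], batch["reward"][keep]
    pred = policy(obs)
    per_sample = ((pred - act) ** 2).mean(dim=-1)        # supervised (BC) loss
    loss = (rew * per_sample).mean()                     # reward weighting
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

policy = Policy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
batch = {"obs": torch.randn(16, 32), "action": torch.randn(16, 7),
         "reward": torch.rand(16) * 2 - 1}               # synthetic rollout data
print(hi_ors_step(policy, opt, batch))
```

Note the design consequence: because every retained sample carries a non-negative weight, the objective stays a stable supervised regression rather than a bootstrapped value estimate, which is the stability argument the abstract makes against RL fine-tuning.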
Authors (5)
Guanxing Lu
Rui Zhao
Haitao Lin
He Zhang
Yansong Tang
Submitted
October 30, 2025
Key Contributions
Introduces Hi-ORS, a post-training method that combines online rejection sampling with reward-weighted supervised training for VLA models in robotic manipulation. It stabilizes training by filtering out negatively rewarded samples and provides dense intermediate-step supervision, while human-in-the-loop corrections serve as explicit guidance for error-recovery behaviors, achieving both stability and robustness.
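
The asynchronous inference-training framework with human-in-the-loop corrections can be pictured as below. This is a toy sketch under stated assumptions, not the paper's framework: the thread layout, the shared queue, and the convention of inserting human corrections as positively rewarded samples are all hypothetical.

```python
# Toy sketch of asynchronous inference-training with human-in-the-loop
# corrections: an actor thread streams rollouts into a shared buffer while
# a learner thread consumes them; human corrections enter the same buffer
# as trusted, positively rewarded samples so they survive rejection
# filtering and can teach error recovery. All details are assumptions.
import queue
import random
import threading
import time

buffer = queue.Queue(maxsize=1000)  # shared rollout/correction buffer

def actor():
    # Inference thread: roll out the current policy and log per-step rewards.
    for step in range(200):
        buffer.put({"source": "policy", "reward": random.uniform(-1, 1),
                    "step": step})
        time.sleep(0.001)

def human_corrections():
    # Operator thread: corrections are tagged with a positive reward.
    for step in range(5):
        buffer.put({"source": "human", "reward": 1.0, "step": step})
        time.sleep(0.02)

def learner():
    # Training thread: reject non-positive rewards, update on the rest.
    kept = 0
    for _ in range(205):                 # consume every produced sample
        sample = buffer.get()
        if sample["reward"] > 0:         # rejection sampling
            kept += 1                    # (reward-weighted update would go here)
    print(f"updated on {kept}/205 samples after rejection filtering")

threads = [threading.Thread(target=f)
           for f in (actor, human_corrections, learner)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```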
Business Value
Enables faster and more reliable deployment of robots for complex manipulation tasks in manufacturing, logistics, and assembly lines, reducing training time and improving operational efficiency.