📄 Abstract
While pre-trained visual representations have significantly advanced
imitation learning, they are often task-agnostic as they remain frozen during
policy learning. In this work, we explore leveraging pre-trained text-to-image
diffusion models to obtain task-adaptive visual representations for robotic
control, without fine-tuning the model itself. However, we find that naively
applying textual conditions - a successful strategy in other vision domains -
yields minimal or even negative gains in control tasks. We attribute this to
the domain gap between the diffusion model's training data and robotic control
environments, leading us to argue for conditions that consider the specific,
dynamic visual information required for control. To this end, we propose ORCA,
which introduces learnable task prompts that adapt to the control environment
and visual prompts that capture fine-grained, frame-specific details. By
facilitating task-adaptive representations with our newly devised conditions,
our approach achieves state-of-the-art performance on various robotic control
benchmarks, significantly surpassing prior methods.
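The sketch below illustrates how such conditioning could look in code, assuming a PyTorch setup: learnable task prompts and frame-specific visual prompts jointly replace the usual text condition of a frozen diffusion backbone, and the extracted features feed a small policy head. The backbone interface, token counts, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ORCAStyleConditioner(nn.Module):
    """Illustrative sketch: condition a frozen diffusion backbone on
    (i) learnable task prompts and (ii) frame-specific visual prompts,
    then feed the resulting features to a small policy head.
    All module names and dimensions here are assumptions."""

    def __init__(self, frozen_backbone, num_task_tokens=8, cond_dim=768,
                 feat_dim=1280, action_dim=7):
        super().__init__()
        # frozen text-to-image diffusion feature extractor (not fine-tuned)
        self.backbone = frozen_backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)

        # (i) learnable task prompts, trained jointly with the policy
        self.task_prompts = nn.Parameter(torch.randn(num_task_tokens, cond_dim) * 0.02)

        # (ii) visual prompt encoder: coarse patch tokens from the current frame
        self.visual_patchify = nn.Conv2d(3, 64, kernel_size=8, stride=8)
        self.visual_proj = nn.Linear(64, cond_dim)

        # policy head over the extracted representation
        self.policy_head = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, action_dim)
        )

    def forward(self, frame):
        B = frame.shape[0]

        # frame-specific visual prompts: (B, N, cond_dim)
        v = self.visual_patchify(frame).flatten(2).transpose(1, 2)
        visual_prompts = self.visual_proj(v)

        # environment-adaptive task prompts shared across the batch: (B, T, cond_dim)
        task_prompts = self.task_prompts.unsqueeze(0).expand(B, -1, -1)

        # joint condition replaces the usual text condition of the diffusion model
        cond = torch.cat([task_prompts, visual_prompts], dim=1)

        # assumed backbone API: returns pooled intermediate features (B, feat_dim)
        feats = self.backbone(frame, cond)
        return self.policy_head(feats)


# toy stand-in for the frozen diffusion feature extractor (assumption)
class DummyBackbone(nn.Module):
    def __init__(self, feat_dim=1280):
        super().__init__()
        self.proj = nn.Linear(3, feat_dim)

    def forward(self, frame, cond):
        return self.proj(frame.mean(dim=(2, 3)))


model = ORCAStyleConditioner(DummyBackbone())
actions = model(torch.randn(2, 3, 224, 224))  # -> (2, 7) predicted actions
```

The paper itself extracts representations from the diffusion model rather than a toy backbone; the sketch only shows how the two prompt types can be combined into a single conditioning sequence in place of textual conditions.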
Authors (5)
Heeseong Shin
Byeongho Heo
Dongyoon Han
Seungryong Kim
Taekyung Kim
Submitted
October 17, 2025
Key Contributions
This paper explores using pre-trained text-to-image diffusion models for robotic control without fine-tuning them. It proposes the ORCA framework, which uses learnable task prompts and frame-specific visual prompts to produce task-adaptive representations, overcoming the limitations of naive textual conditioning caused by the domain gap between the diffusion model's training data and robotic control environments.
Business Value
Enables robots to learn complex tasks more effectively from demonstrations by leveraging powerful pre-trained generative models, potentially accelerating robot learning and deployment.