Abstract
Heterogeneous multirobot systems show great potential in complex tasks
requiring coordinated hybrid cooperation. However, existing methods that rely
on static or task-specific models often lack generalizability across diverse
tasks and dynamic environments. This highlights the need for generalizable
intelligence that can bridge high-level reasoning with low-level execution
across heterogeneous agents. To address this, we propose a hierarchical
multimodal framework that integrates a prompted large language model (LLM) with
a fine-tuned vision-language model (VLM). At the system level, the LLM performs
hierarchical task decomposition and constructs a global semantic map, while the
VLM provides semantic perception and object localization. The proposed GridMask
significantly enhances the VLM's spatial accuracy for reliable fine-grained
manipulation. The aerial robot leverages this global map to
generate semantic paths and guide the ground robot's local navigation and
manipulation, ensuring robust coordination even in target-absent or ambiguous
scenarios. We validate the framework through extensive simulation and
real-world experiments on long-horizon object arrangement tasks, demonstrating
zero-shot adaptability, robust semantic navigation, and reliable manipulation
in dynamic environments. To the best of our knowledge, this work presents the
first heterogeneous aerial-ground robotic system that integrates VLM-based
perception with LLM-driven reasoning for global high-level task planning and
execution.
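As a rough illustration of what generating a path over a global semantic map might involve, the sketch below runs a breadth-first search over a small labeled grid. It is not the authors' method: the grid representation, the cell labels ("free", "obstacle", "cup"), and the function name `plan_semantic_path` are all assumptions made for this example. Returning `None` when no cell carries the requested label stands in, very loosely, for the target-absent case mentioned above.

```python
from collections import deque


def plan_semantic_path(grid, start, goal_label):
    """Breadth-first search over a labeled grid (hypothetical map format).

    grid: list of rows, each cell a string label such as "free" or "obstacle".
    start: (row, col) of the ground robot.
    goal_label: semantic label of the target, e.g. "cup".
    Returns the shortest 4-connected path as a list of (row, col) cells,
    or None if no reachable cell carries goal_label.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        r, c = cell
        if grid[r][c] == goal_label:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and nxt not in parent and grid[nxt[0]][nxt[1]] != "obstacle":
                parent[nxt] = cell
                queue.append(nxt)
    return None  # target absent or unreachable


# Assumed toy map: the only route to the "cup" goes around the obstacle column.
semantic_map = [
    ["free", "obstacle", "cup"],
    ["free", "obstacle", "free"],
    ["free", "free", "free"],
]
path = plan_semantic_path(semantic_map, (0, 0), "cup")
```

In a real system the map would come from perception and the path would feed a local planner; here the grid is hand-written purely to show the idea.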
Authors (7)
Haokun Liu
Zhaoqi Ma
Yunong Li
Junichiro Sugihara
Yicheng Chen
Jinjie Li
+1 more
Advanced Intelligent Systems, Oct. 2025
Key Contributions
Proposes a hierarchical multimodal framework integrating a prompted LLM and a fine-tuned VLM for generalizable intelligence in heterogeneous multirobot systems. The LLM handles task decomposition and global semantic mapping, while the VLM provides semantic perception and object localization, enhanced by GridMask for improved spatial accuracy in manipulation.
Business Value
Enables more robust and adaptable robotic systems for complex tasks like search and rescue, logistics, or environmental monitoring, by allowing robots to understand and act upon high-level instructions in dynamic, real-world scenarios.