Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

Abstract

Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM). At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine-grained manipulation. The aerial robot leverages this global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target-absent or ambiguous scenarios. We validate the framework through extensive simulation and real-world experiments on long-horizon object arrangement tasks, demonstrating zero-shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial-ground robotic system that integrates VLM-based perception with LLM-driven reasoning for global high-level task planning and execution.
Authors (7): Haokun Liu, Zhaoqi Ma, Yunong Li, Junichiro Sugihara, Yicheng Chen, Jinjie Li, +1 more
Submitted: June 5, 2025
arXiv Category: cs.RO
Advanced Intelligent Systems, Oct. 2025

Key Contributions

The paper proposes a hierarchical multimodal framework that integrates a prompted LLM with a fine-tuned VLM to provide generalizable intelligence for heterogeneous multirobot systems. The LLM handles task decomposition and global semantic mapping, while the VLM provides semantic perception and object localization; the proposed GridMask improves the VLM's spatial accuracy for manipulation.
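
To make the division of labor concrete, the following is a minimal sketch of the data flow described above: an instruction is decomposed into subtasks, targets are looked up in a global semantic map, and a path is planned to each target. Everything in it is an illustrative assumption rather than the authors' implementation: the `decompose` function is a hard-coded stand-in for the LLM, `SEMANTIC_MAP` and `GRID` are hypothetical toy data, and breadth-first search is swapped in as a generic planner. The sketch does not model the VLM or GridMask, whose details this summary does not give.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    action: str  # hypothetical action names: "navigate", "pick", "place"
    target: str  # semantic label expected in the global map


def decompose(instruction: str) -> list[Subtask]:
    """Stand-in for the LLM planner (assumption): a fixed rule, not a model call.

    Only handles the hypothetical form "move <object> to <place>".
    """
    words = instruction.lower().split()
    obj, place = words[1], words[3]
    return [
        Subtask("navigate", obj),
        Subtask("pick", obj),
        Subtask("navigate", place),
        Subtask("place", place),
    ]


def plan_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid; cells equal to 1 are obstacles."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start via parent links
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable


# Hypothetical toy world: a 3x4 occupancy grid and a label-to-cell semantic map.
GRID = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
SEMANTIC_MAP = {"cup": (0, 3), "table": (2, 3)}


def execute(instruction, start=(0, 0)):
    """Run each subtask in order and return a log of (action, target, result)."""
    position, log = start, []
    for task in decompose(instruction):
        if task.action != "navigate":
            log.append((task.action, task.target, position))
            continue
        goal = SEMANTIC_MAP.get(task.target)
        # A missing target is only reported here; this sketch has no recovery step.
        path = plan_path(GRID, position, goal) if goal is not None else None
        log.append((task.action, task.target, path))
        if path is not None:
            position = goal
    return log


if __name__ == "__main__":
    for entry in execute("move cup to table"):
        print(entry)
```

Keeping the planner and the map as separate pieces mirrors the hierarchy in the text: the high-level step only emits labels, and the low-level step only needs coordinates.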

Business Value

The framework could enable more robust and adaptable robotic systems for complex tasks such as search and rescue, logistics, and environmental monitoring by letting robots interpret and act on high-level instructions in dynamic, real-world scenarios.