
Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System


Abstract

Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM). At the system level, the LLM performs hierarchical task decomposition and constructs a global semantic map, while the VLM provides semantic perception and object localization, where the proposed GridMask significantly enhances the VLM's spatial accuracy for reliable fine-grained manipulation. The aerial robot leverages this global map to generate semantic paths and guide the ground robot's local navigation and manipulation, ensuring robust coordination even in target-absent or ambiguous scenarios. We validate the framework through extensive simulation and real-world experiments on long-horizon object arrangement tasks, demonstrating zero-shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments. To the best of our knowledge, this work presents the first heterogeneous aerial-ground robotic system that integrates VLM-based perception with LLM-driven reasoning for global high-level task planning and execution.
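The abstract mentions generating semantic paths over a global semantic map but does not say how the map is represented or which planner is used. As a rough illustration only, the sketch below assumes a small 2D grid of text labels and uses plain breadth-first search to reach the nearest cell carrying a requested label. The names `SEMANTIC_MAP` and `find_semantic_path` and the labels "free", "wall", and "cup" are hypothetical and do not come from the paper.

```python
from collections import deque

# Hypothetical semantic map (an assumption, not the paper's representation):
# "free" cells are traversable, "wall" cells are not, and any other label
# names an object located in that cell.
SEMANTIC_MAP = [
    ["free", "free", "wall", "cup"],
    ["free", "wall", "wall", "free"],
    ["free", "free", "free", "free"],
]


def find_semantic_path(grid, start, target_label, blocked=("wall",)):
    """Breadth-first search from start to the nearest cell labelled target_label.

    Returns a list of (row, col) cells from start to the target, or None
    when the target is absent from the map or cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        r, c = cell
        if grid[r][c] == target_label:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and nxt not in parents and grid[nxt[0]][nxt[1]] not in blocked:
                parents[nxt] = cell
                queue.append(nxt)
    return None
```

Calling `find_semantic_path(SEMANTIC_MAP, (0, 0), "cup")` returns a cell sequence that routes around the wall, while a label that is not on the map yields `None`. A caller could treat that `None` as a signal to keep exploring, which is one simple way to handle a target-absent case.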

Authors (7)

Haokun Liu

Zhaoqi Ma

Yunong Li

Junichiro Sugihara

Yicheng Chen

Jinjie Li

+1 more

Submitted

June 5, 2025

arXiv Category

cs.RO

Advanced Intelligent Systems, Oct. 2025


Key Contributions

Proposes a hierarchical multimodal framework integrating a prompted LLM and a fine-tuned VLM for generalizable intelligence in heterogeneous multirobot systems. The LLM handles task decomposition and global semantic mapping, while the VLM provides semantic perception and object localization, enhanced by GridMask for improved spatial accuracy in manipulation.
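To make the task decomposition step more concrete, the sketch below shows one way a planner's reply could be checked before any robot acts on it. It assumes the LLM is prompted to answer with a JSON list of subtasks. The schema, the robot names, and the action vocabulary in `ALLOWED_ACTIONS` are all hypothetical, since this summary does not list the paper's prompt format or action set.

```python
import json
from dataclasses import dataclass

# Hypothetical robots and actions (assumptions for illustration only).
ALLOWED_ACTIONS = {
    "aerial": {"survey", "guide"},
    "ground": {"navigate", "pick", "place"},
}


@dataclass(frozen=True)
class Subtask:
    robot: str
    action: str
    target: str


def parse_plan(llm_reply: str) -> list[Subtask]:
    """Validate a JSON list of subtasks and convert it to Subtask objects.

    Raises ValueError if a step names an unknown robot or an action that
    the named robot does not support.
    """
    plan = []
    for step in json.loads(llm_reply):
        robot, action = step["robot"], step["action"]
        if action not in ALLOWED_ACTIONS.get(robot, set()):
            raise ValueError(f"unsupported step: {robot}/{action}")
        plan.append(Subtask(robot, action, step["target"]))
    return plan


# Hand-written example reply, not real model output.
EXAMPLE_REPLY = """[
  {"robot": "aerial", "action": "survey", "target": "workspace"},
  {"robot": "ground", "action": "navigate", "target": "cup"},
  {"robot": "ground", "action": "pick", "target": "cup"}
]"""

plan = parse_plan(EXAMPLE_REPLY)
```

Rejecting steps outside a fixed vocabulary is a common guard when free-form model output drives hardware. Whether the paper applies a comparable check is not stated here.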

Business Value

Enables more robust and adaptable robotic systems for complex tasks like search and rescue, logistics, or environmental monitoring, by allowing robots to understand and act upon high-level instructions in dynamic, real-world scenarios.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

Moderate. Requires integrating LLM and VLM components, likely with significant computational resources. GridMask is an additional component that must be integrated alongside the VLM.

Limitations Addressed

Lack of generalizability in existing methods for diverse tasks and dynamic environments, and the need to bridge high-level reasoning with low-level execution across heterogeneous agents.

Technical Tags

Hierarchical Language Models, Multimodal Framework, Vision-Language Models, Semantic Navigation, Robotic Systems, Task Decomposition, Object Localization, GridMask, Aerial-Ground Systems, Fine-grained Manipulation

Research Topics

Multi-robot Coordination, Generalizable AI, Human-Robot Interaction, Perception and Control, Semantic Mapping

Methods & Architectures

Hierarchical Task Decomposition, Prompted LLM, Fine-tuned VLM, GridMask, Semantic Path Generation, Large Language Model (LLM), Vision-Language Model (VLM)

Applications & Tasks

Robotics, Autonomous Systems, Aerial Robotics, Ground Robotics, Lack of Generalizability, Dynamic Environments, Heterogeneous Multi-robot Systems, Bridging High-level Reasoning and Low-level Execution, Semantic Navigation, Manipulation, Task Decomposition, Coordinated Cooperation

Related Fields

Artificial Intelligence, Robotics, Computer Vision, Natural Language Processing, Multi-agent Systems

Keywords

multi-robot systems, semantic understanding, language models, vision models, robot control, task planning, aerial robots, ground robots, coordination, generalization, manipulation, navigation, GridMask, LLM, VLM

Academic Context

#Multi-robot Coordination, #Generalizable AI, #Human-Robot Interaction, #Perception and Control, #Semantic Mapping

Technology Stack

Frameworks & Libraries

LLM, VLM

Commercial Potential

Potential Products

Advanced autonomous robotic systems, Intelligent logistics robots, Search and rescue drones

Target Industries

Logistics, Manufacturing, Defense, Agriculture, Emergency Services

Use Case Examples

Coordinated aerial and ground robot exploration, Robots performing complex assembly tasks, Autonomous delivery systems in challenging environments

Competitive Edge

Offers a more generalizable and adaptable approach compared to static or task-specific models for multi-robot systems, enabling complex task execution in dynamic environments.

Market Opportunity

Growing market for autonomous robots and AI-powered systems.

Revenue Models

Service contracts, licensing of technology, hardware sales with integrated AI.

Resource Requirements

Compute Needs

High (for LLM and VLM inference)

Data Requirements

Requires diverse datasets for fine-tuning the VLM, and potentially example data for prompting the LLM.

Deployment Constraints

Real-time performance, computational resources, communication bandwidth between robots.

Scalability

Scalability to more robots and complex tasks depends on the LLM's reasoning capabilities and VLM's perception efficiency.

Regulatory Considerations

Safety standards for autonomous systems; data privacy if environmental data is sensitive.

Production Readiness

Maturity Level

Research

Time to Market

2-5 years

Patent Potential

Moderate (for novel framework components like GridMask or specific integration strategies)
