📄 Abstract
Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital space, with poor generalization to the physical world. Unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks thus remain absent. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities -- cross-space transfer, cross-task learning, and cross-embodiment generalization -- via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families -- MLLMs, ELLMs, VLAs, and GMLMs -- achieving ~6% gains in digital tasks and ~3% in physical tasks.
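
As a rough illustration of the Stage II design described above, the sketch below shows a frozen MLLM backbone feeding an intent-bridging module that pools high-level semantics for a trainable policy head, trained with a simple behavior-cloning loss on demonstrations. All module names, dimensions, the query-attention bridging design, and the loss are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of the Stage II setup: a frozen MLLM provides hidden
# states, an intent-bridging interface distills them into a compact intent
# vector, and only the bridge + policy head receive gradients.
import torch
import torch.nn as nn


class IntentBridge(nn.Module):
    """Pools task-relevant semantics from frozen MLLM hidden states (assumed design)."""

    def __init__(self, mllm_dim: int = 1024, intent_dim: int = 256, num_queries: int = 8):
        super().__init__()
        # Learnable queries attend over the MLLM's hidden states, so no
        # backbone weights are updated.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim))
        self.attn = nn.MultiheadAttention(mllm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(mllm_dim, intent_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, mllm_dim), detached from the frozen backbone
        q = self.queries.unsqueeze(0).expand(mllm_hidden.size(0), -1, -1)
        pooled, _ = self.attn(q, mllm_hidden, mllm_hidden)
        return self.proj(pooled.mean(dim=1))  # (batch, intent_dim)


class PolicyHead(nn.Module):
    """Maps the intent vector plus robot state to low-level actions (assumed design)."""

    def __init__(self, intent_dim: int = 256, state_dim: int = 32, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(intent_dim + state_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, intent: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([intent, state], dim=-1))


# Stage II training step skeleton: gradients flow only into bridge and policy.
bridge, policy = IntentBridge(), PolicyHead()
optimizer = torch.optim.AdamW(
    list(bridge.parameters()) + list(policy.parameters()), lr=1e-4
)

mllm_hidden = torch.randn(4, 64, 1024)   # stand-in for frozen-MLLM hidden states
robot_state = torch.randn(4, 32)         # stand-in proprioceptive state
expert_action = torch.randn(4, 7)        # stand-in demonstration action

pred = policy(bridge(mllm_hidden.detach()), robot_state)
loss = nn.functional.mse_loss(pred, expert_action)
loss.backward()
optimizer.step()
```

The key property this sketch is meant to convey is that the MLLM backbone stays frozen: only the bridging interface and the policy module are optimized on the cross-embodiment demonstration data.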
Authors (18)
Wentao Tan
Bowen Wang
Heng Zhi
Chenyu Liu
Zhe Li
Jian Liu
+12 more
Submitted
October 28, 2025
Key Contributions
Introduces the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that unifies capabilities across digital-physical spaces, tasks, and embodiments. It preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control, addressing key limitations of current MLLMs and VLAs.
Business Value
Enables the development of more versatile and adaptable robots and AI agents that can operate seamlessly in both virtual and real-world environments, accelerating progress in robotics and human-AI collaboration.