
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

📄 Abstract

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm: they cannot see, hear, speak, and act concurrently, nor handle real-time user interruptions dynamically. This hinders seamless embodied collaboration and results in an inflexible, unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and near-real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model", allowing the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking. We further propose a "model-as-controller" paradigm: the VLM is fine-tuned to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments on a physical humanoid platform demonstrate that VITA-E reliably handles complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
Authors (18)
Xiaoyu Liu
Chaoyou Fu
Chi Yan
Chu Wu
Haihan Gao
Yi-Fan Zhang
+12 more
Submitted
October 21, 2025
arXiv Category
cs.RO
arXiv PDF

Key Contributions

Introduces VITA-E, a novel embodied interaction framework enabling concurrent seeing, hearing, speaking, and acting, along with dynamic interruption handling. Its dual-model architecture (Active/Standby) and 'model-as-controller' paradigm allow embodied agents to mimic human-like multitasking for more seamless and responsive interactions.

Business Value

Enables the development of more natural and intuitive embodied AI agents, such as advanced robots and virtual assistants, that can collaborate more effectively with humans in real-time.