Abstract
Current Vision-Language-Action (VLA) models are often constrained by a rigid,
static interaction paradigm: they cannot see, hear, speak, and act
concurrently, nor handle real-time user interruptions dynamically.
This hinders seamless embodied collaboration, resulting in an inflexible and
unresponsive user experience. To address these limitations, we introduce
VITA-E, a novel embodied interaction framework designed for both behavioral
concurrency and nearly real-time interruption. The core of our approach is a
dual-model architecture where two parallel VLA instances operate as an ``Active
Model'' and a ``Standby Model'', allowing the embodied agent to observe its
environment, listen to user speech, provide verbal responses, and execute
actions, all concurrently and interruptibly, mimicking human-like multitasking
capabilities. We further propose a ``model-as-controller'' paradigm, where we
fine-tune the VLM to generate special tokens that serve as direct system-level
commands, coupling the model's reasoning with the system's behavior.
Experiments conducted on a physical humanoid platform demonstrate that VITA-E
can reliably handle complex interactive scenarios. Our framework is compatible
with various dual-system VLA models, achieving an extremely high success rate
on emergency stops and speech interruptions while also successfully performing
concurrent speech and action. This represents a significant step towards more
natural and capable embodied assistants.
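The dual-model design and the "model-as-controller" paradigm described above can be illustrated with a minimal sketch. All names here (the `VLAInstance` and `DualModelController` classes and the `<STOP>`/`<HANDOVER>` tokens) are illustrative assumptions for exposition, not the authors' actual implementation:

```python
# Hypothetical sketch of the Active/Standby dual-model loop. Class names and
# control tokens are assumptions made for illustration, not VITA-E's real API.

class VLAInstance:
    """Stands in for one VLA model instance; a real model would run inference."""
    def __init__(self, name: str):
        self.name = name

    def step(self, observation, user_speech=None) -> str:
        # A real VLA would emit action/speech tokens; here we emit a special
        # control token only when the user interrupts, mimicking the
        # "model-as-controller" paradigm where tokens are system commands.
        if user_speech == "stop":
            return "<STOP>"        # emergency stop: halt the current action
        if user_speech is not None:
            return "<HANDOVER>"    # new instruction: standby model takes over
        return "act"               # keep executing the current task


class DualModelController:
    """Routes each tick to the active model and swaps roles on control tokens."""
    def __init__(self):
        self.active = VLAInstance("A")
        self.standby = VLAInstance("B")

    def tick(self, observation, user_speech=None) -> str:
        token = self.active.step(observation, user_speech)
        if token in ("<STOP>", "<HANDOVER>"):
            # The standby instance, kept warm on the latest context,
            # immediately becomes active, enabling near-real-time interruption.
            self.active, self.standby = self.standby, self.active
        return token


ctrl = DualModelController()
ctrl.tick("frame0")            # normal execution continues
ctrl.tick("frame1", "stop")    # interruption triggers a role swap
```

The point of the sketch is the control flow: because interruption handling is driven by tokens the model itself generates, the system's behavior (stopping, swapping) stays coupled to the model's reasoning rather than to an external rule engine.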
Authors (18)
Xiaoyu Liu
Chaoyou Fu
Chi Yan
Chu Wu
Haihan Gao
Yi-Fan Zhang
+12 more
Submitted
October 21, 2025
Key Contributions
Introduces VITA-E, a novel embodied interaction framework enabling concurrent seeing, hearing, speaking, and acting, along with dynamic interruption handling. Its dual-model architecture (Active/Standby) and "model-as-controller" paradigm allow embodied agents to mimic human-like multitasking for more seamless and responsive interactions.
Business Value
Enables the development of more natural and intuitive embodied AI agents, such as advanced robots and virtual assistants, that can collaborate more effectively with humans in real-time.