📄 Abstract
Human interaction is inherently multimodal and full-duplex: we listen while
watching, speak while acting, and fluidly adapt to turn-taking and
interruptions. Realizing these capabilities is essential for building models
that simulate humans. We present ELLSA (End-to-end Listen, Look, Speak and
Act), which, to our knowledge, is the first full-duplex, end-to-end model that
simultaneously perceives and generates across vision, text, speech, and action
within a single architecture, enabling interaction patterns previously out of
reach and yielding more natural, human-like behaviors. At its core is a novel
SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each
modality to specialized experts and fuses them through a unified attention
backbone. This provides a generalizable solution for joint multimodal
perception and concurrent generation, leveraging strong pre-trained components
while enabling efficient modality integration and mitigating modality
interference. On speech-interaction and robot-manipulation benchmarks, ELLSA
matches modality-specific baselines, while uniquely supporting advanced
multimodal and full-duplex behaviors such as dialogue and action turn-taking,
defective instruction rejection, speaking-while-acting, context-grounded visual
question answering, and action barge-ins. We contend that ELLSA represents a
step toward more natural and general interactive intelligence, contributing to
the broader pursuit of artificial general intelligence. All data, code, and
model checkpoints will be released upon acceptance.
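The abstract does not spell out how SA-MoE combines a shared attention backbone with per-modality experts, so the following is only a minimal sketch of one plausible reading: hard routing of each token to a modality-specific feed-forward expert beneath a shared self-attention layer. All names here (SAMoEBlock, modality_ids) and the routing scheme are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a modality-routed SA-MoE block. The class name,
# the hard per-token routing, and the expert layout are assumptions made
# for illustration, not ELLSA's actual code.
import torch
import torch.nn as nn

class SAMoEBlock(nn.Module):
    """One Transformer block: shared self-attention over the interleaved
    multimodal sequence, then per-modality feed-forward experts."""

    def __init__(self, d_model=512, n_heads=8,
                 modalities=("vision", "text", "speech", "action")):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.modalities = modalities
        # One expert FFN per modality; routing is by token modality tag
        # rather than a learned gate (an assumption for this sketch).
        self.experts = nn.ModuleDict({
            m: nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for m in modalities
        })

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) ints
        # indexing self.modalities for each token.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # unified attention backbone
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for i, m in enumerate(self.modalities):
            mask = modality_ids == i  # route each token to its modality expert
            if mask.any():
                out[mask] = self.experts[m](h[mask])
        return x + out

# Usage on a toy interleaved multimodal sequence:
block = SAMoEBlock()
tokens = torch.randn(2, 16, 512)        # 16 interleaved multimodal tokens
ids = torch.randint(0, 4, (2, 16))      # per-token modality tags
out = block(tokens, ids)                # (2, 16, 512)
```

Under this reading, the shared attention fuses information across modalities while the per-modality experts preserve specialized, pre-trained processing, which is consistent with the abstract's claim of efficient modality integration with reduced interference.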
Authors (7)
Siyin Wang
Wenyi Yu
Xianzhao Chen
Xiaohai Tian
Jun Zhang
Lu Lu
+1 more
Submitted
October 19, 2025
Key Contributions
Presents ELLSA, the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture. Utilizing a novel SA-MoE architecture, ELLSA enables human-like interaction patterns, fluid turn-taking, and concurrent generation, overcoming limitations of models that handle modalities separately or sequentially.
Business Value
Enables the creation of highly interactive and natural AI systems, such as advanced virtual assistants, more capable robots, and realistic virtual agents, enhancing user experience and opening up new applications.