
End-to-end Listen, Look, Speak and Act

Abstract

Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models simulating humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture, enabling interaction patterns previously out of reach and yielding more natural, human-like behaviors. At its core is a novel SA-MoE architecture (Self-Attention Mixture-of-Experts) that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines, while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code and model checkpoints will be released upon acceptance.
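The abstract describes SA-MoE only at a high level: each modality's tokens are routed to a modality-specific expert, with a shared self-attention backbone fusing information across modalities. As a rough, toy illustration of that routing idea only (the layer structure, dimensions, and all names below are assumptions for the sketch, not the paper's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy hidden size

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ExpertFFN:
    """Modality-specific feed-forward expert (one per modality)."""
    def __init__(self):
        self.w1 = rng.normal(scale=0.02, size=(D, 4 * D))
        self.w2 = rng.normal(scale=0.02, size=(4 * D, D))

    def __call__(self, x):
        return np.maximum(x @ self.w1, 0.0) @ self.w2  # ReLU MLP

def self_attention(x):
    """Shared single-head attention backbone over ALL tokens,
    so modalities attend to each other in one sequence."""
    scores = x @ x.T / np.sqrt(D)
    return softmax(scores) @ x

# One expert per modality named in the abstract.
experts = {m: ExpertFFN() for m in ("vision", "text", "speech", "action")}

def sa_moe_layer(tokens, modality_ids):
    """tokens: (T, D) array; modality_ids: modality name per token.
    Fuse via shared attention, then route each token to its expert."""
    fused = self_attention(tokens)
    routed = np.stack([experts[m](t) for t, m in zip(fused, modality_ids)])
    return tokens + routed  # residual connection

tokens = rng.normal(size=(6, D))
mods = ["vision", "vision", "speech", "speech", "action", "text"]
out = sa_moe_layer(tokens, mods)
print(out.shape)  # (6, 16): same sequence, modality-specialized update
```

The key property the sketch captures is that the attention step is modality-agnostic (all tokens share one sequence), while the feed-forward step is hard-routed by modality; how ELLSA actually initializes experts from pre-trained components or handles concurrent generation is not inferable from this page.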
Authors (7)
Siyin Wang
Wenyi Yu
Xianzhao Chen
Xiaohai Tian
Jun Zhang
Lu Lu
+1 more
Submitted
October 19, 2025
arXiv Category
cs.AI
arXiv PDF

Key Contributions

Presents ELLSA, the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture. Utilizing a novel SA-MoE architecture, ELLSA enables human-like interaction patterns, fluid turn-taking, and concurrent generation, overcoming limitations of models that handle modalities separately or sequentially.

Business Value

Enables the creation of highly interactive and natural AI systems, such as advanced virtual assistants, more capable robots, and realistic virtual agents, enhancing user experience and enabling new applications.