OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Abstract

In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
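
To make the idea of timestamped text tokens more concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the Chrono-Text Markup format, so the sketch assumes a simple scheme in which each word is placed at the video frame where it starts and every other frame carries a padding marker. The names TimedToken and frame_aligned_text, the [PAD] marker, and the frame rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimedToken:
    """A word plus its start time in seconds (hypothetical structure)."""
    text: str
    start_s: float


def frame_aligned_text(tokens, fps=25.0, pad="[PAD]"):
    """Lay tokens out on a per-frame timeline.

    Each token lands on the frame where it starts, or on the next free
    frame if an earlier token already occupies that slot. Frames with no
    token are filled with the pad marker. This is an assumed toy scheme,
    not the format used by OmniResponse.
    """
    timeline = []
    for tok in sorted(tokens, key=lambda t: t.start_s):
        # Never write before the end of the timeline, so earlier tokens
        # are not overwritten when two words start in the same frame.
        idx = max(int(tok.start_s * fps), len(timeline))
        timeline.extend([pad] * (idx - len(timeline)))
        timeline.append(tok.text)
    return timeline


example = [TimedToken("yeah", 0.0), TimedToken("I", 0.5), TimedToken("see", 0.5)]
print(frame_aligned_text(example, fps=4.0))
```

In this toy layout the text stream has one slot per video frame, which is one plausible way a text token sequence could share a clock with facial-response frames and a downstream TTS module.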

Authors (5)

Cheng Luo

Jianghui Wang

Bing Li

Siyang Song

Bernard Ghanem

Submitted

May 27, 2025

arXiv Category

cs.CV

Key Contributions

Introduces Online Multimodal Conversational Response Generation (OMCRG), a novel task for synchronized verbal and non-verbal listener feedback. Proposes OmniResponse, an MLLM that leverages text as an intermediate modality and incorporates Chrono-Text Markup and TempoVoice for precise timestamping and synchronized speech/facial response generation.

Business Value

Enables more natural and engaging human-computer interactions, potentially improving virtual assistants, telepresence systems, and robotic companions by allowing them to provide more human-like, synchronized feedback.

Paper Metadata

Innovation Type

Task definition and model architecture

Deployment Feasibility

Requires significant computational resources for MLLMs and real-time processing, but the online nature suggests potential for interactive applications.

Limitations Addressed

Challenges in aligning generated audio with listeners' facial responses in real-time dyadic interactions.

Technical Tags

Multimodal Large Language Model, Online Response Generation, Synchronized Speech and Facial Expression, Text-to-Speech, Autoregressive Generation, Chrono-Text Markup, TempoVoice, Dyadic Interactions

Research Topics

Multimodal AI, Conversational AI, Human-Computer Interaction, Speech Synthesis, Affective Computing

Methods & Architectures

Autoregressive generation, Text-to-Speech (TTS), Multimodal fusion, Timestamping, Multimodal Large Language Model (MLLM)

Applications & Tasks

Human-Computer Interaction, Virtual Assistants, Robotics, Telepresence, Generating synchronized verbal and non-verbal feedback, Aligning generated audio with facial responses, Online response generation, Multimodal conversational response generation, Listener feedback generation, Synchronized audio-visual response

Datasets & Benchmarks

Datasets

ResponseNet

Related Fields

Natural Language Processing, Computer Vision, Speech Processing, Affective Computing, Human-Robot Interaction

Keywords

Multimodal AI, Conversational AI, Response Generation, Dyadic Interaction, Synchronized Speech, Facial Expressions, Text-to-Speech, Large Language Models, Online Generation, Human-Computer Interaction, Affective Computing, Real-time AI

Academic Context

#Multimodal AI, #Conversational AI, #Human-Computer Interaction, #Speech Synthesis, #Affective Computing

Commercial Potential

Potential Products

Advanced virtual assistants, Interactive chatbots with emotional expression, Telepresence robots with realistic feedback

Target Industries

Technology, Customer Service, Entertainment, Robotics

Use Case Examples

A virtual assistant that nods and verbally responds in sync with a user's query. A robot companion that provides empathetic, synchronized feedback during a conversation.

Competitive Edge

Addresses the gap in generating synchronized multimodal responses online, going beyond text-only or audio-only generation by integrating facial expressions.

Market Opportunity

Growing market for advanced conversational AI and HCI solutions.

Revenue Models

Licensing of technology, development of specialized AI services.

Resource Requirements

Compute Needs

High (for MLLM training and inference)

Data Requirements

Large-scale dyadic interaction datasets with synchronized multimodal data.

Deployment Constraints

Real-time processing latency, computational cost, need for synchronized input streams.

Scalability

Scalability depends on the underlying MLLM architecture and efficient inference techniques.

Production Readiness

Maturity Level

Research

Time to Market

Long (requires further development and integration)

Patent Potential

Moderate (novel task definition, specific model components)
