SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

📄 Abstract

Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.

Authors (3)

Kadri Hacioglu

Manjunath K E

Andreas Stolcke

Submitted

October 17, 2025

arXiv Category

cs.CL

Key Contributions

This paper applies SpeechLLMs to large-scale contextualized zero-shot slot filling in spoken language understanding (SLU). It establishes an empirical upper bound for the task, identifies performance, robustness, and generalization gaps, and proposes improvements to the training data, architecture, and training strategies to narrow the gap with that upper bound. The work demonstrates the potential of SpeechLLMs for unified, generative, zero-shot SLU.
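To make the zero-shot, instruction-following setup concrete, here is a minimal sketch of how slot labels could be supplied at inference time and how a generated answer could be parsed. It is an illustration only, not the paper's method: the slot names, the prompt wording, the JSON output format, and the helper names `build_instruction` and `parse_slots` are all assumptions, and the model response is a hard-coded stand-in.

```python
import json

# Hypothetical slot inventory: label -> natural-language description.
# In a zero-shot setting these labels need not have appeared in training.
SLOTS = {
    "departure_city": "city the caller wants to leave from",
    "travel_date": "date of travel",
}


def build_instruction(slots):
    """Render the slot inventory as a text instruction (assumed prompt format)."""
    lines = [
        "Extract the following slots from the audio.",
        "Return a JSON object; use null for slots that are not mentioned.",
    ]
    for label, description in slots.items():
        lines.append(f"- {label}: {description}")
    return "\n".join(lines)


def parse_slots(generated_text, slots):
    """Keep only requested labels with non-empty values; tolerate bad JSON."""
    try:
        data = json.loads(generated_text)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {k: v for k, v in data.items() if k in slots and v not in (None, "")}


# Stand-in for a model response; no real model is called here.
fake_output = '{"departure_city": "Boston", "travel_date": null, "mood": "happy"}'

print(build_instruction(SLOTS))
print(parse_slots(fake_output, SLOTS))
```

Because the slot inventory is plain data passed in the instruction, adding a new slot label in this sketch means editing `SLOTS` rather than retraining, which is the kind of flexibility the zero-shot framing refers to.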

Business Value

Enables more flexible and efficient spoken language understanding systems, reducing the need for extensive labeled data for new tasks and improving user experience in voice-enabled applications.

Paper Metadata

Innovation Type

Methodological and Empirical

Deployment Feasibility

Moderate. While SpeechLLMs show promise, large model sizes and computational requirements can be a barrier. Zero-shot capabilities reduce deployment overhead for new tasks.

Limitations Addressed

Performance gaps in zero-shot slot filling, Generalization to unseen slot labels, Data and compute inefficiency of traditional cascaded SLU systems, Lack of unified speech understanding models

Performance Gains

Substantial improvements in slot filling performance through proposed measures, narrowing the gap towards the empirical upper bound.
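As a worked illustration of what "gap to the upper bound" can mean numerically, the sketch below scores predictions with micro-averaged F1 over (label, value) pairs and subtracts a system's score from an upper-bound score. This summary does not say which metric the paper reports, so the metric choice is an assumption, and the utterances, the two systems, and the function name `slot_f1` are toy inventions rather than results from the paper.

```python
def slot_f1(predictions, references):
    """Micro-averaged F1 over exact-match (label, value) pairs."""
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_pairs = set(pred.items())
        ref_pairs = set(ref.items())
        tp += len(pred_pairs & ref_pairs)   # correct label and value
        fp += len(pred_pairs - ref_pairs)   # predicted but wrong or spurious
        fn += len(ref_pairs - pred_pairs)   # reference slots that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
references = [{"city": "Boston", "date": "Friday"}, {"city": "Austin"}]
system_predictions = [{"city": "Boston"}, {"city": "Dallas"}]
upper_bound_predictions = [{"city": "Boston", "date": "Friday"}, {"city": "Austin"}]

system_score = slot_f1(system_predictions, references)
upper_bound_score = slot_f1(upper_bound_predictions, references)
gap = upper_bound_score - system_score
print(f"system F1={system_score:.2f}, upper bound F1={upper_bound_score:.2f}, gap={gap:.2f}")
```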

Technical Tags

SpeechLLMs, Spoken Language Understanding (SLU), Slot Filling, Zero-shot Learning, Contextualization, Large Language Models, Multimodal AI, Generative Models, Instruction Following

Research Topics

Speech Understanding, Natural Language Understanding (NLU), Zero-shot Learning, Multimodal AI, Large Language Models

Methods & Architectures

SpeechLLMs (integrating speech and text foundation models), Generative approach, Instruction following, Data augmentation, Architecture improvements, Training strategy optimization

Applications & Tasks

Spoken Dialogue Systems, Virtual Assistants, Customer Service Automation, Speech Recognition and Understanding, Improving zero-shot slot filling performance, Generalizing to unseen slot labels, Unified speech understanding, Reducing data and compute efficiency gaps, Slot filling, Spoken Language Understanding (SLU), Zero-shot generalization

Related Fields

Natural Language Processing, Speech Processing, Machine Learning, Artificial Intelligence, Multimodal AI

Keywords

SpeechLLMs, spoken language understanding, SLU, slot filling, zero-shot, large language models, LLM, multimodal, contextualization, generative, instruction following, NLU

Academic Context

#Speech Understanding, #Natural Language Understanding (NLU), #Zero-shot Learning, #Multimodal AI, #Large Language Models

Commercial Potential

Potential Products

Next-generation voice assistants, Automated customer service platforms, Intelligent dialogue systems, Speech-to-meaning APIs

Target Industries

Technology, Telecommunications, Customer Service, Automotive, Healthcare

Use Case Examples

Enabling a voice assistant to understand complex, multi-turn commands without prior training for specific intents. Building a customer service chatbot that can handle a wide range of spoken queries with minimal adaptation. Developing in-car voice control systems that generalize to new commands.

Competitive Edge

Presents a unified, generative SpeechLLM approach that aims for greater flexibility and zero-shot capability than traditional cascaded SLU systems, while addressing efficiency concerns.

Market Opportunity

Rapidly growing market for voice AI and conversational interfaces.

Revenue Models

API access to SpeechLLM capabilities, integration into SaaS products, licensing of specialized models.

Resource Requirements

Compute Needs

High, typical for large multimodal models.

Data Requirements

Large-scale speech and text datasets, potentially augmented for zero-shot generalization.

Deployment Constraints

Model size, inference latency, and computational resources required for deployment.

Scalability

Scalability depends on the underlying SpeechLLM architecture and available infrastructure; zero-shot nature aids scalability in terms of task adaptation.

Regulatory Considerations

Data privacy for voice data; potential biases in speech recognition and understanding.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years for robust commercial products.

Patent Potential

Moderate