
SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling

Abstract

Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for performing speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improves performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.
Authors: Kadri Hacioglu, Manjunath K E, Andreas Stolcke
Submitted: October 17, 2025
arXiv Category: cs.CL

Key Contributions

This paper applies speechLLMs to large-scale contextualized zero-shot slot filling in spoken language understanding (SLU). It establishes an empirical upper bound for the task, identifies performance, robustness, and generalization gaps, and proposes improvements to the training data, architecture, and training strategies to narrow the gap to that bound. The work demonstrates the potential of speechLLMs for unified, generative, zero-shot SLU.
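
To make the task concrete, the following is a minimal sketch of slot filling framed as instruction following, where the slot labels are supplied at inference time rather than fixed during training. It is not the paper's setup: the prompt wording, the JSON output format, the slot names, and the helpers `build_slot_prompt` and `parse_slot_response` are all illustrative assumptions, and the speech input and the model call itself are omitted.

```python
import json
from typing import Dict, Optional


def build_slot_prompt(slots: Dict[str, str]) -> str:
    """Build a text instruction listing slot labels and their descriptions.

    The wording and layout are hypothetical, not taken from the paper.
    """
    lines = [
        "Extract the following slots from the spoken utterance.",
        "Return a JSON object mapping each slot label to its value,",
        "or null if the slot is not mentioned.",
        "Slots:",
    ]
    for label, description in slots.items():
        lines.append(f"- {label}: {description}")
    return "\n".join(lines)


def parse_slot_response(response: str, slots: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Parse a model reply, keeping only the requested slot labels.

    Malformed or non-object replies are treated as "no slots found".
    """
    try:
        raw = json.loads(response)
    except json.JSONDecodeError:
        raw = {}
    if not isinstance(raw, dict):
        raw = {}
    # Labels the model invented are dropped; missing labels become None.
    return {label: raw.get(label) for label in slots}


# Hypothetical slot inventory, unseen at training time in a zero-shot setting.
slots = {
    "destination_city": "city the caller wants to travel to",
    "travel_date": "date of the planned trip",
}
prompt = build_slot_prompt(slots)

# Stand-in for a model reply; a real system would generate this from audio.
reply = '{"destination_city": "Boston", "seat_class": "economy"}'
filled = parse_slot_response(reply, slots)
```

Here `filled` keeps `destination_city`, sets `travel_date` to `None`, and discards the unrequested `seat_class` key. Passing the slot inventory in the prompt is what lets new labels be added without retraining in this sketch.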

Business Value

SpeechLLM-based slot filling could enable more flexible and efficient spoken language understanding systems, reduce the need for extensive labeled data for new tasks, and improve the user experience in voice-enabled applications.