📄 Abstract
Integration of audio perception into large language models (LLMs) is an
emerging research area for enabling machine listening applications, yet
efficient transfer of rich audio semantics from audio encoders to LLMs remains
underexplored. The most widely used integration paradigm projects the audio
encoder output tokens into the LLM input space (e.g., via an MLP or a
Q-Former), then prepends them to or inserts them among the text tokens. We refer to this
generic scheme as Prepend to the LLM's input token space (PLITS) integration.
We propose an efficient alternative, Lightweight Audio LLM Integration (LAL).
LAL introduces audio representations solely via the attention mechanism within
different layers of the LLM, bypassing its feedforward module. LAL encodes rich
audio semantics at an appropriate level of abstraction for integration into
different blocks of LLMs. Our design significantly reduces computational
overhead compared to existing integration approaches. Observing that the
Whisper speech encoder benefits from PLITS integration, we propose an audio
encoder aware approach for efficiently Probing Audio encoders via LLM (PAL),
which employs PLITS integration for Whisper and LAL for general audio encoders.
Under an identical training curriculum, LAL consistently matches or
outperforms existing integration approaches across multiple base LLMs and
tasks. For general audio tasks, LAL improves performance by up to 30% over a
strong PLITS baseline while reducing memory usage by up to 64.1% and
increasing throughput by up to 247.5%. Furthermore, for a general
audio-music-speech LLM,
PAL performs on par with a fully PLITS integration-based system but with
substantially improved computational and memory efficiency. Project page:
https://ta012.github.io/PAL/
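The PyTorch sketch below contrasts the two schemes described in the abstract. It is a minimal illustration, not the authors' implementation: all dimensions and names (plits_inputs, LALBlock, audio_kv) are hypothetical, and the specific choice of injecting audio as extra keys/values inside self-attention is one plausible reading of "solely via the attention mechanism, bypassing the feedforward module."

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; real systems use the LLM's actual sizes.
d_text, d_audio, n_heads = 512, 256, 8

# --- PLITS: project audio tokens into the LLM input space, then prepend ---
audio_proj = nn.Linear(d_audio, d_text)  # an MLP or Q-Former in practice

def plits_inputs(audio_tokens, text_embeds):
    """audio_tokens: (B, Ta, d_audio); text_embeds: (B, Tt, d_text).
    The prepended audio tokens then pass through every attention AND
    feedforward module of the LLM, inflating compute and KV-cache size."""
    return torch.cat([audio_proj(audio_tokens), text_embeds], dim=1)

# --- LAL (assumed form): audio enters only as extra keys/values in attention ---
class LALBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.audio_kv = nn.Linear(d_audio, d_text)  # per-block audio projection
        self.ffn = nn.Sequential(nn.Linear(d_text, 4 * d_text), nn.GELU(),
                                 nn.Linear(4 * d_text, d_text))
        self.norm1 = nn.LayerNorm(d_text)
        self.norm2 = nn.LayerNorm(d_text)

    def forward(self, text, audio):
        # text: (B, Tt, d_text); audio: (B, Ta, d_audio)
        q = self.norm1(text)
        # Audio contributes keys/values only; text positions remain the
        # queries, so the residual stream stays at length Tt.
        kv = torch.cat([self.audio_kv(audio), q], dim=1)
        h = text + self.attn(q, kv, kv, need_weights=False)[0]
        # The feedforward never processes audio tokens.
        return h + self.ffn(self.norm2(h))

if __name__ == "__main__":
    audio = torch.randn(2, 25, d_audio)     # e.g., audio encoder output tokens
    text = torch.randn(2, 40, d_text)
    print(plits_inputs(audio, text).shape)  # (2, 65, 512): sequence grows
    print(LALBlock()(text, audio).shape)    # (2, 40, 512): text length only
```

Under this reading, the efficiency gains follow directly: audio tokens never occupy query positions, so the feedforward cost and the output sequence length scale with the text length alone rather than with text plus audio.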
Authors (7)
Tony Alex
Wish Suharitdamrong
Sara Atito
Armin Mustafa
Philip J. B. Jackson
Imran Razzak
+1 more
Key Contributions
Proposes Lightweight Audio LLM Integration (LAL), an efficient method for transferring audio semantics into LLMs by integrating audio representations via the attention mechanism across different LLM layers, bypassing the feedforward module. This significantly reduces computational overhead compared to standard PLITS integration.
Business Value
Enables the development of more efficient and capable multimodal AI systems that can understand and process both text and audio, leading to enhanced AI assistants, better transcription services, and new forms of human-computer interaction.