📄 Abstract
Large language models have revolutionized natural language processing through
self-supervised pretraining on massive datasets. Inspired by this success,
researchers have explored adapting these methods to speech by discretizing
continuous audio into tokens using neural audio codecs. However, existing
approaches face limitations, including high bitrates, the loss of either
semantic or acoustic information, and the reliance on multi-codebook designs
when trying to capture both, which increases architectural complexity for
downstream tasks. To address these challenges, we introduce FocalCodec, an
efficient low-bitrate codec based on focal modulation that utilizes a single
binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec
delivers competitive performance in speech resynthesis and voice conversion at
lower bitrates than the current state-of-the-art, while effectively handling
multilingual speech and noisy environments. Evaluation on downstream tasks
shows that FocalCodec successfully preserves sufficient semantic and acoustic
information, while also being well-suited for generative modeling. Demo samples
and code are available at https://lucadellalib.github.io/focalcodec-web/.
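As a rough illustration of how a single binary codebook yields such low bitrates, the sketch below sign-quantizes latent vectors into D-bit codes and computes bits per second as bits per token times token rate. The 13-bit code size and the 12.5–50 Hz token rates are assumptions chosen only to be consistent with the quoted 0.16–0.65 kbps range, not values taken from the paper.

```python
import torch

def binary_quantize(latents: torch.Tensor) -> torch.Tensor:
    """Sign-quantize each latent dimension to {-1, +1}, so a D-dimensional
    latent becomes a D-bit binary code (a single codebook of 2**D entries)."""
    return torch.where(latents >= 0.0, 1.0, -1.0)

def bitrate_kbps(code_bits: int, token_rate_hz: float) -> float:
    """Bits per second = bits per token * tokens per second."""
    return code_bits * token_rate_hz / 1000.0

# Hypothetical settings consistent with the quoted 0.16-0.65 kbps range:
# 13-bit binary codes emitted at 12.5, 25, and 50 tokens per second.
codes = binary_quantize(torch.randn(1, 50, 13))  # one second of speech at 50 Hz
for rate_hz in (12.5, 25.0, 50.0):
    print(f"{rate_hz:>5} Hz x 13 bits = {bitrate_kbps(13, rate_hz):.2f} kbps")
# 12.5 Hz -> 0.16 kbps, 25.0 Hz -> 0.33 kbps, 50.0 Hz -> 0.65 kbps
```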
Authors (4)
Luca Della Libera
Francesco Paissan
Cem Subakan
Mirco Ravanelli
Submitted
February 6, 2025
Key Contributions
FocalCodec introduces an efficient low-bitrate speech codec built on focal modulation and a single binary codebook, achieving competitive performance at bitrates as low as 0.16 kbps. By relying on one codebook rather than the multi-codebook designs of existing codecs, it reduces architectural complexity for downstream tasks while delivering strong quality in speech resynthesis and voice conversion and handling multilingual speech and noisy environments effectively.
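For readers unfamiliar with focal modulation, the block below is a minimal 1D PyTorch sketch of the general mechanism (a query multiplied by a modulator built from gated, hierarchically convolved contexts), adapted from FocalNets. The dimensions, number of focal levels, and kernel sizes are arbitrary assumptions for illustration and do not reproduce FocalCodec's actual architecture.

```python
import torch
import torch.nn as nn

class FocalModulation1d(nn.Module):
    """Minimal 1D focal modulation block: output = proj(query * modulator),
    where the modulator sums depthwise-convolved contexts of growing extent,
    each weighted by a learned gate."""

    def __init__(self, dim: int, focal_levels: int = 3, kernel_size: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection yields the query, the context, and per-level gates.
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        self.focal_convs = nn.ModuleList()
        for level in range(focal_levels):
            k = kernel_size + 2 * level  # receptive field grows with level
            self.focal_convs.append(nn.Sequential(
                nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim, bias=False),
                nn.GELU(),
            ))
        self.to_modulator = nn.Conv1d(dim, dim, 1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        dim = x.size(-1)
        q, ctx, gates = torch.split(
            self.proj_in(x), [dim, dim, self.focal_levels + 1], dim=-1
        )
        ctx = ctx.transpose(1, 2)      # (batch, dim, time) for convolutions
        gates = gates.transpose(1, 2)  # (batch, levels + 1, time)
        ctx_sum = torch.zeros_like(ctx)
        for level, conv in enumerate(self.focal_convs):
            ctx = conv(ctx)
            ctx_sum = ctx_sum + ctx * gates[:, level:level + 1]
        # Global (mean-pooled) context acts as the final focal level.
        ctx_sum = ctx_sum + ctx.mean(dim=2, keepdim=True) * gates[:, self.focal_levels:]
        modulator = self.to_modulator(ctx_sum).transpose(1, 2)  # (batch, time, dim)
        return self.proj_out(q * modulator)

# Usage: a batch of 2 sequences, 100 frames, 64-dim features.
block = FocalModulation1d(dim=64)
out = block(torch.randn(2, 100, 64))  # -> shape (2, 100, 64)
```

Unlike self-attention, this mechanism aggregates context with convolutions before modulating each query, which keeps the cost linear in sequence length.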
Business Value
Enables significant cost savings in data transmission and storage for voice communication and audio streaming services, especially in bandwidth-constrained environments.