AIPapers.ai - AI Research Papers Daily

Today's Speech & Audio Research Top Papers

Wednesday, November 5, 2025

📊 Read Full Intelligence Reports:

Condition-Invariant fMRI Decoding of Speech Intelligibility with Deep State Space Model

Introduces a deep state space model for condition-invariant fMRI decoding of speech intelligibility. Achieves state-of-the-art performance, demonstrating condition-invariant neural codes across diverse listening environments for speech processing.

Data-driven Learning of Interaction Laws in Multispecies Particle Systems with Gaussian Processes: Convergence Theory and Applications

Develops a Gaussian process framework to learn interaction kernels in multispecies particle systems from trajectory data. Establishes convergence theory for single-species systems and extends to second-order models, enabling better multiscale modeling.

EchoLSTM: A Self-Reflective Recurrent Network for Stabilizing Long-Range Memory

Proposes EchoLSTM, a self-reflective recurrent network using output-conditioned gating to stabilize long-range memory. Enhances memory retention in sequences with noisy or misleading information, improving performance over standard LSTMs.

Interpretable end-to-end Neurosymbolic Reinforcement Learning agents

Instantiates the SCoBots framework for interpretable neurosymbolic RL agents, decomposing tasks into interpretable representations. Addresses shortcut learning in deep RL by using object-centric states, improving generalization.

ProtoTSNet: Interpretable Multivariate Time Series Classification With Prototypical Parts

Presents ProtoTSNet for interpretable multivariate time series classification using prototypical parts. Enhances ProtoPNet for critical domains like industry and medicine, providing accurate and understandable decisions.

CFL: On the Use of Characteristic Function Loss for Domain Alignment in Machine Learning

Introduces Characteristic Function Loss (CFL) for domain alignment in machine learning. Addresses distribution shift by learning models that perform well in real-world scenarios, improving robustness.


arxiv_ml

Condition-Invariant fMRI Decoding of Speech Intelligibility with Deep State Space Model

Abstract: Clarifying the neural basis of speech intelligibility is critical for computational neuroscience and digital speech processing. Recent neuroimaging studies have shown that intelligibility modulates cortical activity beyond simple acoustics,...

#Computational Neuroscience#Speech Processing#Brain-Computer Interfaces#Machine Learning#Neuroimaging

17 hours ago

92%

arxiv_ml

Affordable EEG, Actionable Insights: An Open Dataset and Evaluation Framework for Epilepsy Patient Stratification

Abstract: Access to clinical multi-channel EEG remains limited in many regions worldwide. We present NEUROSKY-EPI, the first open dataset of single-channel, consumer-grade EEG for epilepsy, collected in a South Asian clinical setting along with rich ...

#Medical AI#Biomedical Signal Processing#Epilepsy Diagnosis#Patient Stratification#Accessible Health Technology

17 hours ago

50%

arxiv_ml

QuPCG: Quantum Convolutional Neural Network for Detecting Abnormal Patterns in PCG Signals

Abstract: Early identification of abnormal physiological patterns is essential for the timely detection of cardiac disease. This work introduces a hybrid quantum-classical convolutional neural network (QCNN) designed to classify S3 and murmur abnorma...

#Quantum Machine Learning#Biomedical Signal Processing#Cardiology#Pattern Recognition#Hybrid Quantum-Classical Models

17 hours ago

80%

arxiv_ml

NeuroClean: A Generalized Machine-Learning Approach to Neural Time-Series Conditioning

Abstract: Electroencephalography (EEG) and local field potentials (LFP) are two widely used techniques to record electrical activity from the brain. These signals are used in both the clinical and research domains for multiple applications. However, ...

#Automated Brain Signal Preprocessing#Unsupervised Artifact Removal in EEG/LFP#Improving Reproducibility in Neuroscience#Machine Learning for Biomedical Signals

17 hours ago

65%

arxiv_ai

Prevailing Research Areas for Music AI in the Era of Foundation Models

Abstract: Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI communi...

#Music AI Research Frontiers#Foundation Models in Music#AI-Generated Music#Model Efficiency and Controllability#Multimodal Music Systems

17 hours ago

95%

arxiv_ml

H-Infinity Filter Enhanced CNN-LSTM for Arrhythmia Detection from Heart Sound Recordings

Abstract: Early detection of heart arrhythmia can prevent severe future complications in cardiac patients. While manual diagnosis still remains the clinical standard, it relies heavily on visual interpretation and is inherently subjective. In recent ...

#Biomedical Signal Processing#Deep Learning for Healthcare#Time Series Analysis#Medical Diagnosis#Signal Filtering

17 hours ago

85%

arxiv_ai

The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity

Abstract: While generative models for music composition are increasingly capable, their adoption by musicians is hindered by text-prompting, an asynchronous workflow disconnected from the embodied, responsive nature of instrumental performance. To ad...

#Music Generation#Human-AI Interaction#Generative AI#Interactive Systems#Music Technology

17 hours ago

90%

arxiv_ai

Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play

Abstract: Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current ...

#Speech Generation Evaluation#Human-AI Interaction#Benchmark Design#Multimodal AI#Natural Language Processing

17 hours ago

92%

arxiv_ai

Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study

Abstract: Emotion recognition from speech plays a vital role in the development of empathetic human-computer interaction systems. This paper presents a comparative analysis of lightweight transformer-based models, DistilHuBERT and PaSST, by classifyi...

#Speech Processing#Emotion Recognition#Machine Learning#Deep Learning#Human-Computer Interaction

17 hours ago

95%

arxiv_cv

As Good as It KAN Get: High-Fidelity Audio Representation

Abstract: Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture u...

#Audio Representation Learning#Implicit Neural Representations#Generative Models#Deep Learning Architectures#Speech Synthesis

1 day ago

90%

arxiv_cl

KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization

Abstract: This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST)...

KIT #Low-resource Speech Translation#Data Augmentation#Model Adaptation#Cross-lingual Transfer#Speech Processing

1 day ago

85%

arxiv_cv

Merlin L48 Spectrogram Dataset

Abstract: In the single-positive multi-label (SPML) setting, each image in a dataset is labeled with the presence of a single class, while the true presence of other classes remains unknown. The challenge is to narrow the performance gap between this...

#Machine Learning#Computer Vision (applied to audio)#Data Annotation#Dataset Creation#Multi-label Classification#Audio Analysis

1 day ago

75%

arxiv_cl

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Abstract: ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the Parl...

#Corpus Linguistics#Speech Processing#Natural Language Processing#Slavic Languages#Linguistic Annotation

1 day ago

85%

arxiv_ml

MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Abstract: Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improv...

#speech processing#machine translation#multilingual AI#medical informatics#low-resource NLP

1 day ago

90%

arxiv_ml

ADNAC: Audio Denoiser using Neural Audio Codec

Abstract: Audio denoising is critical in signal processing, enhancing intelligibility and fidelity for applications like restoring musical recordings. This paper presents a proof-of-concept for adapting a state-of-the-art neural audio codec, the Desc...

#High-fidelity audio denoising#Adapting neural audio codecs for denoising#Generative audio restoration#Improving intelligibility and fidelity

1 day ago

90%

arxiv_ml

NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

Abstract: Everyday speech conveys far more than words, it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richnes...

#Speech Processing#Natural Language Processing#Artificial Intelligence#Human-Computer Interaction#Affective Computing

1 day ago

90%

arxiv_ml

Motion-Robust Multimodal Fusion of PPG and Accelerometer Signals for Three-Class Heart Rhythm Classification

Abstract: Atrial fibrillation (AF) is a leading cause of stroke and mortality, particularly in elderly patients. Wrist-worn photoplethysmography (PPG) enables non-invasive, continuous rhythm monitoring, yet suffers from significant vulnerability to m...

#Biomedical Signal Processing#Machine Learning#Wearable Health Monitoring#Multimodal Learning#Cardiology

1 day ago

80%

arxiv_cv

Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

Abstract: Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various ...

#Speech Processing#Multimodal AI#Signal Enhancement#Acoustic Signal Processing#Computer Vision for Audio

2 days ago

95%

arxiv_cv

LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar

Abstract: Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) ...

#Biomedical Signal Processing#Non-contact Sensing#Deep Learning for Signal Reconstruction#Cardiac Monitoring

2 days ago

80%

arxiv_ml

Representing Classical Compositions through Implication-Realization Temporal-Gestalt Graphs

Abstract: Understanding the structural and cognitive underpinnings of musical compositions remains a key challenge in music theory and computational musicology. While traditional methods focus on harmony and rhythm, cognitive models such as the Impli...

#Computational Musicology#Music Theory#Cognitive Musicology#Machine Learning for Music#Pattern Recognition

2 days ago

90%

arxiv_ml

Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition

Abstract: Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduc...

Saguenay St. Lawrence Marine Park Research Station #Bioacoustics#Signal Processing#Machine Learning for Audio#Marine Mammal Monitoring#Noise Reduction

2 days ago

80%

arxiv_ml

ESTformer: Transformer utilising spatiotemporal dependencies for electroencephalogram super-resolution

Abstract: Towards practical applications of Electroencephalography (EEG), lightweight acquisition devices garner significant attention. However, EEG channel selection methods are commonly data-sensitive and cannot establish a unified sound paradigm f...

#Signal Processing#Biomedical Signal Analysis#Deep Learning#Time Series Analysis#Neuroscience Applications

2 days ago

70%

arxiv_ai

Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication

Abstract: Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on dec...

#Brain-Computer Interfaces (BCI)#Speech Synthesis#Neural Signal Decoding#Assistive Technology#Rehabilitation Engineering

2 days ago

90%

arxiv_ai

Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features

Abstract: Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu...

#Affective Computing#Speech Processing#Machine Learning for Low-Resource Languages#Model Generalization#Feature Engineering

2 days ago

95%

arxiv_ai

Expressive Range Characterization of Open Text-to-Audio Models

Abstract: Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most ...

#Generative Audio#Text-to-Audio Synthesis#Content Generation#AI Evaluation#Multimodal AI

2 days ago

95%

arxiv_cv

Exploring the correlation between the type of music and the emotions evoked: A study using subjective questionnaires and EEG

Abstract: The subject of this work is to check how different types of music affect human emotions. While listening to music, a subjective survey and brain activity measurements were carried out using an EEG helmet. The aim is to demonstrate the impac...

#Affective Computing#Music Psychology#Neuroscience#Signal Processing#Human-Computer Interaction

5 days ago

90%

arxiv_cl

SP-MCQA: Evaluating Intelligibility of TTS Beyond the Word Level

Abstract: The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension ...

#Speech Synthesis#Text-to-Speech#Evaluation Metrics#Human-Computer Interaction#Natural Language Processing

5 days ago

92%

arxiv_ai

Speak & Spell: LLM-Driven Controllable Phonetic Error Augmentation for Robust Dialogue State Tracking

Abstract: Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Aut...

#Improving Robustness of Dialogue Systems#Handling ASR Errors in Dialogue#Data Augmentation for NLP#LLM Applications in Speech Processing#Dialogue State Tracking

5 days ago

95%

arxiv_cl

Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Abstract: Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works...

#Speech Recognition#Representation Learning#Noise Robustness#Disentangled Representations#Audio Signal Processing

6 days ago

95%

arxiv_cl

POWSM: A Phonetic Open Whisper-Style Speech Foundation Model

Abstract: Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P...

#Spoken Language Processing#Speech Foundation Models#Multitask Learning#Low-Resource NLP#Phonetics

6 days ago

95%

