Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams

📄 Abstract

Topic modeling is a key component in unsupervised learning, employed to identify topics within a corpus of textual data. The rapid growth of social media generates an ever-growing volume of textual data daily, making online topic modeling methods essential for managing data streams that arrive continuously over time. This paper introduces a novel approach to online topic modeling named StreamETM. The approach builds on the Embedded Topic Model (ETM) to handle data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. Additionally, an online change point detection algorithm is employed to identify shifts in topics over time, revealing significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show StreamETM outperforming competitors. The code is publicly available at https://github.com/fgranese/StreamETM.
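
To make the merging step in the abstract more concrete, here is a minimal sketch of aligning two sets of topic embeddings with entropic unbalanced optimal transport. It is not the authors' implementation: the cosine cost, the uniform topic weights, the `mass_threshold` rule, and the function names `cosine_cost`, `unbalanced_sinkhorn`, and `merge_topics` are all assumptions made for illustration. The only ingredient taken from the source is the idea of matching topics from consecutive batches while letting some topics stay unmatched.

```python
import numpy as np

def cosine_cost(X, Y):
    """Pairwise cosine distance between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def unbalanced_sinkhorn(a, b, C, eps=0.05, tau=0.1, n_iter=500):
    """Entropic OT with KL-relaxed marginals (generalized Sinkhorn scaling).

    eps is the entropic regularization, tau the marginal relaxation strength.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = tau / (tau + eps)  # exponent < 1 softens the marginal constraints
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

def merge_topics(old, new, mass_threshold=0.5, **sinkhorn_kwargs):
    """Hypothetical merge rule: average matched topics, append unmatched ones."""
    a = np.full(len(old), 1.0 / len(old))  # assumed uniform topic weights
    b = np.full(len(new), 1.0 / len(new))
    P = unbalanced_sinkhorn(a, b, cosine_cost(old, new), **sinkhorn_kwargs)
    merged = old.copy()
    unmatched = []
    for j in range(len(new)):
        # Mass transported to new topic j, relative to its weight b[j].
        relative_mass = P[:, j].sum() / b[j]
        if relative_mass >= mass_threshold:
            i = int(P[:, j].argmax())          # best-matching old topic
            merged[i] = 0.5 * (merged[i] + new[j])
        else:
            unmatched.append(new[j])           # treated as a newly emerged topic
    if unmatched:
        merged = np.vstack([merged, np.array(unmatched)])
    return merged, P

# Toy example: the first new topic is close to an old one, the second is far from both.
old_topics = np.array([[1.0, 0.0], [0.0, 1.0]])
new_topics = np.array([[0.99, 0.1], [-1.0, -1.0]])
merged_topics, plan = merge_topics(old_topics, new_topics)
print(merged_topics.shape)
```

The relaxed marginals are what allow the second new topic to receive almost no transported mass, so this sketch appends it as a new topic instead of forcing a match, which a balanced transport plan would do.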
Authors (4)
Federica Granese
Benjamin Navet
Serena Villata
Charles Bouveyron
Submitted
April 10, 2025
arXiv Category
cs.LG
Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2025, Porto, Portugal, September 15-19, 2025, Proceedings, Part VII

Key Contributions

Introduces StreamETM, an online topic modeling approach for data streams. It merges Embedded Topic Models learned on consecutive partial document batches using unbalanced optimal transport, and it applies an online change point detection algorithm to identify topic shifts over time. In numerical experiments on simulated and real-world data, StreamETM outperforms competing methods.
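
The summary does not say which online change point detector is used, so the following is only a generic illustration of the idea: a two-sided CUSUM detector over a scalar stream. The choice of CUSUM, the function name `cusum_change_points`, its parameters, and the notion of monitoring one score per batch (for instance, a distance between consecutive topic models) are assumptions, not details from the paper.

```python
from typing import Iterable, List

def cusum_change_points(
    stream: Iterable[float],
    warmup: int = 5,
    drift: float = 0.05,
    threshold: float = 0.5,
) -> List[int]:
    """Two-sided CUSUM; returns the indices at which an alarm fires.

    The baseline mean is estimated on `warmup` samples, then frozen.
    After each alarm the detector restarts and re-estimates the baseline.
    """
    alarms: List[int] = []
    n, mean = 0, 0.0
    pos = neg = 0.0  # cumulative evidence for upward / downward shifts
    for t, x in enumerate(stream):
        if n < warmup:
            n += 1
            mean += (x - mean) / n  # running mean during warm-up
            continue
        deviation = x - mean
        pos = max(0.0, pos + deviation - drift)
        neg = max(0.0, neg - deviation - drift)
        if pos > threshold or neg > threshold:
            alarms.append(t)
            n, mean, pos, neg = 0, 0.0, 0.0, 0.0  # restart after a change
    return alarms

# Hypothetical per-batch scores: stable around 0.1, then a jump to 0.9.
scores = [0.1] * 10 + [0.9] * 10
print(cusum_change_points(scores))
```

Each sample is processed once with constant memory, which is the property that matters for a stream. The `drift` and `threshold` values trade detection delay against false alarms and would need tuning on real data.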

Business Value

Enables businesses to gain real-time insights from evolving text data sources like social media, news feeds, or customer feedback, allowing for agile responses to market trends.
