Stick-Breaking Embedded Topic Model with Continuous Optimal Transport for Online Analysis of Document Streams

📄 Abstract

Online topic models are unsupervised algorithms to identify latent topics in data streams that continuously evolve over time. Although these methods naturally align with real-world scenarios, they have received considerably less attention from the community compared to their offline counterparts, due to specific additional challenges. To tackle these issues, we present SB-SETM, an innovative model extending the Embedded Topic Model (ETM) to process data streams by merging models formed on successive partial document batches. To this end, SB-SETM (i) leverages a truncated stick-breaking construction for the topic-per-document distribution, enabling the model to automatically infer from the data the appropriate number of active topics at each timestep; and (ii) introduces a merging strategy for topic embeddings based on a continuous formulation of optimal transport adapted to the high dimensionality of the latent topic space. Numerical experiments show SB-SETM outperforming baselines on simulated scenarios. We extensively test it on a real-world corpus of news articles covering the Russian-Ukrainian war throughout 2022-2023.
Authors: Federica Granese, Serena Villata, Charles Bouveyron
Submitted: October 21, 2025
arXiv category: cs.LG

Key Contributions

This paper introduces SB-SETM, an online topic model that extends the Embedded Topic Model (ETM) to data streams by merging models trained on successive document batches. It combines two components: a truncated stick-breaking construction for the per-document topic distribution, which lets the model infer the number of active topics at each timestep from the data, and a merging strategy for topic embeddings based on a continuous formulation of optimal transport suited to the high-dimensional latent topic space.
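
To make the stick-breaking idea concrete, here is a minimal sketch of a generic truncated stick-breaking map, not the authors' implementation. It turns K-1 break fractions into K topic proportions that sum to one. The function name, the Beta(1, 2) draws used for the fractions, the truncation level K = 10, and the 0.05 threshold for calling a topic "active" are all assumptions made for illustration; the paper may parameterize and infer these quantities differently.

```python
import numpy as np


def truncated_stick_breaking(v):
    """Map K-1 break fractions in (0, 1) to K weights that sum to 1.

    Hypothetical helper for illustration only.
    """
    v = np.asarray(v, dtype=float)
    # Stick length remaining before each break: 1, (1-v1), (1-v1)(1-v2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    weights = np.empty(v.size + 1)
    # Topic k takes fraction v_k of whatever stick is left at step k.
    weights[:-1] = v * remaining[:-1]
    # The last topic absorbs the leftover stick (the truncation).
    weights[-1] = remaining[-1]
    return weights


rng = np.random.default_rng(0)
K = 10  # truncation level (assumed)
v = rng.beta(1.0, 2.0, size=K - 1)  # break fractions (assumed prior)
theta = truncated_stick_breaking(v)

# Count topics whose proportion exceeds an arbitrary threshold (assumed).
active = int(np.sum(theta > 0.05))
print(theta.round(3), "active topics:", active)
```

Because each break removes a fraction of the remaining stick, later topics tend to receive smaller weights, so a generous truncation level can be set while only a few topics carry meaningful mass. This is the general mechanism by which a stick-breaking prior lets the effective number of topics adapt to the data.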

Business Value

SB-SETM could support near-real-time tracking of evolving topics in large text streams such as news feeds, social media, or customer feedback, which in turn allows faster trend analysis and decision-making.