📄 Abstract
Language models trained with a fixed vocabulary struggle to generalize to
novel or out-of-vocabulary words, limiting their flexibility in handling
diverse token combinations. Existing dynamic vocabulary approaches attempt to
address this limitation but face challenges such as fragmented codebases, lack
of support for modern LLMs, and limited inference scalability. To overcome
these issues, we introduce DVAGen, a fully open-source, unified framework
designed for training, evaluation, and visualization of dynamic
vocabulary-augmented language models. Our framework modularizes the pipeline
for ease of customization, integrates seamlessly with open-source LLMs, and is
the first to provide both CLI and WebUI tools for real-time result inspection.
We validate the effectiveness of dynamic vocabulary methods on modern LLMs and
demonstrate support for batch inference, significantly improving inference
throughput.
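To make the core idea concrete, here is a minimal toy sketch of a dynamic vocabulary: a fixed base vocabulary (as in a standard LM tokenizer) extended at inference time with phrase-level entries, so novel multi-word terms map to single tokens instead of unknown-token fallbacks. All names here are hypothetical illustrations, not DVAGen's actual API.

```python
# Toy dynamic vocabulary: a fixed base vocab plus phrase entries added
# on the fly. Class and method names are illustrative, not DVAGen's API.

class DynamicVocab:
    def __init__(self, base_tokens):
        # Fixed base vocabulary, as in a standard LM tokenizer.
        self.token_to_id = {t: i for i, t in enumerate(base_tokens)}

    def add_phrases(self, phrases):
        # Dynamic entries: whole phrases become single tokens, so
        # out-of-vocabulary multi-word terms need not be fragmented.
        for p in phrases:
            if p not in self.token_to_id:
                self.token_to_id[p] = len(self.token_to_id)

    def encode(self, text):
        # Greedy longest-match tokenization over all known entries.
        words = text.split()
        ids, i = [], 0
        while i < len(words):
            for j in range(len(words), i, -1):
                span = " ".join(words[i:j])
                if span in self.token_to_id:
                    ids.append(self.token_to_id[span])
                    i = j
                    break
            else:
                # Unknown word: fall back to the <unk> token.
                ids.append(self.token_to_id["<unk>"])
                i += 1
        return ids

vocab = DynamicVocab(["<unk>", "the", "model", "uses"])
print(vocab.encode("the model uses dynamic vocabulary"))  # → [1, 2, 3, 0, 0]
vocab.add_phrases(["dynamic vocabulary"])
print(vocab.encode("the model uses dynamic vocabulary"))  # → [1, 2, 3, 4]
```

After `add_phrases`, the previously out-of-vocabulary phrase is encoded as one token rather than two `<unk>` placeholders; real dynamic vocabulary methods additionally learn embeddings for such entries.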
Authors (7)
Wei Du
Nuowei Liu
Jie Wang
Jiahao Kuang
Tao Ji
Xiaoling Wang
+1 more
Submitted
October 20, 2025
Key Contributions
This paper introduces DVAGen, a fully open-source, unified framework for training, evaluating, and visualizing dynamic vocabulary-augmented language models. It addresses challenges such as fragmented codebases and lack of support for modern LLMs by modularizing the pipeline, integrating with open-source LLMs, and providing both CLI and WebUI tools, while demonstrating improved inference scalability through batch inference.
Business Value
Enables LLMs to handle a wider range of text data, including specialized domains or evolving language, leading to more robust and versatile NLP applications.