
DVAGen: Dynamic Vocabulary Augmented Generation

Abstract

Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
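To make the fixed-vocabulary limitation concrete, here is a minimal sketch (not DVAGen's implementation) of the kind of vocabulary extension that dynamic vocabulary methods build on. The model name "gpt2" and the sample phrase are illustrative assumptions; DVAGen's actual mechanism for constructing and injecting dynamic vocabulary entries is described in the paper.

```python
# Minimal sketch (not DVAGen's implementation): a fixed tokenizer splits a
# novel phrase into subwords; registering it as a new token and resizing the
# embedding matrix is the basic operation dynamic vocabulary methods extend.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

phrase = "dynamic vocabulary augmentation"          # illustrative phrase
# With the fixed vocabulary, the phrase is fragmented into several subwords.
print(tokenizer.tokenize(phrase))

# Add the phrase as a single vocabulary entry and grow the embedding table
# so the model can treat it as one unit.
num_added = tokenizer.add_tokens([phrase])
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} token(s); new vocab size = {len(tokenizer)}")
```

A static extension like this has to be fixed before training; the dynamic vocabulary setting the abstract describes instead allows such entries to vary at inference time.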
Authors (7)
Wei Du
Nuowei Liu
Jie Wang
Jiahao Kuang
Tao Ji
Xiaoling Wang
+1 more
Submitted
October 20, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

This paper introduces DVAGen, a fully open-source, unified framework for training, evaluating, and visualizing dynamic vocabulary-augmented language models. It addresses fragmented codebases and the lack of support for modern LLMs by modularizing the pipeline, integrating with open-source LLMs, and providing both CLI and WebUI tools for real-time inspection, and it demonstrates improved inference scalability through batch inference.
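The throughput claim rests on batched inference. The sketch below is a generic batched-generation example with Hugging Face transformers, not DVAGen's API; the model name, prompts, and generation settings are illustrative assumptions, and it only shows why decoding several prompts in one padded forward pass is faster than looping over them.

```python
# Generic batched-generation sketch (not DVAGen's API): padding a batch of
# prompts and generating for all of them at once is what typically drives
# the throughput gains attributed to batch inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # illustrative model
tokenizer.pad_token = tokenizer.eos_token              # gpt2 has no pad token
tokenizer.padding_side = "left"                        # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "Dynamic vocabularies let language models",
    "Out-of-vocabulary words often cause",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=20,
        pad_token_id=tokenizer.pad_token_id,
    )

for text in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(text)
```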

Business Value

Enables LLMs to handle a wider range of text data, including specialized domains or evolving language, leading to more robust and versatile NLP applications.