
DVAGen: Dynamic Vocabulary Augmented Generation

Abstract

Language models trained with a fixed vocabulary struggle to generalize to novel or out-of-vocabulary words, limiting their flexibility in handling diverse token combinations. Existing dynamic vocabulary approaches attempt to address this limitation but face challenges such as fragmented codebases, lack of support for modern LLMs, and limited inference scalability. To overcome these issues, we introduce DVAGen, a fully open-source, unified framework designed for training, evaluation, and visualization of dynamic vocabulary-augmented language models. Our framework modularizes the pipeline for ease of customization, integrates seamlessly with open-source LLMs, and is the first to provide both CLI and WebUI tools for real-time result inspection. We validate the effectiveness of dynamic vocabulary methods on modern LLMs and demonstrate support for batch inference, significantly improving inference throughput.
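To make the fixed-vocabulary limitation concrete, here is a minimal sketch (not DVAGen's implementation) of the kind of vocabulary extension that dynamic vocabulary methods build on. The model name "gpt2" and the sample phrase are illustrative assumptions; DVAGen's actual mechanism for constructing and injecting dynamic vocabulary entries is described in the paper.

```python
# Minimal sketch (not DVAGen's implementation): a fixed tokenizer splits a
# novel phrase into subwords; registering it as a new token and resizing the
# embedding matrix is the basic operation dynamic vocabulary methods extend.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

phrase = "dynamic vocabulary augmentation"          # illustrative phrase
# With the fixed vocabulary, the phrase is fragmented into several subwords.
print(tokenizer.tokenize(phrase))

# Add the phrase as a single vocabulary entry and grow the embedding table
# so the model can treat it as one unit.
num_added = tokenizer.add_tokens([phrase])
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} token(s); new vocab size = {len(tokenizer)}")
```

A static extension like this has to be fixed before training; the dynamic vocabulary setting the abstract describes instead allows such entries to vary at inference time.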
Authors (7)
Wei Du
Nuowei Liu
Jie Wang
Jiahao Kuang
Tao Ji
Xiaoling Wang
+1 more
Submitted
October 20, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

This paper introduces DVAGen, a fully open-source, unified framework for training, evaluating, and visualizing dynamic vocabulary-augmented language models. It addresses fragmented codebases and the lack of support for modern LLMs by modularizing the pipeline, integrating with open-source LLMs, and providing both CLI and WebUI tools for real-time inspection, and it demonstrates improved inference scalability through batch inference.
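The throughput claim rests on batched inference. The sketch below is a generic batched-generation example with Hugging Face transformers, not DVAGen's API; the model name, prompts, and generation settings are illustrative assumptions, and it only shows why decoding several prompts in one padded forward pass is faster than looping over them.

```python
# Generic batched-generation sketch (not DVAGen's API): padding a batch of
# prompts and generating for all of them at once is what typically drives
# the throughput gains attributed to batch inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # illustrative model
tokenizer.pad_token = tokenizer.eos_token              # gpt2 has no pad token
tokenizer.padding_side = "left"                        # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = [
    "Dynamic vocabularies let language models",
    "Out-of-vocabulary words often cause",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=20,
        pad_token_id=tokenizer.pad_token_id,
    )

for text in tokenizer.batch_decode(out, skip_special_tokens=True):
    print(text)
```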

Business Value

Enables LLMs to handle a wider range of text data, including specialized domains or evolving language, leading to more robust and versatile NLP applications.