
PairUni: Pairwise Training for Unified Multimodal Language Models

📄 Abstract

Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making them difficult to balance during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GRPO, a pair-aware variant of Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs, named PairUG, for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements across various UVLMs, outperforming strong UVLM RL baselines. Code: github.com/Haochen-Wang409/PairUni
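The abstract describes advantage modulation at a high level: a standard group-normalized GRPO advantage is scaled by a per-pair similarity score. The sketch below illustrates that idea only; the function name, the exact modulation form (a simple multiplicative scale), and the example values are assumptions, not the paper's formula.

```python
import statistics

def pair_grpo_advantages(rewards, similarity):
    """Hypothetical pair-aware GRPO advantage.

    rewards: per-rollout rewards for one sampled group.
    similarity: alignment score in [0, 1] for the UG pair the group came from.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    # Standard GRPO: advantage is the group-normalized reward.
    base = [(r - mu) / sigma for r in rewards]
    # Pair-aware modulation: well-aligned pairs contribute more strongly.
    return [similarity * a for a in base]

adv = pair_grpo_advantages([1.0, 0.0, 0.5, 0.5], similarity=0.9)
```

Because the similarity acts as a uniform scale within a group, it changes the magnitude of the policy-gradient signal without changing which rollouts are favored.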
Authors (9)
Jiani Zheng
Zhiyang Teng
Xiangtai Li
Anran Wang
Yu Tian
Kunpeng Qiu
+3 more
Submitted
October 29, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

This paper proposes PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs to better train unified vision-language models (UVLMs). It uses GPT-o3 for data augmentation to create aligned pairs and retrieved pairs, exposing cross-task semantic correspondences. A new RL variant, Pair-GRPO, leverages this paired structure for consistent policy learning.
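The retrieved pairs mentioned above link each generation sample to a semantically related understanding example. A minimal sketch of one plausible retrieval step, nearest-neighbor search by cosine similarity over precomputed embeddings, is shown below; the embedding source and the helper names are assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_pair(gen_embedding, und_embeddings):
    """Return (index, score) of the most similar understanding sample."""
    scores = [cosine(gen_embedding, e) for e in und_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The returned score can also double as the pair's similarity weight during optimization, which is consistent with the advantage modulation the paper describes.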

Business Value

Enables the development of more capable and versatile multimodal AI systems that can both understand and generate content, leading to richer applications in areas like image captioning, visual question answering, and image generation.