
PairUni: Pairwise Training for Unified Multimodal Language Models

📄 Abstract

Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making them difficult to balance during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GRPO, a pair-aware variant of Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs, named PairUG, for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements across various UVLMs, outperforming strong UVLM RL baselines. Code: github.com/Haochen-Wang409/PairUni
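The abstract describes advantage modulation at a high level: a standard group-normalized GRPO advantage is scaled by a per-pair similarity score. The sketch below illustrates that idea only; the function name, the exact modulation form (a simple multiplicative scale), and the example values are assumptions, not the paper's formula.

```python
import statistics

def pair_grpo_advantages(rewards, similarity):
    """Hypothetical pair-aware GRPO advantage.

    rewards: per-rollout rewards for one sampled group.
    similarity: alignment score in [0, 1] for the UG pair the group came from.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    # Standard GRPO: advantage is the group-normalized reward.
    base = [(r - mu) / sigma for r in rewards]
    # Pair-aware modulation: well-aligned pairs contribute more strongly.
    return [similarity * a for a in base]

adv = pair_grpo_advantages([1.0, 0.0, 0.5, 0.5], similarity=0.9)
```

Because the similarity acts as a uniform scale within a group, it changes the magnitude of the policy-gradient signal without changing which rollouts are favored.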
Authors (9)
Jiani Zheng
Zhiyang Teng
Xiangtai Li
Anran Wang
Yu Tian
Kunpeng Qiu
+3 more
Submitted
October 29, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

This paper proposes PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs to better train unified vision-language models (UVLMs). It uses GPT-o3 for data augmentation to create aligned pairs and retrieved pairs, exposing cross-task semantic correspondences. A new RL variant, Pair-GRPO, leverages this paired structure for consistent policy learning.
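The retrieved pairs mentioned above link each generation sample to a semantically related understanding example. A minimal sketch of one plausible retrieval step, nearest-neighbor search by cosine similarity over precomputed embeddings, is shown below; the embedding source and the helper names are assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_pair(gen_embedding, und_embeddings):
    """Return (index, score) of the most similar understanding sample."""
    scores = [cosine(gen_embedding, e) for e in und_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The returned score can also double as the pair's similarity weight during optimization, which is consistent with the advantage modulation the paper describes.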

Business Value

Enables the development of more capable and versatile multimodal AI systems that can both understand and generate content, leading to richer applications in areas like image captioning, visual question answering, and image generation.