📄 Abstract
Unified vision-language models (UVLMs) must perform both understanding and
generation within a single architecture, but these tasks rely on heterogeneous
data and supervision, making it difficult to balance them during reinforcement
learning (RL). We propose PairUni, a unified framework that reorganizes data
into understanding-generation (UG) pairs and aligns optimization accordingly.
We first use GPT-o3 to augment single-task data, generating captions for
understanding samples and question-answer (QA) pairs for generation samples,
forming aligned pairs from the same instance. Additionally, for each generation
sample, we retrieve a semantically related understanding example to form a
retrieved pair, linking different but related data points. These paired
structures expose cross-task semantic correspondences and support consistent
policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware
variant based on Group Relative Policy Optimization. It assigns a similarity
score to each pair to modulate the advantage, strengthening learning from
well-aligned examples and reducing task interference. We curate a high-quality
dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on
the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on
various UVLMs, outperforming strong UVLM RL baselines. Code:
https://github.com/Haochen-Wang409/PairUni
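The retrieved-pair step in the abstract (matching each generation sample to a semantically related understanding example) might look roughly like the sketch below. The abstract does not say how retrieval is performed, so the use of cosine similarity over precomputed embeddings is an assumption, and the names `cosine` and `retrieve_pairs` are hypothetical, not taken from the PairUni code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_pairs(gen_embs, und_embs):
    """For each generation sample, pick the most similar understanding sample.

    Assumption: both inputs are lists of embedding vectors produced by some
    shared encoder. Returns (generation_index, understanding_index, similarity)
    tuples, one per generation sample.
    """
    pairs = []
    for g_idx, g in enumerate(gen_embs):
        sims = [cosine(g, u) for u in und_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        pairs.append((g_idx, best, sims[best]))
    return pairs
```

The similarity stored with each pair is kept because the abstract says a per-pair similarity score is later used during optimization.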
Authors (9)
Jiani Zheng
Zhiyang Teng
Xiangtai Li
Anran Wang
Yu Tian
Kunpeng Qiu
+3 more
Submitted
October 29, 2025
Key Contributions
This paper proposes PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs so that the two tasks can be balanced when training Unified Vision-Language Models (UVLMs) with reinforcement learning. It uses GPT-o3 to augment single-task data into aligned pairs and adds retrieved pairs that link related samples, exposing cross-task semantic correspondences. A new RL variant, Pair-GPRO, uses each pair's similarity score to modulate the advantage, which supports consistent policy learning.
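As a rough illustration of how a pair-aware variant of Group Relative Policy Optimization could use a similarity score, the sketch below computes standard group-normalized advantages and then scales them by the pair's similarity. The multiplicative form, the use of the population standard deviation, and the function names are all assumptions made for illustration; the summary only states that the similarity score modulates the advantage.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def pair_weighted_advantages(rewards, similarity):
    """Scale group-relative advantages by a pair similarity score.

    Assumption: `similarity` lies in [0, 1] and acts as a plain multiplier,
    so well-aligned pairs contribute larger updates than loosely related ones.
    """
    return [similarity * a for a in group_relative_advantages(rewards)]
```

Under this assumed form, a retrieved pair with low similarity yields advantages closer to zero than an aligned pair with the same rewards, which is one way to reduce interference between the two tasks.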
Business Value
Enables the development of more capable and versatile multimodal AI systems that can both understand and generate content, leading to richer applications in areas like image captioning, visual question answering, and image generation.