📄 Abstract
Unified multimodal models have recently shown remarkable gains in both
capability and versatility, yet most leading systems are still trained from
scratch and require substantial computational resources. In this paper, we show
that competitive performance can be obtained far more efficiently by
strategically fusing publicly available models specialized for either
generation or understanding. Our key design is to retain the original blocks
while additionally interleaving multimodal self-attention blocks throughout the
networks. This double fusion mechanism (1) effectively enables rich multimodal
fusion while largely preserving the original strengths of the base models, and
(2) catalyzes synergistic fusion of high-level semantic representations from
the understanding encoder with low-level spatial signals from the generation
encoder. By training with only ~35B tokens, this approach achieves strong
results across multiple benchmarks: 0.91 on GenEval for compositional
text-to-image generation, 82.16 on DPG-Bench for complex text-to-image
generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By
fully releasing the entire suite of code, model weights, and datasets, we hope
to support future research on unified multimodal modeling.
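The key design described in the abstract (keep each base model's original blocks and interleave multimodal self-attention blocks between them) can be illustrated with a minimal sketch. This is not the released LightBagel code. All names here (`self_attention`, `fused_layer`, `double_fusion_forward`, `gate`) are hypothetical, the attention is parameter-free, and placing one fusion step after every pair of original blocks, with a scalar gate on its residual, is an assumption made only for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Parameter-free scaled dot-product self-attention over a list of vectors."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out

def fused_layer(und_tokens, gen_tokens, und_block, gen_block, gate):
    # 1) Run each pathway's original block unchanged.
    und = [und_block(t) for t in und_tokens]
    gen = [gen_block(t) for t in gen_tokens]
    # 2) Interleaved multimodal step: attend over the concatenated sequence.
    joint = und + gen
    mixed = self_attention(joint)
    # 3) Gated residual (assumption): gate = 0 recovers the unfused base outputs.
    fused = [[x + gate * m for x, m in zip(tok, mix)] for tok, mix in zip(joint, mixed)]
    return fused[:len(und)], fused[len(und):]

def double_fusion_forward(und_tokens, gen_tokens, und_blocks, gen_blocks, gate=1.0):
    # Assumes the two networks have the same depth and are paired layer by layer.
    for und_block, gen_block in zip(und_blocks, gen_blocks):
        und_tokens, gen_tokens = fused_layer(und_tokens, gen_tokens, und_block, gen_block, gate)
    return und_tokens, gen_tokens

# Toy stand-ins for the pretrained blocks (hypothetical).
und_blocks = [lambda t: [2 * x for x in t]] * 2
gen_blocks = [lambda t: [x + 1 for x in t]] * 2

base_und, base_gen = double_fusion_forward([[1, 0]], [[0, 1]], und_blocks, gen_blocks, gate=0.0)
fused_und, fused_gen = double_fusion_forward([[1, 0]], [[0, 1]], und_blocks, gen_blocks, gate=1.0)
```

With `gate=0.0` the fusion step contributes nothing and each pathway behaves exactly like its base model. With a nonzero gate, every token in either pathway also receives information from the other pathway at every layer.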
Authors (11)
Zeyu Wang
Zilong Chen
Chenhui Gou
Feng Li
Chaorui Deng
Deyao Zhu
+5 more
Submitted
October 27, 2025
Key Contributions
LightBagel proposes a lightweight double fusion framework that achieves competitive performance in unified multimodal understanding and generation by fusing publicly available models specialized for one task or the other. It retains the original blocks of each base model and interleaves multimodal self-attention blocks between them, so that high-level semantic representations from the understanding encoder are fused with low-level spatial signals from the generation encoder, while training on only ~35B tokens.
Business Value
By reusing existing specialized models instead of training from scratch, this approach reduces the training cost and time needed to build versatile multimodal AI applications, which makes such systems more accessible to develop and deploy.