Abstract: Multimodal foundation models (MFMs) have revolutionized sequential
recommender systems through advanced representation learning. While
Parameter-Efficient Fine-Tuning (PEFT) is commonly used to adapt these models,
studies often prioritize parameter efficiency, overlooking GPU memory usage and
training speed. To address this, we previously introduced the IISAN framework,
which significantly enhances efficiency. However, IISAN was limited to symmetrical
MFMs with identical text and image encoders, preventing the use of
state-of-the-art Large Language Models. To overcome this, we develop
IISAN-Versa, a versatile plug-and-play architecture compatible with both
symmetrical and asymmetrical MFMs. IISAN-Versa employs a Decoupled PEFT
structure and utilizes both intra- and inter-modal adaptation. It handles
asymmetry through a simple yet effective combination of group
layer-dropping and dimension transformation alignment. Our research
demonstrates that IISAN-Versa effectively adapts large text encoders, and we
further identify a scaling effect where larger encoders generally perform
better. IISAN-Versa also demonstrates strong versatility across the
multimodal scenarios we define, which include raw titles and captions generated from
images and videos. Additionally, IISAN-Versa achieves state-of-the-art
performance on the Microlens public benchmark. We release our code at
https://github.com/GAIR-Lab/IISAN.
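
Below is a minimal, illustrative sketch of how the asymmetry handling described above (group layer-dropping combined with dimension transformation alignment) could look in a PyTorch-style setup. The module name `AsymmetryAligner`, the layer counts, and the hidden sizes are illustrative assumptions, not the released IISAN-Versa implementation; see the repository above for the authors' code.

```python
import torch
import torch.nn as nn


class AsymmetryAligner(nn.Module):
    """Illustrative sketch only (not the released IISAN-Versa code).

    Handles an asymmetrical backbone pair where the text encoder is deeper
    and wider than the image encoder:
      * group layer-dropping: keep one hidden state per group of text layers
        so the number of adaptation points matches the image tower;
      * dimension transformation alignment: project text hidden states down
        to the shared adapter dimension.
    """

    def __init__(self, num_text_layers=24, num_image_layers=12,
                 text_dim=4096, adapter_dim=768):
        super().__init__()
        # Group consecutive text layers and keep the last layer of each group.
        group_size = num_text_layers // num_image_layers
        self.kept_layer_ids = list(range(group_size - 1, num_text_layers, group_size))
        # A single linear map aligns the text hidden size with the adapter size
        # (a per-layer projection would be an equally plausible choice).
        self.dim_align = nn.Linear(text_dim, adapter_dim)

    def forward(self, text_hidden_states):
        # text_hidden_states: list of per-layer tensors of shape [batch, seq, text_dim]
        kept = [text_hidden_states[i] for i in self.kept_layer_ids]
        return [self.dim_align(h) for h in kept]  # each: [batch, seq, adapter_dim]


# Usage: 24 dummy hidden states from a large text encoder -> 12 aligned states.
states = [torch.randn(2, 16, 4096) for _ in range(24)]
aligned = AsymmetryAligner()(states)
print(len(aligned), aligned[0].shape)  # 12 torch.Size([2, 16, 768])
```

In this sketch, the aligned states would then feed the decoupled intra- and inter-modal side adapters; the exact grouping rule and placement of the projection in IISAN-Versa may differ.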