2603.17450v1 Mar 18, 2026 cs.IR

VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

Hwanjo Yu

Junyoung Kim

Woojoo Kim

Jaehyung Lim

Dongha Kim

Abstract

Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify their inherent modality collapse. In this state, optimization is dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.
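
The abstract gives no formulas, so the following is only a rough sketch of the two ideas it names, not the paper's actual objectives. It assumes a standard InfoNCE loss with in-batch negatives, adds an extra penalty on whichever modality currently has the higher loss, and measures "geometric consistency" as the gap between the text and image item-item cosine-similarity matrices. All function names, the `penalty` and `reg` weights, and the exact form of both terms are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(queries, keys, temperature=0.1):
    # Standard InfoNCE with in-batch negatives: row i of `queries`
    # should match row i of `keys` and mismatch every other row.
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def topology_gap(text_emb, image_emb):
    # Assumed regularizer: mean squared difference between the two
    # modalities' item-item cosine-similarity matrices.
    t, v = l2_normalize(text_emb), l2_normalize(image_emb)
    return np.mean((t @ t.T - v @ v.T) ** 2)

def balanced_loss(seq_emb, text_emb, image_emb, penalty=1.0, reg=0.1):
    # Assumed weighting: add an extra penalty on the weaker (higher-loss)
    # modality so it is not ignored during optimization.
    loss_text = info_nce(seq_emb, text_emb)
    loss_image = info_nce(seq_emb, image_emb)
    weak = max(loss_text, loss_image)
    return (loss_text + loss_image + penalty * weak
            + reg * topology_gap(text_emb, image_emb))

rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 16))                 # sequence (user history) embeddings
text = seq + 0.1 * rng.normal(size=(8, 16))    # well-aligned modality
image = rng.normal(size=(8, 16))               # poorly aligned modality
total = balanced_loss(seq, text, image)
```

In this toy setup the image embeddings are unrelated to the sequence embeddings, so the image term is the larger of the two losses and receives the additional penalty.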
