2603.12625v1 Mar 13, 2026 cs.IR

VLM4Rec: Multimodal Semantic Representation for Recommendation with Large Vision-Language Models

Ty Valencia

Burak Barlas

V. Singhal

Ruchi Bhatia

Wei Yang

Original Abstract

Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at https://github.com/tyvalencia/enhancing-mm-rec-sys.
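The profile-based matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the profile is built or scored, so mean pooling over the history and cosine similarity are assumptions made here, and the item names and three-dimensional vectors are hypothetical stand-ins for embeddings that would, in the paper's pipeline, come from encoding LVLM-generated image descriptions offline.

```python
import math


def mean_profile(history_embeddings):
    """Average the embeddings of a user's past items (assumed pooling choice)."""
    dim = len(history_embeddings[0])
    count = len(history_embeddings)
    return [sum(vec[d] for vec in history_embeddings) / count for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(item_embeddings, history_ids, k=2):
    """Rank unseen items by similarity to the user's mean-pooled profile."""
    profile = mean_profile([item_embeddings[i] for i in history_ids])
    seen = set(history_ids)
    scored = [
        (cosine(profile, vec), item_id)
        for item_id, vec in item_embeddings.items()
        if item_id not in seen
    ]
    # Highest similarity first; ties broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:k]]


# Hypothetical precomputed item embeddings (the "offline" stage).
ITEM_EMBEDDINGS = {
    "linen_shirt": [0.9, 0.1, 0.0],
    "linen_trousers": [0.8, 0.2, 0.1],
    "cotton_tee": [0.7, 0.3, 0.0],
    "steel_watch": [0.0, 0.1, 0.9],
    "leather_boots": [0.1, 0.9, 0.2],
}

# The "online" stage: build a profile from history and retrieve the top items.
top = recommend(ITEM_EMBEDDINGS, ["linen_shirt", "linen_trousers"], k=2)
print(top)
```

Because item embeddings are computed once ahead of time, the online step reduces to one pooling operation and a similarity ranking, which is the kind of offline-online decomposition the abstract refers to.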
