2604.00513v2 Apr 01, 2026 cs.LG

MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding

Chuan Yu (Citations: 95, h-index: 4)

Junxian Wu (Citations: 21, h-index: 2)

Chenghan Fu (Citations: 15, h-index: 3)

Zhanheng Nie (Citations: 10, h-index: 2)

Daoze Zhang (Citations: 10, h-index: 2)

Bowen Wan (Citations: 12, h-index: 2)

Wanxian Guan (Citations: 74, h-index: 4)

Jian Xu (Citations: 22, h-index: 3)

Bo Zheng (Citations: 29, h-index: 4)

Abstract

With the rapid growth of e-commerce, exploring general representations rather than task-specific ones has attracted increasing attention. Although recent multimodal large language models (MLLMs) have driven significant progress in product understanding, they are typically employed as feature extractors that implicitly encode product information into global embeddings, which limits their ability to capture fine-grained attributes. Therefore, we argue that leveraging the reasoning capabilities of MLLMs to explicitly model fine-grained product attributes holds significant potential. Nevertheless, achieving this goal remains non-trivial due to several key challenges: (i) long-context reasoning tends to dilute the model's attention to salient information in the raw input; (ii) supervised fine-tuning (SFT) primarily encourages rigid imitation, limiting the exploration of effective reasoning strategies; and (iii) fine-grained details are progressively attenuated during forward propagation. To address these issues, we propose MOON3.0, the first reasoning-aware MLLM-based model for product representation learning. Our method (1) employs a multi-head modality fusion module to adaptively integrate raw signals; (2) incorporates a joint contrastive and reinforcement learning framework to autonomously explore more effective reasoning strategies; and (3) introduces a fine-grained residual enhancement module to progressively preserve local details throughout the network. Additionally, we release a large-scale multimodal e-commerce benchmark, MBE3.0. Experimentally, our model demonstrates state-of-the-art zero-shot performance across various downstream tasks on both our benchmark and public datasets.
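
The abstract mentions a contrastive objective but does not give its form. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss over paired embeddings, written in plain Python. It is an assumption for illustration and not MOON3.0's actual loss: the function names `l2_normalize` and `info_nce`, the temperature of 0.07, and the toy vectors are all hypothetical, and the reinforcement learning part of the framework is not covered.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    queries[i] and keys[i] form the positive pair; every other key in the
    batch acts as a negative. The temperature value is an assumed default.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the positive key.
        total += log_sum_exp - logits[i]
    return total / len(q)


# Toy embeddings: each query matches its own key in `aligned`,
# and the other item's key in `swapped`.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is close to zero when each query is nearest to its own key and grows large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.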

