2603.01696v1 Mar 02, 2026 cs.CV

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Bo Zheng (Citations: 64, h-index: 5)

Haonan Jia (Citations: 3, h-index: 1)

Shichao Dong (Citations: 28, h-index: 3)

Xin Dong (Citations: 310, h-index: 7)

Zenghui Sun (Citations: 16, h-index: 2)

Jinsong Lan (Citations: 2, h-index: 1)

Jin Wang (Citations: 0, h-index: 0)

Xiaoyong Zhu (Citations: 228, h-index: 9)

Kaifu Zhang (Citations: 24, h-index: 3)

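Large Vision-Language Models (LVLMs) often omit or distort important visual content in the image descriptions they generate. Minimizing this information loss can steer an LVLM toward the details of an image so that it produces accurate descriptions. Measuring information loss during modality conversion, however, is inherently difficult because of the modality gap between visual content and text output. This paper argues that the quality of an image description is positively correlated with the similarity among the images retrieved by a text search that uses the description as its query. Building on this insight, the authors propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that improves image captioning without additional annotations. Specifically, the method quantifies information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. An LVLM trained under these metrics aims to minimize information loss and achieve an identity mapping from images to captions. Experiments show that the proposed method outperforms supervised fine-tuning on image captioning. In particular, on the COCO-LN500 benchmark, CIM improves relation reasoning by 20% on Qwen2.5-VL-7B. The code is to be released once the paper is accepted.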

Original Abstract

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modality gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, the LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. In particular, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.
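
The abstract names the two reward signals but does not define them, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that images and captions have already been embedded into a shared vector space by some retrieval model, that retrieval is top-k by cosine similarity, that Gallery Representation Consistency is the mean pairwise similarity among the retrieved images, and that Query-gallery Image Relevance is the mean similarity between the source image and the retrieved images. All function names, the parameter `k`, and the mixing `weight` are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(caption_emb, gallery_embs, k):
    """Return the k gallery embeddings most similar to the caption embedding."""
    ranked = sorted(gallery_embs, key=lambda g: cosine(caption_emb, g), reverse=True)
    return ranked[:k]


def gallery_consistency(retrieved):
    """ASSUMPTION: consistency = mean pairwise cosine similarity of retrieved images."""
    pairs = list(combinations(retrieved, 2))
    if not pairs:
        return 1.0  # zero or one image is trivially consistent with itself
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def query_gallery_relevance(query_img_emb, retrieved):
    """ASSUMPTION: relevance = mean cosine similarity between source image and retrieved images."""
    if not retrieved:
        return 0.0
    return sum(cosine(query_img_emb, g) for g in retrieved) / len(retrieved)


def caption_reward(query_img_emb, caption_emb, gallery_embs, k=2, weight=0.5):
    """Hypothetical scalar reward: weighted sum of the two assumed signals."""
    retrieved = retrieve_top_k(caption_emb, gallery_embs, k)
    consistency = gallery_consistency(retrieved)
    relevance = query_gallery_relevance(query_img_emb, retrieved)
    return weight * consistency + (1.0 - weight) * relevance
```

Under these assumptions, a caption whose embedding points at a tight cluster of gallery images close to the source image scores higher than a vague caption that retrieves unrelated images, which matches the correlation the abstract argues for. A reward of this kind needs no caption annotations, only an image gallery and a retrieval model.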

