2605.02035v1 May 03, 2026 cs.CL

A Multimodal Dataset for Visually Grounded Ambiguity in Machine Translation

Longyue Wang

Citations: 210

h-index: 6

Weihua Luo

Citations: 901

h-index: 14

Liang Ding

University of Sydney / JD Explore Academy

Citations: 5,343

h-index: 40

Jingheng Pan

Citations: 228

h-index: 3

Xintong Wang

Citations: 110

h-index: 2

Christian Biemann

Citations: 244

h-index: 4

Abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks that provide supportive evidence for the role of vision, we observe substantial issues in data quality and a mismatch with translation scenarios. Moreover, existing ambiguity-oriented evaluations are not well suited to broader ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art Large Vision Language Models under vanilla inference, supervised fine-tuning (SFT), and our chain-of-thought SFT (CoT-SFT) show that while SFT improves overall translation quality, CoT-SFT yields more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets, indicating a stronger generalization for resolving diverse ambiguity types.
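
The abstract describes the Disambiguation-Centric Metrics only at a high level, so the following is a minimal sketch of how span-level judge verdicts could be aggregated into a disambiguation accuracy, not the authors' implementation. The `Instance` fields, the `Judge` signature, and `keyword_judge` (a toy substring check standing in for the LLM-as-a-judge classifier) are all assumptions made for illustration, as are the two example sentences.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Instance:
    """One evaluation item. All field names are hypothetical."""
    source: str              # source sentence
    ambiguous_span: str      # annotated ambiguous expression in the source
    expected_rendering: str  # an acceptable target-language rendering of the span
    hypothesis: str          # system translation to be judged


# A judge maps an instance to True (span resolved correctly) or False.
# In the paper this role is played by an LLM; here it is any callable.
Judge = Callable[[Instance], bool]


def disambiguation_accuracy(instances: Iterable[Instance], judge: Judge) -> float:
    """Fraction of instances whose annotated span the judge marks as resolved."""
    verdicts = [judge(item) for item in instances]
    if not verdicts:
        raise ValueError("no instances to evaluate")
    return sum(verdicts) / len(verdicts)


def keyword_judge(item: Instance) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return item.expected_rendering.lower() in item.hypothesis.lower()


# Invented English-to-German examples: the image is assumed to show a riverbank.
examples = [
    Instance("He sat by the bank.", "bank", "Ufer", "Er saß am Ufer."),
    Instance("He sat by the bank.", "bank", "Ufer", "Er saß bei der Bank."),
]

accuracy = disambiguation_accuracy(examples, keyword_judge)
print(f"disambiguation accuracy: {accuracy:.2f}")
```

Keeping the judge as a plain callable separates the aggregation from the verdict source, so a real LLM-backed classifier could be swapped in without changing how the score is computed.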
