2604.05930v1 Apr 07, 2026 cs.CL

"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Zhihui Fu (Citations: 77, h-index: 5)

Naen Xu (Citations: 24, h-index: 3)

Chunyi Zhou (Citations: 150, h-index: 6)

Jun Wang (Citations: 10, h-index: 2)

Tianyu Du (Citations: 1,763, h-index: 17)

Jiayi Sheng (Citations: 27, h-index: 2)

Changjiang Li, Stony Brook (Citations: 464, h-index: 11)

Jinbao Li (Citations: 1, h-index: 1)

Shouling Ji (Citations: 72, h-index: 5)

Yuyuan Li (Citations: 75, h-index: 3)

Abstract

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our findings provide valuable insights for developing future VLMs that master the subtleties of human-like humor via cross-modal reasoning.
