2603.26052v1 Mar 27, 2026 cs.CV

Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification

Yujia Liu

Citations: 0

h-index: 0

Ziyang Ren

Citations: 86

h-index: 4

Ping Wei

Citations: 107

h-index: 1

Huan Li

Citations: 238

h-index: 7

Xiang Yin

Citations: 70

h-index: 3

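As multimodal misinformation grows more sophisticated, detecting it and grounding the evidence has become important. Current multimodal verification methods, however, rely on passive holistic fusion and are therefore vulnerable to sophisticated misinformation. Because of 'feature dilution', global alignment averages out subtle local semantic inconsistencies and ends up hiding the very conflicts it is meant to detect. This work proposes MaLSF (Mask-aware Local Semantic Fusion), a new framework. MaLSF mimics human cognitive cross-referencing through active, bidirectional verification, and uses mask-label pairs as semantic anchors to connect pixels and words. Its core mechanism comprises two innovations: 1) the Bidirectional Cross-modal Verification (BCV) module acts as an 'interrogator', running text-based and image-based queries in parallel to explicitly identify conflicts; 2) the Hierarchical Semantic Aggregation (HSA) module aggregates conflict signals at multiple levels of granularity to perform task-specific reasoning. In addition, a set of diverse mask-label pair extraction parsers was developed to extract fine-grained mask-label pairs. MaLSF achieves state-of-the-art performance on the DGM4 dataset and on multimodal fake news detection tasks. Extensive analysis and visualization results further support its effectiveness and interpretability.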

Original Abstract

As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
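The 'feature dilution' argument can be made concrete with a toy calculation. The sketch below is not MaLSF and does not reproduce any component of the paper. It only contrasts a mean-pooled global similarity with per-pair local similarities on hand-made vectors. The function names, the 4-dimensional embeddings, and the index-based pairing of regions with tokens (a stand-in for mask-label anchors) are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors):
    """Average a list of vectors into one global vector."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def global_score(image_regions, text_tokens):
    """Holistic fusion: compare the pooled image and pooled text vectors."""
    return cosine(mean_pool(image_regions), mean_pool(text_tokens))

def local_scores(image_regions, text_tokens):
    """Local comparison: region i is paired with token i (hypothetical anchor)."""
    return [cosine(r, t) for r, t in zip(image_regions, text_tokens)]

# Toy embeddings (assumed): four one-hot directions, each used twice.
e0, e1, e2, e3 = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
regions = [e0, e1, e2, e3, e0, e1, e2, e3]
# The text agrees with the image everywhere except the last pair,
# where the token points in an unrelated direction (a manipulated word).
tokens = [e0, e1, e2, e3, e0, e1, e2, e0]

g = global_score(regions, tokens)
locals_ = local_scores(regions, tokens)
worst_pair = min(range(len(locals_)), key=lambda i: locals_[i])

print(f"global similarity: {g:.3f}")
print(f"lowest local similarity: {locals_[worst_pair]:.3f} at pair {worst_pair}")
```

In this toy setup the pooled score stays high even though one region-token pair is completely mismatched, while the per-pair scores expose the conflict and its position. This is only the intuition behind preferring local, anchored comparison over a single global alignment score.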
