2602.15915v1 Feb 17, 2026 cs.CV

MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Kai Ye (Citations: 9, h-index: 2)
Xianwei Mao (Citations: 4, h-index: 1)
Sheng Zhou (Citations: 11, h-index: 2)
Haikuan Huang (Citations: 224, h-index: 5)
Bin Li (Citations: 7, h-index: 2)
Jiajun Bu (Citations: 82, h-index: 4)
Nan Zhang (Citations: 2, h-index: 1)

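Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or inconsistent with the visual content, and the model's internal knowledge is difficult to control and interpret. Simply combining these sources limits reasoning effectiveness and lowers answer accuracy. To address these problems, we propose MaS-VQA, a selection-based framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages, then applies a Mask-and-Select mechanism that simultaneously removes irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. The filtered knowledge guides the activation of internal knowledge within a constrained semantic space, so that explicit and implicit knowledge are modeled in a complementary way for robust answer prediction. Experiments on the Encyclopedic-VQA and InfoSeek datasets show consistent performance gains across multiple multimodal large language model (MLLM) backbones, and ablations confirm that the selection mechanism effectively reduces noise and improves knowledge utilization.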

Original Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
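
The abstract does not say how relevance is scored or how the pruning is carried out, so the following is only an illustrative sketch of the general mask-and-select idea, not the authors' method. It assumes that a question embedding, image-region embeddings, and knowledge-fragment embeddings are already available as plain vectors, and that cosine similarity is an acceptable relevance score. The function names `mask_regions` and `select_fragments`, and the parameters `threshold` and `top_k`, are hypothetical.

```python
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mask_regions(
    query: Sequence[float],
    region_embs: Sequence[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return a 0/1 mask over image regions (assumed scoring rule).

    1 keeps a region whose similarity to the query reaches the threshold,
    0 masks it out as irrelevant.
    """
    return [1 if cosine(query, r) >= threshold else 0 for r in region_embs]


def select_fragments(
    query: Sequence[float],
    fragments: Sequence[str],
    fragment_embs: Sequence[Sequence[float]],
    top_k: int,
) -> List[str]:
    """Keep the top_k knowledge fragments most similar to the query."""
    scored = sorted(
        zip(fragments, fragment_embs),
        key=lambda pair: cosine(query, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:top_k]]
```

In this sketch the two pruning steps are independent and share only the query vector; the abstract describes the pruning as joint, so a faithful implementation would presumably couple the region and fragment decisions in some way that is not specified here.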
