2603.26859v1 Mar 27, 2026 cs.CV

Beyond Textual Knowledge: Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation

Yinfeng Yu

Xinjiang University

Dongsheng Yang

Liejun Wang

Abstract

Vision-and-Language Navigation (VLN) requires an agent to navigate through complex unseen environments based on natural language instructions. However, existing methods often struggle to capture key semantic cues and to align them accurately with visual observations. To address this limitation, we propose Beyond Textual Knowledge (BTK), a VLN framework that integrates environment-specific textual knowledge with generative image knowledge bases. BTK employs Qwen3-4B to extract goal-related phrases and uses Flux-Schnell to construct two large-scale image knowledge bases: R2R-GP and REVERIE-GP. Additionally, we leverage BLIP-2 to construct a large-scale textual knowledge base derived from panoramic views, providing environment-specific semantic cues. These multimodal knowledge bases are integrated via the Goal-Aware Augmentor and the Knowledge Augmentor, significantly enhancing semantic grounding and cross-modal alignment. Extensive experiments on the R2R dataset with 7,189 trajectories and the REVERIE dataset with 21,702 instructions demonstrate that BTK significantly outperforms existing baselines. On the test unseen splits of R2R and REVERIE, SR increased by 5% and 2.07% respectively, and SPL increased by 4% and 3.69% respectively. The source code is available at https://github.com/yds3/IPM-BTK/.
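The abstract reports gains in SR and SPL without defining them. As a minimal sketch, assuming the definitions commonly used in the VLN literature (success rate, and success weighted by path length), the two metrics can be computed as below. The `Episode` class, its field names, and the sample values are hypothetical and serve only as illustration; they are not taken from the BTK code base or from the paper's results.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool         # agent stopped within the success radius of the goal
    shortest_path: float  # shortest start-to-goal path length, assumed > 0
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes):
    """SR: fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """SPL: success weighted by (shortest path / max(agent path, shortest path))."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Made-up episodes, not data from the paper.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # success with a detour
    Episode(success=False, shortest_path=8.0, agent_path=5.0),    # failure contributes 0
    Episode(success=True, shortest_path=6.0, agent_path=8.0),
]

print(f"SR  = {success_rate(episodes):.4f}")
print(f"SPL = {spl(episodes):.4f}")
```

Under these definitions SPL can never exceed SR, because each successful episode contributes at most 1 to the sum. That is why an agent that reaches the goal by a long detour raises SR more than it raises SPL.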
