2604.04969v1 Apr 04, 2026 cs.IR

MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

Jun Yu

Citations: 5

h-index: 2

Qiang Huang

Harbin Institute of Technology (Shenzhen)

Citations: 792

h-index: 14

Xiaoxing You

Citations: 26

h-index: 3

Sijun Dai

Citations: 2

h-index: 1

Abstract

Retrieval-Augmented Generation (RAG) mitigates hallucinations in Multimodal Large Language Models (MLLMs), yet existing systems struggle with complex cross-modal reasoning. Flat vector retrieval often ignores structural dependencies, while current graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. To address these limitations, we propose **MG$^2$-RAG**, a lightweight Multi-Granularity Graph RAG framework that jointly improves graph construction, modality fusion, and cross-modal retrieval. MG$^2$-RAG constructs a hierarchical multimodal knowledge graph by combining lightweight textual parsing with entity-driven visual grounding, enabling textual entities and visual regions to be fused into unified multimodal nodes that preserve atomic evidence. Building on this representation, we introduce a multi-granularity graph retrieval mechanism that aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. Extensive experiments across four representative multimodal tasks (i.e., retrieval, knowledge-based VQA, reasoning, and classification) demonstrate that MG$^2$-RAG consistently achieves state-of-the-art performance while reducing graph construction overhead, with an average 43.3$\times$ speedup and 23.9$\times$ cost reduction compared with advanced graph-based frameworks.
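The abstract does not spell out how relevance is propagated across the graph, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a personalized-PageRank-style update seeded with cosine similarities; the node names, toy embeddings, edge list, and the `alpha` and `iters` parameters are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def propagate_relevance(seed, edges, alpha=0.5, iters=20):
    """Spread non-negative seed scores over an undirected graph.

    Assumption: a personalized-PageRank-style update. Each round, a node keeps
    (1 - alpha) of its normalized seed score and receives an equal share of
    alpha times each neighbor's current score.
    """
    neighbors = {node: [] for node in seed}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    total = sum(seed.values()) or 1.0
    restart = {node: s / total for node, s in seed.items()}
    score = dict(restart)

    for _ in range(iters):
        nxt = {node: (1 - alpha) * restart[node] for node in seed}
        for u, nbrs in neighbors.items():
            if not nbrs:
                nxt[u] += alpha * score[u]  # isolated node keeps its own mass
                continue
            share = alpha * score[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        score = nxt
    return score


# Hypothetical toy graph: a textual entity linked to a visual region, plus an
# unrelated entity/chunk pair. Embeddings are made-up 2-D vectors.
embeddings = {
    "entity_a": [1.0, 0.0],
    "region_a": [0.0, 1.0],
    "entity_b": [0.0, 1.0],
    "chunk_b": [0.0, 1.0],
}
edges = [("entity_a", "region_a"), ("entity_b", "chunk_b")]
query = [1.0, 0.0]

# Dense similarity gives the seed scores; negatives are clamped to zero.
seed = {node: max(0.0, cosine(query, vec)) for node, vec in embeddings.items()}
scores = propagate_relevance(seed, edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setup `region_a` has no direct similarity to the query, yet it picks up relevance through its edge to `entity_a`, while the unconnected pair stays at zero. That is the kind of one-hop evidence a flat vector search would miss.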

