2601.01513v2 Jan 04, 2026 cs.CV

FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation

Abstract

Vision-Language Models (VLMs) excel at visual reasoning but still struggle to integrate external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify incorrect entity recognition in retrieved knowledge as a major source of error and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.
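
The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is one possible reading, assuming retrieved passages come with embedding vectors, that filtering uses cosine similarity against a query embedding, and that verification means scoring each drafted candidate and keeping the best one (the abstract also mentions refinement, which is omitted here). The names `filter_by_similarity`, `draft_then_verify`, `draft`, `score`, and the `threshold` value are all hypothetical, and the toy lambdas stand in for the draft and heavyweight models.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    query_vec: Vector,
    docs: Sequence[tuple[str, Vector]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> list[str]:
    """Keep retrieved passages whose embedding is close enough to the query."""
    return [text for text, vec in docs if cosine(query_vec, vec) >= threshold]


def draft_then_verify(
    question: str,
    contexts: Sequence[str],
    draft: Callable[[str, str], str],    # stand-in for the lightweight draft model
    score: Callable[[str, str], float],  # stand-in for the heavyweight verifier
) -> Optional[str]:
    """Draft one candidate per context, then return the verifier's top choice."""
    candidates = [draft(question, ctx) for ctx in contexts]
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(question, c))


# Toy data: two retrieved passages with 2-d embeddings (illustrative only).
docs = [
    ("Eiffel Tower, Paris", [1.0, 0.0]),
    ("Tokyo Tower, Tokyo", [0.0, 1.0]),
]
kept = filter_by_similarity([0.9, 0.1], docs, threshold=0.5)

# Toy models: the draft echoes the entity name; the verifier prefers longer text.
answer = draft_then_verify(
    "Which landmark is shown?",
    kept,
    draft=lambda q, ctx: ctx.split(",")[0],
    score=lambda q, c: float(len(c)),
)
```

In this sketch the filter runs before drafting, so the draft model only sees passages that match the query entity; whether the paper applies filtering at that stage is an assumption.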
