2602.03007v1 Feb 03, 2026 cs.CV

VOILA: Cost-Efficient Multimodal Question Answering through Value-of-Information-Based Adaptive Fidelity Selection

VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering

R. Bhope

K. Jayaram

Vinod Muthusamy

Ritesh Kumar

Vatche Isahagian

N. Venkatasubramanian


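Most multimodal vision-language systems operate at a fixed fidelity level, even though retrieving and processing high-fidelity visual input is costly. This work introduces VOILA, a framework for value-of-information-driven adaptive fidelity selection in Visual Question Answering (VQA). VOILA optimizes which information to retrieve before the model runs. It uses a gradient-boosted regressor to predict, from question features, the likelihood of a correct answer at each fidelity level, then applies an isotonic calibrator to adjust these probabilities so that decisions are reliable. The system selects the minimum-cost fidelity that maximizes expected utility, given the predicted accuracy and the retrieval cost. VOILA was evaluated in three deployment scenarios on five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) using six vision-language models (VLMs) with 7B-235B parameters. Across diverse question types and model architectures, it consistently reduced cost by 50-60% while retaining 90-95% of full-resolution accuracy. This shows that pre-retrieval fidelity selection is important for optimizing multimodal inference under resource constraints.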

Original Abstract

Despite significant costs from retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that optimizes what information to retrieve before model execution. Given a query, VOILA uses a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B-235B parameters. VOILA consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints.
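
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the utility function, so the linear form `p_correct - cost_weight * cost`, the `Fidelity` record, the `select_fidelity` helper, the fidelity names, and all numbers below are assumptions made for illustration. The calibrated probabilities are taken as given; in VOILA they would come from the gradient-boosted regressor followed by the isotonic calibrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fidelity:
    """One retrieval option (hypothetical structure, not from the paper)."""
    name: str         # illustrative label such as "low_res"
    cost: float       # retrieval and processing cost, arbitrary units
    p_correct: float  # calibrated probability of a correct answer


def select_fidelity(options, cost_weight=1.0, tol=1e-9):
    """Return the cheapest option among those with maximal expected utility.

    Assumed utility (not specified in the abstract):
        utility = p_correct - cost_weight * cost
    """
    if not options:
        raise ValueError("options must not be empty")

    def utility(option):
        return option.p_correct - cost_weight * option.cost

    best = max(utility(o) for o in options)
    # Options within `tol` of the best utility count as tied;
    # the tie is broken in favour of the lowest cost.
    tied = [o for o in options if best - utility(o) <= tol]
    return min(tied, key=lambda o: o.cost)


# Made-up example values for a single query.
options = [
    Fidelity("text_only", cost=0.0, p_correct=0.55),
    Fidelity("low_res", cost=0.2, p_correct=0.80),
    Fidelity("full_res", cost=1.0, p_correct=0.90),
]
choice = select_fidelity(options, cost_weight=0.3)
print(choice.name)
```

Under this assumed utility, `cost_weight` controls the trade-off: at 0 the rule always picks the most accurate option, and larger values push the choice toward cheaper fidelities.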
