2601.22709v2 Jan 30, 2026 cs.CV

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

A. Habibian

Citations: 1,285

h-index: 18

Yanlong Chen

Citations: 19

h-index: 3

Luca Benini

Citations: 180

h-index: 8

Yawei Li

ETH Zürich

Citations: 5,840

h-index: 32

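Vision-language models (VLMs) deliver strong multimodal performance but are expensive to deploy, and post-training quantization often incurs substantial accuracy loss. Quantization-aware training (QAT) holds promise for VLMs but has not yet been studied sufficiently. This paper proposes GRACE, a framework that unifies knowledge distillation and QAT on the basis of the Information Bottleneck principle: quantization limits information capacity, and distillation guides which information to preserve within that limited budget. Treating the teacher model as a proxy for task-relevant information, the authors introduce confidence-gated decoupled distillation to filter out unreliable supervision signals, relational centered kernel alignment to transfer visual token structure, and an adaptive controller, derived via Lagrangian relaxation, that balances fidelity against capacity constraints. Across extensive benchmarks on the LLaVA and Qwen families, the INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench). With a real INT4 kernel, the method achieves 3× throughput and a 54% memory reduction. It substantially outperforms existing quantization methods, making GRACE an attractive option for deployment in resource-constrained environments.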

Original Abstract

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using a real INT4 kernel, we achieve 3$\times$ throughput with a 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
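
To make two of the abstract's ingredients concrete, the sketch below pairs a simple symmetric INT4 fake quantizer with the standard linear form of centered kernel alignment (CKA), and uses CKA to check how much of the token-to-token structure survives quantization. This is an illustration only, not the authors' implementation: the abstract does not specify the quantizer, the CKA variant, or how the alignment enters the loss, so the per-row scaling, the linear kernel, and the names `fake_quant_int4` and `linear_cka` are all assumptions made for this example.

```python
import math
import random


def fake_quant_int4(values):
    """Symmetric fake quantization (assumed scheme): snap to INT4 levels, then dequantize."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 7.0  # the largest magnitude maps to level 7
    # Clamp the rounded level to the signed 4-bit range [-8, 7].
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def center_columns(matrix):
    """Subtract each column's mean so features are centered across tokens."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - mu for x, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for two row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Toy "visual token" features: 16 tokens with 8 dimensions each (synthetic data).
rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
quantized = [fake_quant_int4(row) for row in tokens]  # one scale per token (assumption)

cka_self = linear_cka(tokens, tokens)      # identical features give a value of about 1.0
cka_quant = linear_cka(tokens, quantized)  # how much relational structure INT4 keeps
print(cka_self, cka_quant)
```

In a distillation setting, a quantity such as `1 - linear_cka(teacher_tokens, student_tokens)` could serve as a relational alignment penalty; whether GRACE uses this form is not stated in the abstract.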

5 Citations

0 Influential

16 Altmetric

85.0 Score
