2601.06106v1 Jan 03, 2026 cs.LG

Judge Model for Large-scale Multimodality Benchmarks

Yuhao Wu

Citations: 236

h-index: 6

Min-Han Shih

Citations: 18

h-index: 2

Yu-Wei Chen

Citations: 30

h-index: 2

Original Abstract

We propose a dedicated multimodal Judge Model designed to provide reliable, explainable evaluation across a diverse suite of tasks. Our benchmark spans text, audio, image, and video modalities, drawing from carefully sampled public datasets with fixed seeds to ensure reproducibility and minimize train-test leakage. Rather than only assigning scores, our framework aggregates multimodal judgments, analyzes the quality and reasoning consistency of model outputs, and generates diagnostic feedback. We evaluate several multimodal large language models (MLLMs), including Gemini 2.5, Phi 4, and Qwen 2.5, on 280 multimodal samples and compare the Judge Model's assessments with those of human annotators. Results show strong alignment between the Judge Model and human scores, demonstrating its potential as a scalable, interpretable evaluation pipeline for future multimodal AI research.
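The abstract mentions two mechanics that a short sketch can make concrete: drawing samples with a fixed seed, and checking how closely judge scores track human scores. The Python below is only an illustration under stated assumptions. The function names `sample_fixed_seed` and `pearson`, the seed value, the sample pool, and the toy scores are all hypothetical. The abstract does not say which agreement metric the authors use, so Pearson correlation is a stand-in chosen here, not the paper's method.

```python
import math
import random


def sample_fixed_seed(items, k, seed=0):
    """Draw k items reproducibly using a private RNG seeded with `seed`.

    Hypothetical helper: the paper only states that fixed seeds are used.
    """
    rng = random.Random(seed)  # private RNG, leaves the global state untouched
    return rng.sample(list(items), k)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists.

    Assumed agreement metric; the abstract does not name one.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy pool standing in for a public dataset (made-up identifiers).
    pool = [f"sample_{i}" for i in range(1000)]
    subset = sample_fixed_seed(pool, k=5, seed=42)
    print(subset)

    # Made-up scores, for illustration only.
    judge = [4.0, 3.5, 2.0, 5.0, 1.0]
    human = [4.5, 3.0, 2.5, 5.0, 1.5]
    print(round(pearson(judge, human), 3))
```

Calling `sample_fixed_seed` twice with the same seed returns the same subset, which is the property that makes a sampled benchmark reproducible. A rank-based metric could replace `pearson` if the scores are ordinal.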
