2601.22725v1 Jan 30, 2026 cs.CV

OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

Tao Chen

Citations: 22

h-index: 2

Weijie Wang

Citations: 271

h-index: 4

Jin Li

Citations: 105

h-index: 5

Shuai Jiang

Citations: 85

h-index: 5

Jing Luo

Citations: 114

h-index: 3

Chenhui Wu

Citations: 1

h-index: 1

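Recent advances in diffusion models have greatly improved the visual quality of virtual try-on (VTON) systems, but reliable evaluation remains a major challenge. Existing metrics struggle to quantify fine texture detail and semantic consistency, and existing datasets fall short of commercial standards in scale and diversity. This paper proposes OpenVTON-Bench, a large-scale benchmark of approximately 100K high-resolution image pairs (up to 1536×1536 pixels). The dataset uses DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-based dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, the paper proposes a multi-modal protocol that measures five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol combines VLM-based semantic reasoning with a new multi-scale representation metric based on SAM3 segmentation and morphological erosion, so that boundary alignment errors can be separated from internal texture artifacts. Experiments show strong agreement with human judgments (Kendall's τ of 0.833, compared with 0.611 for SSIM), indicating a robust benchmark for VTON evaluation.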

Original Abstract

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
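The abstract does not spell out how the Multi-Scale Representation Metric is computed, so the following is only a minimal sketch of the general idea it names: eroding a garment mask to obtain an interior region, treating the removed band as the boundary, and scoring the two regions separately. Everything here is an assumption for illustration: the function names (`erode`, `split_mask`, `masked_mae`) are hypothetical, masks are plain nested lists rather than SAM3 output, and mean absolute pixel error stands in for whatever representation distance the paper actually uses.

```python
def erode(mask, iterations=1):
    """Binary erosion with a 4-connected (cross) structuring element.

    Pixels outside the image are treated as background, so mask pixels
    on the image border are always eroded.
    """
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        mask = [
            [
                bool(
                    mask[y][x]
                    and y > 0 and mask[y - 1][x]
                    and y < h - 1 and mask[y + 1][x]
                    and x > 0 and mask[y][x - 1]
                    and x < w - 1 and mask[y][x + 1]
                )
                for x in range(w)
            ]
            for y in range(h)
        ]
    return mask


def split_mask(mask, band=1):
    """Split a garment mask into (interior, boundary) regions.

    interior = mask eroded `band` times; boundary = mask minus interior.
    """
    interior = erode(mask, band)
    boundary = [
        [bool(m and not i) for m, i in zip(mask_row, interior_row)]
        for mask_row, interior_row in zip(mask, interior)
    ]
    return interior, boundary


def masked_mae(image_a, image_b, region):
    """Mean absolute difference between two grayscale images inside `region`.

    Stand-in distance (assumption); returns 0.0 for an empty region.
    """
    diffs = [
        abs(pa - pb)
        for row_a, row_b, row_r in zip(image_a, image_b, region)
        for pa, pb, keep in zip(row_a, row_b, row_r)
        if keep
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    size = 6
    # Toy garment mask: a 4x4 square inside a 6x6 image.
    mask = [[1 <= y <= 4 and 1 <= x <= 4 for x in range(size)] for y in range(size)]
    reference = [[0.0] * size for _ in range(size)]
    generated = [row[:] for row in reference]
    generated[1][1] = 1.0  # an error placed on the garment edge

    interior, boundary = split_mask(mask, band=1)
    print("interior error:", masked_mae(generated, reference, interior))
    print("boundary error:", masked_mae(generated, reference, boundary))
```

Scoring the two regions separately means a misaligned garment edge raises only the boundary score, while a smeared print or logo raises only the interior score. Widening `band` moves more of the garment into the boundary region, which is one plausible reading of "multi-scale" but is not confirmed by the abstract.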

1 Citation

0 Influential

2.5 Altmetric

13.5 Score
