2603.11698v1 Mar 12, 2026 cs.CV

OSCBench: Evaluating Object State Change Performance in Text-to-Video Generation Models

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Shi-Min Hu

Bin Zhu

Franklin Mingzhe Li

Patrick Carrington

Roger Zimmermann

Xia Han

Jingjing Chen

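Text-to-video (T2V) generation models have advanced rapidly in producing visually high-quality, temporally coherent videos. However, existing benchmarks focus mainly on perceptual quality, text-video alignment, or physical plausibility, and an important aspect remains underexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the change in an object's state induced by an action, such as peeling a potato or slicing a lemon. This paper introduces OSCBench, a benchmark designed specifically to evaluate the OSC performance of T2V models. OSCBench is built from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to assess both a model's general performance and its ability to generalize. The study evaluates six representative open-source and proprietary T2V models using a human user study and automatic evaluation based on multimodal large language models (MLLMs). The results show that current T2V models perform strongly on semantic and scene alignment but consistently struggle to produce accurate, temporally consistent object state changes, especially in novel and compositional settings. These results suggest that OSC is a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.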

Original Abstract

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
