2605.15199v1 May 14, 2026 cs.CV

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Ziyan Yang

Citations: 559

h-index: 12

Ruozhen He

Citations: 49

h-index: 3

Meng Wei

Citations: 173

h-index: 4

Vicente Ordonez

Citations: 57

h-index: 4

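Multi-shot video generation extends single-shot generation to deliver coherent visual narratives, but keeping characters, objects, and locations consistent over long sequences remains difficult. Existing evaluations typically rely on independently generated prompt sets that offer limited entity coverage and only simple consistency metrics, which makes standardized comparison hard. This work introduces EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media. EntityBench includes explicit per-shot entity schedules that track characters, objects, and locations, organized into easy, medium, and hard tiers that reach up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps of up to 48 shots. The benchmark comes with a three-pillar evaluation suite that separates intra-shot quality, prompt adherence, and cross-shot consistency, and uses a fidelity gate so that only accurate entity appearances count toward cross-shot scores. As a baseline, the authors propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency in existing methods drops sharply as recurrence distance grows, and that explicit per-entity memory achieves the highest character fidelity (Cohen's d = +2.33) and presence among the evaluated methods. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.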

Original Abstract

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
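
The abstract names three quantities that are easy to misread without a concrete instance: recurrence gaps in a per-shot entity schedule, a fidelity gate in front of cross-shot scoring, and Cohen's d as the reported effect size. The Python sketch below illustrates each under stated assumptions. It is not the authors' implementation: the schedule layout, the function names `recurrence_gaps`, `gated_consistency`, and `cohens_d`, the 0.5 threshold, and the use of cosine similarity are all hypothetical choices made for illustration. The abstract does not specify how EntityBench computes its scores.

```python
import math
from itertools import combinations
from statistics import mean, variance

# Hypothetical schedule: one set of entity names per shot, in shot order.
schedule = [
    {"alice", "kitchen"},
    {"bob", "kitchen"},
    {"bob", "street"},
    {"alice", "street", "red_umbrella"},
]


def recurrence_gaps(schedule):
    """Return, per entity, the gaps (in shots) between consecutive appearances."""
    last_seen = {}
    gaps = {}
    for shot_idx, entities in enumerate(schedule):
        for entity in entities:
            if entity in last_seen:
                gaps.setdefault(entity, []).append(shot_idx - last_seen[entity])
            last_seen[entity] = shot_idx
    return gaps  # entities that appear only once have no entry


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def gated_consistency(appearances, fidelity_threshold=0.5):
    """Mean pairwise cosine similarity over appearances that pass the gate.

    `appearances` is a list of (fidelity, feature_vector) pairs for one entity.
    Appearances whose fidelity falls below the (assumed) threshold are excluded,
    so an inaccurate rendering cannot raise or lower the consistency score.
    Returns None when fewer than two appearances pass the gate.
    """
    kept = [vec for fidelity, vec in appearances if fidelity >= fidelity_threshold]
    if len(kept) < 2:
        return None
    return mean(cosine(u, v) for u, v in combinations(kept, 2))


def cohens_d(group_a, group_b):
    """Cohen's d with the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    print(recurrence_gaps(schedule))
    toy = [(0.9, [1.0, 0.0]), (0.8, [1.0, 0.0]), (0.2, [0.0, 1.0])]
    print(gated_consistency(toy))                          # gate drops the third item
    print(gated_consistency(toy, fidelity_threshold=0.0))  # gate disabled
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

In the toy data, disabling the gate lets the low-fidelity appearance pull the consistency score down, which is the kind of contamination a gate of this sort is meant to prevent. For scale, a Cohen's d of +2.33 means the two group means differ by 2.33 pooled standard deviations.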
