2604.07823v1 Apr 09, 2026 cs.CV

LPM 1.0: Video-based Character Performance Model

Ailing Zeng

Citations: 485

h-index: 7

Casper Yang

Citations: 0

h-index: 0

Chauncey Ge

Citations: 0

h-index: 0

Eddie Zhang

Citations: 3

h-index: 1

Garvey Xu

Citations: 0

h-index: 0

Gavin Lin

Citations: 0

h-index: 0

Gilbert Gu

Citations: 0

h-index: 0

Jeremy Pi

Citations: 0

h-index: 0

Mingyi Shi

Citations: 110

h-index: 4

Sheng Bi

Citations: 91

h-index: 3

Steven Tang

Citations: 190

h-index: 6

Thorn Hang

Citations: 0

h-index: 0

Vincent Li

Citations: 15

h-index: 1

Xin Tong

Citations: 37

h-index: 2

Yikang Li

Citations: 154

h-index: 5

Yuchen Sun

Citations: 23

h-index: 2

Yue Zhao

Citations: 36

h-index: 2

Yuwei Li

Citations: 108

h-index: 2

Zan Zhang

Citations: 19

h-index: 2

Zeshi Yang

Citations: 350

h-index: 11

Yuhang Lu

Citations: 36

h-index: 3

Shawn Wang

Citations: 13

h-index: 2

Le Li

Citations: 45

h-index: 4

T. Guo

Citations: 18

h-index: 2

Zi Ye

Citations: 14

h-index: 1

Original Abstract

Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
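The inference flow described in the abstract (listening video driven by user audio, speaking video driven by synthesized audio, a text prompt for motion control, and chunk-by-chunk causal generation) can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the names `Chunk`, `StreamingPerformer`, `step`, and `run`, the `context_len` parameter, and the string placeholder standing in for video frames are all hypothetical, and no real model is called.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Chunk:
    """One time slice of the conversation (hypothetical container)."""
    user_audio: List[float]              # microphone samples for this slice
    agent_audio: Optional[List[float]]   # synthesized speech, or None when the agent is silent


@dataclass
class StreamingPerformer:
    """Stand-in for a causal streaming generator; it only labels chunks."""
    reference_image: str                 # character image used as the identity reference
    context_len: int = 4                 # assumed size of the rolling history window
    history: Deque[str] = field(default_factory=deque)

    def step(self, chunk: Chunk, prompt: str = "") -> Tuple[str, str]:
        # Pick the driving audio: synthesized speech when the agent talks,
        # otherwise the user's audio (the agent listens and reacts).
        if chunk.agent_audio is not None:
            mode, audio = "speak", chunk.agent_audio
        else:
            mode, audio = "listen", chunk.user_audio

        # Placeholder for generated frames: a label built from the mode,
        # the number of audio samples, and the text prompt.
        frames = f"{mode}:{len(audio)}:{prompt}"

        # Causal generation only sees the past, so keep a bounded history
        # and drop the oldest entries once the window is full.
        self.history.append(frames)
        while len(self.history) > self.context_len:
            self.history.popleft()
        return mode, frames


def run(performer: StreamingPerformer,
        chunks: Iterable[Chunk],
        prompt: str = "") -> Iterator[Tuple[str, str]]:
    """Process chunks one at a time, as a live stream would."""
    for chunk in chunks:
        yield performer.step(chunk, prompt)
```

In this sketch the bounded history is what allows a stream of any length to run in constant memory; how Online LPM actually keeps identity stable over long horizons is not specified in the abstract.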

0 Citations

0 Influential

5.5 Altmetric

27.5 Score