2602.01538v1 Feb 02, 2026 cs.CV

Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Youliang Zhang (Citations: 49; h-index: 3)
Zhengguang Zhou (Citations: 160; h-index: 7)
Zhentao Yu (Citations: 157; h-index: 5)
Ziyao Huang (Citations: 154; h-index: 5)
Teng Hu (Citations: 124; h-index: 5)
Sen Liang (Citations: 147; h-index: 4)
Guozhen Zhang (Citations: 74; h-index: 5)
Ziqiao Peng (Citations: 35; h-index: 3)
Shunkai Li (Citations: 13; h-index: 2)
Yi Chen (Citations: 1,552; h-index: 7)
Zixiang Zhou (Citations: 1,560; h-index: 8)
Yuan Zhou (Citations: 312; h-index: 10)
Qinglin Lu (Citations: 1,857; h-index: 13)
Xiu Li (Tencent; Citations: 293; h-index: 7)

Abstract

Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io

Citations: 1; Influential: 0; Altmetric: 6.5; Score: 33.5
