2604.11103v1 Apr 13, 2026 cs.SD

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

Yike Guo (Citations: 2, h-index: 1)

Wei Xue (Citations: 16, h-index: 1)

Xi Chen (Citations: 193, h-index: 5)

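Role-playing has recently drawn attention as a powerful tool that provides a foundation for human-machine interaction and facilitates sociological research. However, current research is confined to text and does not consider speech, which plays an important role in daily life, and this limits genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench and present ActorMind, a corresponding reasoning framework. Specifically, (1) speech role-playing enables a model to generate natural responses with personalized verbal traits based on the role, the scene, and the spoken dialogue. (2) ActorMindBench is a hierarchically structured benchmark with an utterance level (7,653 utterances), a scene level (313 scenes), and a role level (6 roles). (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in the theater. Concretely, ActorMind first reads its assigned role description through the Eye Agent, then understands emotional cues in the contextual spoken dialogue through the Ear Agent. The Brain Agent then generates a descriptive emotional state, and finally the Mouth Agent delivers the script infused with that emotional state. Experimental results show that ActorMind is effective at improving speech role-playing.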

Original Abstract

Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. (2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via the Eye Agent, then comprehends emotional cues within contextual spoken dialogues through the Ear Agent. Subsequently, the Brain Agent generates a descriptive emotional state, and finally, the Mouth Agent delivers the scripts infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.
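
The four-stage chain described in the abstract (Eye, Ear, Brain, Mouth) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no interfaces, so every name here (`Turn`, `run_actor_chain`, the `stub_*` functions) and every input and output type is a hypothetical chosen for illustration. In a real system each stage would presumably wrap a speech or language model; here each is a plain callable so the data flow is visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    """One spoken turn of context (hypothetical type, not from the paper)."""
    speaker: str
    transcript: str
    audio_path: str  # placeholder for real speech input


# Hypothetical agent signatures; the abstract only names the four stages.
EyeAgent = Callable[[str], str]               # role description -> role notes
EarAgent = Callable[[List[Turn]], List[str]]  # spoken context -> emotional cues
BrainAgent = Callable[[str, List[str]], str]  # notes + cues -> emotional state
MouthAgent = Callable[[str, str], Dict[str, str]]  # script + state -> delivery


def run_actor_chain(
    role_description: str,
    context: List[Turn],
    script: str,
    eye: EyeAgent,
    ear: EarAgent,
    brain: BrainAgent,
    mouth: MouthAgent,
) -> Dict[str, str]:
    """Run the stages in the order given in the abstract."""
    role_notes = eye(role_description)  # 1. read the assigned role
    cues = ear(context)                 # 2. hear emotional cues in the dialogue
    state = brain(role_notes, cues)     # 3. form a descriptive emotional state
    return mouth(script, state)         # 4. deliver the script with that state


# Stub agents so the sketch runs without any model; all behavior is made up.
def stub_eye(description: str) -> str:
    return description.strip()


def stub_ear(context: List[Turn]) -> List[str]:
    return [f"{turn.speaker}: neutral" for turn in context]


def stub_brain(role_notes: str, cues: List[str]) -> str:
    return f"as {role_notes}, after hearing {len(cues)} turn(s): calm"


def stub_mouth(script: str, state: str) -> Dict[str, str]:
    # A real Mouth Agent would synthesize speech; this returns a request dict.
    return {"text": script, "style": state}


if __name__ == "__main__":
    context = [Turn("A", "Hello there.", "a.wav")]
    result = run_actor_chain(
        " Detective ", context, "Where were you last night?",
        stub_eye, stub_ear, stub_brain, stub_mouth,
    )
    print(result)
```

Passing the agents in as callables keeps the ordering logic separate from whatever models back each stage, which is one plausible reading of "off-the-shelf" in the abstract; the paper itself may structure this differently.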

1 Citation

0 Influential

2.5 Altmetric

13.5 Score
