2602.07434v1 Feb 07, 2026 cs.RO

Bridging Speech, Emotion, and Motion: a VLM-based Multimodal Edge-deployable Framework for Humanoid Robots

Songhua Yang (citations: 3, h-index: 1)
Xuetao Li (citations: 8, h-index: 2)
Xuan Fei (citations: 21, h-index: 3)
Mengde Li (citations: 22, h-index: 3)
Miao Li (citations: 1, h-index: 1)

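Effective human-robot interaction requires emotionally rich multimodal expression, but most humanoid robots cannot deliver coordinated speech, facial expressions, and gestures. In addition, real-world use requires on-device solutions that can operate autonomously without a continuous cloud connection. To bridge Speech, Emotion, and Motion, this work proposes *SeM$^2$*, a framework built on a vision-language model (VLM). The framework orchestrates emotionally coherent multimodal interaction through three core components: first, a multimodal perception module that captures the user's contextual cues; second, Chain-of-Thought reasoning for response planning; and third, a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between spoken content and physical expression. The authors implement both a cloud-based version and an edge-deployed version (*SeM$^2_e$*); the latter is knowledge-distilled to run efficiently on edge hardware while retaining 95% of the relative performance. Comprehensive evaluation shows that the proposed method substantially outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence, contributing to socially expressive humanoid robots for diverse real-world environments.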

Original Abstract

Effective human-robot interaction requires emotionally rich multimodal expressions, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge Speech, Emotion, and Motion, we present *SeM$^2$*, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module that captures user contextual cues, Chain-of-Thought reasoning for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both cloud-based and edge-deployed versions (*SeM$^2_e$*), with the latter knowledge-distilled to operate efficiently on edge hardware while maintaining 95% of the relative performance. Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence, advancing socially expressive humanoid robotics for diverse real-world environments.
